[SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes #29066

aokolnychyi · 2020-07-10T18:12:15Z

What changes were proposed in this pull request?

This PR implements the core part of the design doc for SPARK-23889.

Note: This PR contains all changes in one place to simplify the review. Once we agree on the approach, I am going to split it into smaller PRs.

Why are the changes needed?

Data sources should be able to request a specific distribution and ordering of data on write. In particular, these scenarios are considered useful:

global sort
cluster data and sort within partitions
local sort within partitions
no sort

Please see the design doc above for a more detailed explanation of requirements.

Does this PR introduce any user-facing change?

This PR introduces public changes to the DS V2 by adding a logical write abstraction as we have on the read path as well as additional interfaces to represent distribution and ordering of data (please see the doc for more info).

Important pieces:

Write - a logical representation of a data source write
RequiresDistributionAndOrdering - a write that requires a specific distribution/ordering
V2Writes - a rule that constructs a logical write and inserts repartition/sort nodes
WriteDistributionAndOrderingSuite - a test case with samples

How was this patch tested?

The patch comes with a new test case.

SparkQA · 2020-07-10T18:19:36Z

Test build #125632 has finished for PR 29066 at commit e5a42ab.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Distributions

SparkQA · 2020-07-11T00:12:33Z

Test build #125634 has finished for PR 29066 at commit 6fea82b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Distributions

dongjoon-hyun · 2020-07-11T20:49:47Z

Retest this please.

SparkQA · 2020-07-12T00:51:33Z

Test build #125690 has finished for PR 29066 at commit 6fea82b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Distributions

dongjoon-hyun · 2020-07-12T16:04:36Z

Hi, @aokolnychyi .
Could you fix the following UT failure?

org.apache.spark.sql.execution.arrow.ArrowConvertersSuite.test Arrow Validator

cc @rdblue since this is DSv2.

aokolnychyi · 2020-07-13T17:08:25Z

The test failure in ArrowConvertersSuite is related to the optimization I did to dedup sorts. I've created another PR to address that separately. Will rebase this one afterwards.

aokolnychyi · 2020-07-14T20:13:13Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

+/**
+ * A rule that constructs [[Write]]s.
+ */
+object V2Writes extends Rule[LogicalPlan] with PredicateHelper {


This rule contains the same logic we had before except it is applied earlier now.

Hmm, care to explain where are these rules before? I don't see anything moved to this new rule.

I think buildAndRun methods in exec nodes still contain the old logic. Previously, it was called run.

SparkQA · 2020-07-15T01:22:28Z

Test build #125856 has finished for PR 29066 at commit d75f0e4.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-15T01:26:18Z

Test build #125857 has finished for PR 29066 at commit 5e4d304.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-20T18:41:50Z

Test build #126190 has finished for PR 29066 at commit 98fb788.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Distributions
case class WriteToDataSourceV2FallbackExec(
case class V2BatchWriteCommand(

aokolnychyi · 2020-07-20T18:50:44Z

retest this please

SparkQA · 2020-07-21T00:36:53Z

Test build #126195 has finished for PR 29066 at commit 98fb788.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Distributions
case class WriteToDataSourceV2FallbackExec(
case class V2BatchWriteCommand(

dongjoon-hyun · 2020-08-06T22:19:18Z

Retest this please.

SparkQA · 2020-08-07T03:03:20Z

Test build #127158 has finished for PR 29066 at commit 98fb788.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Distributions
case class WriteToDataSourceV2FallbackExec(
case class V2BatchWriteCommand(

dongjoon-hyun · 2020-08-07T05:46:34Z

Hi, @aokolnychyi .
I know that you've been waiting for a long time in WIP status. How do you want to proceed this proposal?

Is there any updates from your side?
Or, do you want to remove WIP from now?

viirya · 2020-08-08T23:42:54Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/WriteBuilder.java

+  /**
+   * Returns a logical {@link Write} shared between batch and streaming.
+   */
+  default Write build() {


This API looks like overlapping in function with buildForBatch and buildForStreaming? Which one we should use? build then toBatch/toStream or buildForBatch/buildForStreaming?

The buildForBatch method (and stream equivalent) are already released, so this generic Write implementation makes the new structure, build + toBatch, work for existing sources. It also allows sources to implement the version that they choose. So if none of the features that require the Write are used, I guess they could avoid a mostly-boilerplate class.

Correct, this method was introduced to keep the compatibility.

Have you considered to change the default impl for buildForBatch to:

default BatchWrite buildForBatch() { build().toBatch() }

and also the build() to just return a simple anonymous new Write() {}?

Otherwise, I can see that we'll have the buildForBatch (and similarly buildForStreaming) logic in two different places: WriteBuilder and Write. It is easy to miss one or another.

I am not sure I understood. Could you elaborate a bit more, @sunchao?

Spark will now always call build() and work with the Write abstraction. I added the default implementation so that existing data sources that already implement the current API will continue to work as before. Spark will is not supposed to call buildForBatch after this change.

New data sources should be encouraged to implement only build.

We should probably deprecate the other ones.

What I mean is now we can potentially have two copies of the toBatch implementation: one in WriteBuilder.buildForBatch and one in Write.toBatch, when users start to override build, buildForBatch and buildForStreaming. If moving forward we want build to be the canonical impl, perhaps we can make buildForBatch and buildForStreaming to just call build.toBatch() internally so that users just need to override build.

viirya · 2020-08-09T00:20:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

+    case OverwriteByExpression(relation: DataSourceV2Relation, deleteExpr, query, options, _) =>
+      // fail if any filter cannot be converted. correctness depends on removing all matching data.
+      val filters = splitConjunctivePredicates(deleteExpr).map {
+        filter => DataSourceStrategy.translateFilter(deleteExpr,
+          supportNestedPredicatePushdown = true).getOrElse(
+          throw new AnalysisException(s"Cannot translate expression to source filter: $filter"))
+      }.toArray


By the change, we move catalyst expression -> sources.Filter conversion to logical plans. So we will see both catalyst expressions and sources.expressions in logical plans in optimization.

Is it more clear if we use only catalyst expressions in logical plans, and convert to sources.Filter in physical plans when we need to interact datasources?

And later in V2WriteRequirements, we also need to convert sources.Filter back to catalyst expressions.

I don't think that it is possible to do the conversion later, because we need the write builder to be fully configured to produce a Write in this rule. That way, the Write can expose its order and distribution requirements in the other rule.

I think it is fine to convert to Filter here. That's the public API for filter expressions, so I don't think there is a requirement for it to be used only in physical plans. Filter is already used in the optimizer on the read path as well, because the Scan is similarly built and added to an optimizer plan so that the optimizer can handle stats based on the pushed filters.

And later in V2WriteRequirements, we also need to convert sources.Filter back to catalyst expressions.

I'll take a look at this as well since it sounds odd. I think we should probably keep the original expressions around instead of converting Filter back to Expression.

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

HeartSaVioR · 2020-09-11T03:16:59Z

This is definitely a major missing piece on DSv2 compared to DSv1, as DSv1 writer can deal with Dataframe directly, hence able to do arbitrary changes (including repartition/sort) before doing actual write, like I did for state data source - https://github.com/HeartSaVioR/spark-state-tools/blob/8a74bdb1bc7911a6f71785cf68b784b0a331a1d9/src/main/scala/net/heartsavior/spark/sql/state/StateStoreWriter.scala#L67-L74

Lots of data sources are blocked to migrate to DSv2. Shall we consider prioritizing this?

aokolnychyi · 2020-10-06T19:18:06Z

I'll go through the comments later this week and update the PR.

sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/NullOrdering.java

sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/SortDirection.java

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

aokolnychyi · 2020-12-01T08:44:52Z

@dbtsai, I will rebase this one once PR #30558 is in.

aokolnychyi · 2020-12-01T08:50:32Z

I know deprecating and then removing is usually a better idea and I will be okay evolving read and write path separately. The only concern I have is that while we use these interfaces in the write path here, the concept isn't really write-specific. There will be a chance we will have to move these interfaces from write package breaking things again.

sunchao · 2020-11-25T22:20:07Z

...talyst/src/main/java/org/apache/spark/sql/connector/distributions/ClusteredDistribution.java

Since this is a public interface, do you think we should add some documents for the method?

NVM if you think this is obvious :)

Oh, yeah, docs and annotations should be added for sure. I'll fix while rebasing.

I added more docs. There may be places where we would want a bit more details but I think we can review those parts when I split this PR into smaller chunks.

sunchao · 2020-12-01T21:52:27Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/SortOrder.java

nit: should we add @since 3.1.0 following other existing expressions?

I am not sure whether this will be part of 3.1.0. Once we have clarity, I'll add the annotation.

sunchao · 2020-12-01T21:58:33Z

sql/core/src/main/java/org/apache/spark/sql/connector/write/V1Write.java

Similar to the Write interface, perhaps we should mention that data sources must implement this if it returns V1_BATCH_WRITE capability?

and also the @since tag.

Added a bit of description.

sunchao · 2020-12-02T20:02:33Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/WriteBuilder.java

+  /**
+   * Returns a logical {@link Write} shared between batch and streaming.
+   */
+  default Write build() {


Have you considered to change the default impl for buildForBatch to:

default BatchWrite buildForBatch() { build().toBatch() }

and also the build() to just return a simple anonymous new Write() {}?

Otherwise, I can see that we'll have the buildForBatch (and similarly buildForStreaming) logic in two different places: WriteBuilder and Write. It is easy to miss one or another.

sunchao · 2020-12-02T20:24:25Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

Seems to me that what this rule does it to add the distribution/ordering info. Do we plan to add other functionalities to this in future? is V2Writes too general as a name?

This rule not only inserts shuffle/sort but also build Writes. It is only applied if Write has not been constructed before.

sunchao · 2020-12-03T01:36:09Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

Should we put SupportsOverwrite before SupportsTruncate? if a builder class extends SupportsOverwrite (and isTruncate returns true) then it will be matched by the first clause and call the truncate method, but we want it to call overwrite(filters) right?

This is existing logic, just moved.

The reason for this is that SupportsOverwrite extends SupportsTruncate and calls overwrite(true). Calling truncate ensures that the source can implement either one if it chooses. Sometimes truncate may be preferred, and it is easier for a source to receive that call directly rather than writing its own equivalent of isTruncate.

Got it. I'm just not sure if some data source would extend SupportsOverwrite and decides to override/implement different behavior for overwrite and truncate (which no longer calls overwrite(true), and here we'd pick truncate but what they really want is overwrite.

If the source doesn't implement truncate, then it gets overwrite(true): https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/SupportsOverwrite.java#L46-L48

This logic did not change and should match the previous behavior.

sunchao

Sorry for the sporadic review comments. This PR looks mostly good to me and the main thing concerning me is the newly introduced Distribution interface and how that evolve with the existing one in read package. Happy to see discussions already happened on this and agree that we can move start moving in parallel.

sunchao · 2020-12-03T19:24:54Z

sql/core/src/test/scala/org/apache/spark/sql/connector/WriteDistributionAndOrderingSuite.scala

why is this ignored?

I am not sure whether it is safe to do. @dongjoon-hyun @viirya, what's your take on this?

sunchao · 2020-12-03T19:32:11Z

...e/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala

Does this have to be an option? or it will always be non-empty?

I kept it as Option just in case we don't want to apply the new logic all the time and will introduce a flag to fallback to the old approach. If we are not going to have that flag, we can make this required.

Cool, it would simplify the logic if we know the new code will always be applied. Perhaps worth creating a JIRA to track this.

sunchao · 2020-12-03T19:34:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

+/**
+ * A rule that constructs [[Write]]s.
+ */
+object V2Writes extends Rule[LogicalPlan] with PredicateHelper {


Hmm, care to explain where are these rules before? I don't see anything moved to this new rule.

…writes Lead-authored-by: Anton Okolnychyi <[email protected]> Co-authored-by: Ryan Blue <[email protected]>

aokolnychyi · 2020-12-07T14:54:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/distributions/distributions.scala

+}
+
+private[sql] final case class ClusteredDistributionImpl(
+    clusteringExprs: Seq[Expression]) extends ClusteredDistribution {


I switched to using Seq in fields to avoid reasoning about equality of arrays.

aokolnychyi · 2020-12-07T14:55:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V1FallbackWriters.scala

+    override val write: Option[V1Write] = None) extends V1FallbackWriters {

-  override protected def run(): Seq[InternalRow] = {
-    writeWithV1(newWriteBuilder().buildForV1Write(), refreshCache = refreshCache)


I moved refreshCache to parent.

aokolnychyi · 2020-12-07T15:00:00Z

...e/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala


-  override protected def run(): Seq[InternalRow] = {
-    val writtenRows = writeWithV2(newWriteBuilder().buildForBatch())
-    refreshCache()


Refresh happens in V2ExistingTableWriteExec now.

aokolnychyi · 2020-12-07T15:04:53Z

I've updated this PR and I am ready to split it into smaller mergeable parts. It would be great if everyone could take another look to make sure we are on the same page.

aokolnychyi · 2020-12-07T15:10:30Z

Seems like there is consensus about evolving this API alongside the interfaces in read package. I am not sure whether we need to move new interfaces to write, though. This concept isn't really read or write specific. I'd try to make new interfaces generic enough so that we don't have move things again.

SparkQA · 2020-12-07T16:32:48Z

Test build #132373 has finished for PR 29066 at commit 32f5687.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public class Distributions
trait V2ExistingTableWriteExec extends V2TableWriteExec

viirya · 2020-12-09T06:59:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/connector/expressions/expressions.scala

  }
 }
+
+private[sql] final case class SortValue(


nit: SortValue sounds somehow confusing to me. Affected by the catalyst SortOrder, seems SortOrder sounds better. However you already define SortOrder as interface. Not strong option, can be ignored if you think it's okay.

I am open to alternatives here for sure.

It seems like we are giving synonyms in this file. For example, FieldReference implements NamedReference. Unfortunately, I cannot use SortOrder as it is already taken in the public expression API.

viirya · 2020-12-09T07:16:17Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

+        // the conversion to catalyst expressions above produces SortOrder expressions
+        // for OrderedDistribution and generic expressions for ClusteredDistribution
+        // this allows RepartitionByExpression to pick either range or hash partitioning
+        RepartitionByExpression(distribution, query, numShufflePartitions)


Is it possible the required distribution be changed later by other optimization? The distribution requirement from data source is a hard requirement? Once if the distribution is changed and not matched the requirement, how will data source react to it?

We are inserting repartition/sort nodes directly before writing so my assumption that Spark will only remove them if the incoming plan already satisfies these requirements. WriteDistributionAndOrderingSuite is kind of meant for testing that. Do you have ideas when this assumption will break, @viirya?

Oh, I see. Mis-reading this part. Looks fine. Thanks.

viirya · 2020-12-09T07:25:49Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala

+      // we cannot perform this step in the analyzer since we need to optimize expressions
+      // in nodes like OverwriteByExpression before constructing a logical write


If we resolve it in the analyzer, cannot we optimize the resolved expressions later in the optimizer?

At this step, we construct a Write and pass the overwrite expressions to the data source. Expression optimization must have happened before.

Got it. Thanks for clarifying.

aokolnychyi · 2020-12-09T11:27:23Z

It is a bit hard to keep this large PR up-to-date since it touches many places. As it seems like a reasonable approach, I am going to split the work and submit smaller PRs. We can perform detailed reviews on individual PRs.

aokolnychyi · 2020-12-10T11:59:47Z

The first PR with interfaces only is out.

…ering on write ### What changes were proposed in this pull request? This PR adds connector interfaces proposed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. **Note**: This PR contains a subset of changes discussed in PR #29066. ### Why are the changes needed? Data sources should be able to request a specific distribution and ordering of data on write. In particular, these scenarios are considered useful: - global sort - cluster data and sort within partitions - local sort within partitions - no sort Please see the design doc above for a more detailed explanation of requirements. ### Does this PR introduce _any_ user-facing change? This PR introduces public changes to the DS V2 by adding a logical write abstraction as we have on the read path as well as additional interfaces to represent distribution and ordering of data (please see the doc for more info). The existing `Distribution` interface in `read` package is read-specific and not flexible enough like discussed in the design doc. The current proposal is to evolve these interfaces separately until they converge. ### How was this patch tested? This patch adds only interfaces. Closes #30706 from aokolnychyi/spark-23889-interfaces. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Ryan Blue <[email protected]>

HyukjinKwon · 2020-12-15T01:33:12Z

@aokolnychyi, so #30706 and #30577 were the splits? Yeah, I think splitting is a good approach for a big change like this. Should we close this PR BTW?

aokolnychyi · 2020-12-15T08:55:24Z

Closing this one in favor of smaller PRs.

### What changes were proposed in this pull request? This PR adds logic to build logical writes introduced in SPARK-33779. Note: This PR contains a subset of changes discussed in PR #29066. ### Why are the changes needed? These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30806 from aokolnychyi/spark-33808. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

…red distribution and ordering ### What changes were proposed in this pull request? This PR adds repartition and sort nodes to satisfy the required distribution and ordering introduced in SPARK-33779. Note: This PR contains the final part of changes discussed in PR #29066. ### Why are the changes needed? These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR comes with a new test suite. Closes #31083 from aokolnychyi/spark-34026. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

…red distribution and ordering ### What changes were proposed in this pull request? This PR adds repartition and sort nodes to satisfy the required distribution and ordering introduced in SPARK-33779. Note: This PR contains the final part of changes discussed in PR apache#29066. ### Why are the changes needed? These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR comes with a new test suite. Closes apache#31083 from aokolnychyi/spark-34026. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

…ering on write ### What changes were proposed in this pull request? This PR adds connector interfaces proposed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. **Note**: This PR contains a subset of changes discussed in PR apache#29066. ### Why are the changes needed? Data sources should be able to request a specific distribution and ordering of data on write. In particular, these scenarios are considered useful: - global sort - cluster data and sort within partitions - local sort within partitions - no sort Please see the design doc above for a more detailed explanation of requirements. ### Does this PR introduce _any_ user-facing change? This PR introduces public changes to the DS V2 by adding a logical write abstraction as we have on the read path as well as additional interfaces to represent distribution and ordering of data (please see the doc for more info). The existing `Distribution` interface in `read` package is read-specific and not flexible enough like discussed in the design doc. The current proposal is to evolve these interfaces separately until they converge. ### How was this patch tested? This patch adds only interfaces. Closes apache#30706 from aokolnychyi/spark-23889-interfaces. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Ryan Blue <[email protected]> (cherry picked from commit 82aca7e) Signed-off-by: Dongjoon Hyun <[email protected]>

This PR adds logic to build logical writes introduced in SPARK-33779. Note: This PR contains a subset of changes discussed in PR apache#29066. These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. No. Existing tests.

…red distribution and ordering (apache#905) ### What changes were proposed in this pull request? This PR adds repartition and sort nodes to satisfy the required distribution and ordering introduced in SPARK-33779. Note: This PR contains the final part of changes discussed in PR apache#29066. ### Why are the changes needed? These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR comes with a new test suite. Closes apache#31083 from aokolnychyi/spark-34026. Authored-by: Anton Okolnychyi <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

probot-autolabeler bot added AVRO DSTREAM SQL STRUCTURED STREAMING labels Jul 10, 2020

aokolnychyi force-pushed the spark-23889-wip branch from e5a42ab to 6fea82b Compare July 10, 2020 18:26

aokolnychyi commented Jul 14, 2020

View reviewed changes

dbtsai requested review from cloud-fan and viirya July 16, 2020 21:23

aokolnychyi force-pushed the spark-23889-wip branch from 5e4d304 to 98fb788 Compare July 20, 2020 16:03

viirya reviewed Aug 9, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala Outdated Show resolved Hide resolved

rdblue reviewed Oct 6, 2020

View reviewed changes

sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/NullOrdering.java Outdated Show resolved Hide resolved

rdblue reviewed Oct 6, 2020

View reviewed changes

sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/SortDirection.java Outdated Show resolved Hide resolved

rdblue reviewed Oct 6, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2Writes.scala Outdated Show resolved Hide resolved

aokolnychyi mentioned this pull request Dec 1, 2020

[SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer #30558

Closed

sunchao reviewed Dec 2, 2020

View reviewed changes

sunchao reviewed Dec 3, 2020

View reviewed changes

sunchao reviewed Dec 4, 2020

View reviewed changes

[SPARK-23889][SQL] DataSourceV2: required sorting and clustering for …

32f5687

…writes Lead-authored-by: Anton Okolnychyi <[email protected]> Co-authored-by: Ryan Blue <[email protected]>

aokolnychyi force-pushed the spark-23889-wip branch from f7e8444 to 32f5687 Compare December 7, 2020 14:50

aokolnychyi commented Dec 7, 2020

View reviewed changes

viirya reviewed Dec 9, 2020

View reviewed changes

aokolnychyi mentioned this pull request Dec 10, 2020

[SPARK-33779][SQL] DataSource V2: API to request distribution and ordering on write #30706

Closed

aokolnychyi closed this Dec 15, 2020

aokolnychyi mentioned this pull request Dec 16, 2020

[SPARK-33808][SQL] DataSource V2: Build logical writes in the optimizer #30806

Closed

aokolnychyi mentioned this pull request Jan 7, 2021

[SPARK-34026][SQL] Inject repartition and sort nodes to satisfy required distribution and ordering #31083

Closed

		// we cannot perform this step in the analyzer since we need to optimize expressions
		// in nodes like OverwriteByExpression before constructing a logical write

[SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes #29066

[SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes #29066

Uh oh!

Conversation

aokolnychyi commented Jul 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

SparkQA commented Jul 10, 2020

Uh oh!

SparkQA commented Jul 11, 2020

Uh oh!

dongjoon-hyun commented Jul 11, 2020

Uh oh!

SparkQA commented Jul 12, 2020

Uh oh!

dongjoon-hyun commented Jul 12, 2020

Uh oh!

aokolnychyi commented Jul 13, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Jul 15, 2020

Uh oh!

SparkQA commented Jul 15, 2020

Uh oh!

SparkQA commented Jul 20, 2020

Uh oh!

aokolnychyi commented Jul 20, 2020

Uh oh!

SparkQA commented Jul 21, 2020

Uh oh!

dongjoon-hyun commented Aug 6, 2020

Uh oh!

SparkQA commented Aug 7, 2020

Uh oh!

dongjoon-hyun commented Aug 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue Oct 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

HeartSaVioR commented Sep 11, 2020

Uh oh!

aokolnychyi commented Oct 6, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aokolnychyi commented Dec 1, 2020

Uh oh!

aokolnychyi commented Jul 10, 2020 •

edited

Loading

dongjoon-hyun commented Aug 7, 2020 •

edited

Loading

rdblue Oct 6, 2020 •

edited

Loading

aokolnychyi commented Dec 1, 2020 •

edited

Loading