
Conversation

CodingCat
Contributor

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@CodingCat
Contributor Author

Hi @pwendell and @aarondav, is this good to go?

@pwendell
Contributor

LGTM @aarondav?

@aarondav
Contributor

LGTM, merged into master. Thanks!!

asfgit closed this in 6bd2eaa on Mar 13, 2014
@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13146/

CodingCat deleted the SPARK-1160-2 branch on March 17, 2014 at 17:21
jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
update default github

(cherry picked from commit 41ead7a)
Signed-off-by: Reynold Xin <[email protected]>
ericl pushed a commit to ericl/spark that referenced this pull request Dec 21, 2016
## What changes were proposed in this pull request?
Currently streaming tests fail often because multiple SparkContexts are created in a single VM. This is a problem in both DB Spark and Apache Spark. This PR fixes it by making sure the previous SparkContext is always destroyed before a new one is created, and preferably by reusing the existing SparkContext.
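
A minimal sketch of that pattern (illustrative helper names using plain Spark APIs; not the actual refactor in this PR):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Preferred: reuse the active SparkContext if one exists, otherwise create one.
def sharedContext(conf: SparkConf): SparkContext =
  SparkContext.getOrCreate(conf)

// When a test really needs its own context, stop it when the test is done so
// the next test never sees two live SparkContexts in the same JVM.
def withFreshContext[T](conf: SparkConf)(body: SparkContext => T): T = {
  val sc = new SparkContext(conf)
  try body(sc) finally sc.stop()
}
```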

I went kind of overboard on the refactor.

## How was this patch tested?
These are only tests.

Author: Herman van Hovell <[email protected]>

Closes apache#133 from hvanhovell/fix_streaming_tests.
jlopezmalla pushed a commit to jlopezmalla/spark that referenced this pull request Jan 30, 2018
* removed mesos security

* Removed mesos security from HS

* added clear message
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
* table fix

* another effort to fix the table (a successful one)
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Some refactor for cloud-provider-openstack-acceptance-test-e2e-conformance
cloud-fan pushed a commit that referenced this pull request Jul 27, 2020
### What changes were proposed in this pull request?

Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve their children's output ordering information (they inherit from `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sorts in complex queries involving multiple joins.

Example:

```
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") {
  val df1 = spark.range(100).select($"id".as("k1"))
  val df2 = spark.range(100).select($"id".as("k2"))
  val df3 = spark.range(3).select($"id".as("k3"))
  val df4 = spark.range(100).select($"id".as("k4"))
  val plan = df1.join(df2, $"k1" === $"k2")
    .join(df3, $"k1" === $"k3")
    .join(df4, $"k1" === $"k4")
    .queryExecution
    .executedPlan
}
```

Current physical plan (extra sort on `k1` before top sort merge join):

```
*(9) SortMergeJoin [k1#220L], [k4#232L], Inner
:- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0
:  +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
:     :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
:     :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
:     :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128]
:     :  :     +- *(1) Project [id#218L AS k1#220L]
:     :  :        +- *(1) Range (0, 100, step=1, splits=2)
:     :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
:     :     +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134]
:     :        +- *(3) Project [id#222L AS k2#224L]
:     :           +- *(3) Range (0, 100, step=1, splits=2)
:     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#141]
:        +- *(5) Project [id#226L AS k3#228L]
:           +- *(5) Range (0, 3, step=1, splits=2)
+- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k4#232L, 5), true, [id=#148]
      +- *(7) Project [id#230L AS k4#232L]
         +- *(7) Range (0, 100, step=1, splits=2)
```

Ideal physical plan (no extra sort on `k1` before top sort merge join):

```
*(9) SortMergeJoin [k1#220L], [k4#232L], Inner
:- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
:  :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
:  :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
:  :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127]
:  :  :     +- *(1) Project [id#218L AS k1#220L]
:  :  :        +- *(1) Range (0, 100, step=1, splits=2)
:  :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
:  :     +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133]
:  :        +- *(3) Project [id#222L AS k2#224L]
:  :           +- *(3) Range (0, 100, step=1, splits=2)
:  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#140]
:     +- *(5) Project [id#226L AS k3#228L]
:        +- *(5) Range (0, 3, step=1, splits=2)
+- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k4#232L, 5), true, [id=#146]
      +- *(7) Project [id#230L AS k4#232L]
         +- *(7) Range (0, 100, step=1, splits=2)
```
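
To see the difference locally, one can count the `SortExec` nodes in the executed plan. The sketch below is illustrative only (it assumes a local `SparkSession` with adaptive execution disabled; it is not part of the patch):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SortExec

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// Keep the plan non-adaptive and force the small broadcast threshold from the example.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "50")

val df1 = spark.range(100).select($"id".as("k1"))
val df2 = spark.range(100).select($"id".as("k2"))
val df3 = spark.range(3).select($"id".as("k3"))
val df4 = spark.range(100).select($"id".as("k4"))

val plan = df1.join(df2, $"k1" === $"k2")
  .join(df3, $"k1" === $"k3")
  .join(df4, $"k1" === $"k4")
  .queryExecution.executedPlan

// Per the plans above: 4 SortExec nodes without the fix, 3 with it
// (the extra Sort on k1 above the broadcast join goes away).
println(plan.collect { case s: SortExec => s }.size)
```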

### Why are the changes needed?

To avoid unnecessary sorts in queries; this has the most impact when users read sorted bucketed tables.
Although the unnecessary sort operates on already-sorted data, it can have an obvious negative impact on IO and query run time if the data is large and external sorting kicks in.
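
As a rough illustration of that bucketed case (a sketch for spark-shell; table names are made up, and whether a Sort still appears depends on the Spark version and this fix):

```scala
// Both sides are bucketed and sorted on the join key, so the scans already
// produce sorted output; explain() shows whether a Sort still gets inserted.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.range(100).withColumnRenamed("id", "k")
  .write.bucketBy(8, "k").sortBy("k").saveAsTable("t1")
spark.range(100).withColumnRenamed("id", "k")
  .write.bucketBy(8, "k").sortBy("k").saveAsTable("t2")
spark.table("t1").join(spark.table("t2"), "k").explain()
```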

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite`.

Closes #29181 from c21/ordering.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Sep 17, 2025
…nProtoConverter

### What changes were proposed in this pull request?

This PR refactors the `LiteralExpressionProtoConverter` to use `CatalystTypeConverters` for consistent type conversion, eliminating code duplication and improving maintainability.
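
As a rough sketch of the conversion path this relies on (a hypothetical helper built on public Catalyst utilities, not the actual implementation in this PR):

```scala
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.DataType

// Convert an external Scala value to Catalyst's internal representation for the
// given data type, then wrap it in a Literal expression.
def toCatalystLiteral(value: Any, dataType: DataType): Literal = {
  val toCatalyst = CatalystTypeConverters.createToCatalystConverter(dataType)
  Literal(toCatalyst(value), dataType)
}
```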

**Key changes:**
1. **Simplified `LiteralExpressionProtoConverter.toCatalystExpression()`**: Replaced the large switch statement (86 lines) with a clean 3-line implementation that leverages existing conversion utilities
2. **Added TIME type support**: Added missing TIME literal type conversion in `LiteralValueProtoConverter.toScalaValue()`

### Why are the changes needed?

1. **Type conversion issues**: Some complex nested data structures (e.g., arrays of case classes) failed to convert to Catalyst's internal representation when using `expressions.Literal.create(...)`.
2. **Inconsistent behaviors**: Differences in behavior between Spark Connect and classic Spark for the same data types (e.g., Decimal).

### Does this PR introduce _any_ user-facing change?

**Yes** - Users can now successfully use `typedLit` with an array of case classes. Previously, attempting to use `typedlit(Array(CaseClass(1, "a")))` would fail (see the code snippet below for details), but now it works correctly and converts case classes to proper struct representations.

```scala
import org.apache.spark.sql.functions.typedlit
case class CaseClass(a: Int, b: String)
spark.sql("select 1").select(typedlit(Array(CaseClass(1, "a")))).collect()

// Below is the error message:
"""
org.apache.spark.SparkIllegalArgumentException: requirement failed: Literal must have a corresponding value to array<struct<a:int,b:string>>, but class GenericArrayData found.
  scala.Predef$.require(Predef.scala:337)
  org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:306)
  org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:456)
  org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:206)
  org.apache.spark.sql.connect.planner.LiteralExpressionProtoConverter$.toCatalystExpression(LiteralExpressionProtoConverter.scala:103)
"""
```

In addition, some Catalyst values (e.g., Decimal 89.97620 -> 89.976200000000000000) have changed; please see the changes in `explain-results/` for details.
```scala
import org.apache.spark.sql.functions.typedlit

spark.sql("select 1").select(typedlit(BigDecimal(8997620, 5)),typedlit(Array(BigDecimal(8997620, 5), BigDecimal(8997621, 5)))).explain()
// Current explain() output:
"""
Project [89.97620 AS 89.97620#819, [89.97620,89.97621] AS ARRAY(89.97620BD, 89.97621BD)#820]
"""
// Expected explain() output (i.e., same as the classic mode):
"""
Project [89.976200000000000000 AS 89.976200000000000000#132, [89.976200000000000000,89.976210000000000000] AS ARRAY(89.976200000000000000BD, 89.976210000000000000BD)#133]
"""
```
### How was this patch tested?

`build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"`
`build/sbt "connect/testOnly org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.4.5

Closes #52188 from heyihong/SPARK-53438.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>