forked from apache/spark
-
Notifications
You must be signed in to change notification settings - Fork 1
merge upstream changes #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
nchammas
pushed a commit
that referenced
this pull request
Jan 23, 2015
…ommands. Adding support for defining schema in foreign DDL commands. Now foreign DDL support commands like: ``` CREATE TEMPORARY TABLE avroTable USING org.apache.spark.sql.avro OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro") ``` With this PR user can define schema instead of infer from file, so support ddl command as follows: ``` CREATE TEMPORARY TABLE avroTable(a int, b string) USING org.apache.spark.sql.avro OPTIONS (path "../hive/src/test/resources/data/files/episodes.avro") ``` Author: scwf <[email protected]> Author: Yin Huai <[email protected]> Author: Fei Wang <[email protected]> Author: wangfei <[email protected]> Closes apache#3431 from scwf/ddl and squashes the following commits: 7e79ce5 [Fei Wang] Merge pull request #22 from yhuai/pr3431yin 38f634e [Yin Huai] Remove Option from createRelation. 65e9c73 [Yin Huai] Revert all changes since applying a given schema has not been testd. a852b10 [scwf] remove cleanIdentifier f336a16 [Fei Wang] Merge pull request #21 from yhuai/pr3431yin baf79b5 [Yin Huai] Test special characters quoted by backticks. 50a03b0 [Yin Huai] Use JsonRDD.nullTypeToStringType to convert NullType to StringType. 1eeb769 [Fei Wang] Merge pull request #20 from yhuai/pr3431yin f5c22b0 [Yin Huai] Refactor code and update test cases. f1cffe4 [Yin Huai] Revert "minor refactory" b621c8f [scwf] minor refactory d02547f [scwf] fix HiveCompatibilitySuite test failure 8dfbf7a [scwf] more tests for complex data type ddab984 [Fei Wang] Merge pull request #19 from yhuai/pr3431yin 91ad91b [Yin Huai] Parse data types in DDLParser. cf982d2 [scwf] fixed test failure 445b57b [scwf] address comments 02a662c [scwf] style issue 44eb70c [scwf] fix decimal parser issue 83b6fc3 [scwf] minor fix 9bf12f8 [wangfei] adding test case 7787ec7 [wangfei] added SchemaRelationProvider 0ba70df [wangfei] draft version
nchammas
pushed a commit
that referenced
this pull request
Jul 11, 2016
## What changes were proposed in this pull request? This issue add a new optimizer `ReorderAssociativeOperator` by taking advantage of integral associative property. Currently, Spark works like the following. 1) Can optimize `1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + a` into `45 + a`. 2) Cannot optimize `a + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9`. This PR can handle Case 2 for **Add/Multiply** expression whose data types are `ByteType`, `ShortType`, `IntegerType`, and `LongType`. The followings are the plan comparison between `before` and `after` this issue. **Before** ```scala scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(((((((((a#7 + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain == Physical Plan == *Project [(((((((((a#18 * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19] +- Generate explode([1]), false, false, [a#18] +- Scan OneRowRelation[] ``` **After** ```scala scala> sql("select a+1+2+3+4+5+6+7+8+9 from (select explode(array(1)) a)").explain == Physical Plan == WholeStageCodegen : +- Project [(a#7 + 45) AS (((((((((a + 1) + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9)#8] : +- INPUT +- Generate explode([1]), false, false, [a#7] +- Scan OneRowRelation[] scala> sql("select a*1*2*3*4*5*6*7*8*9 from (select explode(array(1)) a)").explain == Physical Plan == *Project [(a#18 * 362880) AS (((((((((a * 1) * 2) * 3) * 4) * 5) * 6) * 7) * 8) * 9)#19] +- Generate explode([1]), false, false, [a#18] +- Scan OneRowRelation[] ``` This PR is greatly generalized by cloud-fan 's key ideas; he should be credited for the work he did. ## How was this patch tested? Pass the Jenkins tests including new testsuite. Author: Dongjoon Hyun <[email protected]> Closes apache#12850 from dongjoon-hyun/SPARK-15076.
nchammas
pushed a commit
that referenced
this pull request
Apr 25, 2022
…aceable ### What changes were proposed in this pull request? This PR uses a manual recursion to replace `RuntimeReplaceable` expressions instead of `transformAllExpressionsWithPruning`. The problem of `transformAllExpressionsWithPruning` is it will automatically make the replacement expression inherit the function alias name from the parent node, which is quite misleading. For example, `select date_part('month', c) from t`, the optimized plan in EXPLAIN before this PR is ``` Project [date_part(cast(c#18 as date)) AS date_part(month, c)#19] +- Relation default.t[c#18] parquet ``` Now it's ``` Project [month(cast(c#9 as date)) AS date_part(month, c)#10] +- Relation default.t[c#9] parquet ``` ### Why are the changes needed? fix misleading EXPLAIN result ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes apache#35821 from cloud-fan/follow2. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
nchammas
pushed a commit
that referenced
this pull request
Sep 17, 2024
…eption ### What changes were proposed in this pull request? This pr reworks the group by map type to fix issues: - Can not bind reference excpetion at runtume since the attribute was wrapped by `MapSort` and we didi not transform the plan with new output - The add `MapSort` rule should be put before `PullOutGroupingExpressions` to avoid complex expr existing in grouping keys ### Why are the changes needed? To fix issues. for example: ``` select map(1, id) from range(10) group by map(1, id); [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000 org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000 at org.apache.spark.SparkException$.internalError(SparkException.scala:92) at org.apache.spark.SparkException$.internalError(SparkException.scala:96) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470) ``` ### Does this PR introduce _any_ user-facing change? no, not released ### How was this patch tested? improve the tests to add more cases ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#47545 from ulysses-you/maptype. Authored-by: ulysses-you <[email protected]> Signed-off-by: youxiduo <[email protected]>
nchammas
pushed a commit
that referenced
this pull request
Jan 29, 2025
…IN-subquery ### What changes were proposed in this pull request? This PR adds code to `RewritePredicateSubquery#apply` to explicitly handle the case where an `Aggregate` node contains an aggregate expression in the left-hand operand of an IN-subquery expression. The explicit handler moves the IN-subquery expressions out of the `Aggregate` and into a parent `Project` node. The `Aggregate` will continue to perform the aggregations that were used as an operand to the IN-subquery expression, but will not include the IN-subquery expression itself. After pulling up IN-subquery expressions into a Project node, `RewritePredicateSubquery#apply` is called again to handle the `Project` as a `UnaryNode`. The `Join` will now be inserted between the `Project` and the `Aggregate` node, and the join condition will use an attribute rather than an aggregate expression, e.g.: ``` Project [col1#32, exists#42 AS (sum(col2) IN (listquery()))apache#40] +- Join ExistenceJoin(exists#42), (sum(col2)#41L = c2#39L) :- Aggregate [col1#32], [col1#32, sum(col2#33) AS sum(col2)#41L] : +- LocalRelation [col1#32, col2#33] +- LocalRelation [c2#39L] ``` `sum(col2)#41L` in the above join condition, despite how it looks, is the name of the attribute, not an aggregate expression. ### Why are the changes needed? The following query fails: ``` create or replace temp view v1(c1, c2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1); create or replace temp view v2(col1, col2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1); select col1, sum(col2) in (select c2 from v1) from v2 group by col1; ``` It fails with this error: ``` [INTERNAL_ERROR] Cannot generate code for expression: sum(input[1, int, false]) SQLSTATE: XX000 ``` With SPARK_TESTING=1, it fails with this error: ``` [PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery in batch RewriteSubquery generated an invalid plan: Special expressions are placed in the wrong plan: Aggregate [col1#11], [col1#11, first(exists#20, false) AS (sum(col2) IN (listquery()))#19] +- Join ExistenceJoin(exists#20), (sum(col2#12) = c2#18L) :- LocalRelation [col1#11, col2#12] +- LocalRelation [c2#18L] ``` The issue is that `RewritePredicateSubquery` builds a `Join` operator where the join condition contains an aggregate expression. The bug is in the handler for `UnaryNode` in `RewritePredicateSubquery#apply`, which adds a `Join` below the `Aggregate` and assumes that the left-hand operand of IN-subquery can be used in the join condition. This works fine for most cases, but not when the left-hand operand is an aggregate expression. This PR moves the offending IN-subqueries to a `Project` node, with the aggregates replaced by attributes referring to the aggregate expressions. The resulting join condition now uses those attributes rather than the actual aggregate expressions. ### Does this PR introduce _any_ user-facing change? No, other than allowing this type of query to succeed. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#48627 from bersprockets/aggregate_in_set_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.