[pull] master from apache:master #22
Merged
Conversation
…ld of the hive UTs to info

### What changes were proposed in this pull request?
This PR restores the file appender log level threshold of the Hive UTs from debug to info.

### Why are the changes needed?
Reduce the disk space occupied by `sql/hive/target/unit-tests.log`. After this PR, the file size is reduced from 12G to 119M.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #37976 from LuciferYang/hive-test-loglevel.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
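For context, a threshold like this lives in the module's log4j2 test configuration. Below is a minimal sketch of such a setting, assuming log4j2 properties syntax; the appender name and file path are illustrative, not necessarily the exact entries under `sql/hive`:

```
# Illustrative log4j2 test configuration; appender name and path are assumptions.
appender.file.type = File
appender.file.name = File
appender.file.fileName = target/unit-tests.log
# Restoring this threshold from debug to info is what shrinks unit-tests.log.
appender.file.filter.threshold.type = ThresholdFilter
appender.file.filter.threshold.level = info
```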
…Suite` execution

### What changes were proposed in this pull request?
This PR adds a cleanup of the `sql/hive-thriftserver/spark_derby/` directory after `SparkSQLEnvSuite` execution.

### Why are the changes needed?
Clean up the test directory after the UT runs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manual test:

```
mvn clean install -Phive-thriftserver -pl sql/hive-thriftserver -DskipTests -am
mvn clean install -pl sql/hive-thriftserver -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite -Phive-thriftserver
git status
```

**Before**: residual `sql/hive-thriftserver/spark_derby/` directory
**After**: no directory residue

Closes #37979 from LuciferYang/SPARK-40545.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Yuming Wang <[email protected]>
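A cleanup like this is typically done in the suite's `afterAll` hook. Here is a minimal, self-contained sketch of the recursive delete using plain `java.nio.file`; the helper is illustrative, not the actual suite code, and only the directory name comes from the PR:

```scala
import java.nio.file.{Files, Path, Paths}
import java.util.Comparator

// Delete a directory tree, visiting children before parents.
def deleteRecursively(root: Path): Unit = {
  if (Files.exists(root)) {
    Files.walk(root)
      .sorted(Comparator.reverseOrder[Path]())
      .forEach(p => Files.delete(p))
  }
}

// Remove the Derby directory left behind by SparkSQLEnvSuite.
deleteRecursively(Paths.get("sql/hive-thriftserver/spark_derby"))
```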
…ed Dependencies

### What changes were proposed in this pull request?
This is a draft change of the current state of the Spark Connect prototype, implemented as a driver plugin to separate the classpaths and shade dependent libraries. The benefit of a driver plugin is that the plugin has access to everything inside Spark but can itself be a leaf build. This allows us to shade its own minimal set of dependencies without bringing new dependencies into Spark.

#### How is Spark Connect integrated?
Spark Connect is implemented as a driver plugin for Apache Spark. This means that it can be optionally loaded by specifying the following option during startup.

```
--conf spark.plugins=org.apache.spark.sql.sparkconnect.service.SparkConnectPlugin
```

The way the plugin is integrated into the build system does not require any additional changes, and the properly configured jar files are produced as part of the build.

#### How are GRPC and Guava dependencies handled?
Spark Connect relies on both GRPC and Guava. Since the version of Guava is incompatible with the version used in Spark, the plugin shades this dependency directly as part of the build. In the case of Maven this is straightforward; in the case of SBT it is a bit trickier. For SBT we hook into the `copyDeps` stage of the build and, instead of copying the artifact dependency, we copy the shaded dependency.

### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-39375

### Does this PR introduce _any_ user-facing change?
Experimental API for Spark Connect

### How was this patch tested?
This patch was tested by adding two new test surface areas. On the driver plugin side, this patch adds rudimentary test infrastructure that makes sure that Proto plans can be translated into Spark logical plans. Secondly, there are Python integration tests that cover the DataFrame to Proto conversion and the end-to-end execution of Spark Connect queries. To avoid issues with other parts of the build, only tests that require Spark Connect will be started with the driver plugin.

**Note:**
* The PySpark tests are only supported for regular CPython and are disabled for PyPy.
* `mypy` was initially disabled for this patch to reduce friction. It will be enabled again.

### Caveats
* `mypy` is disabled on the Python prototype until follow-up patches fix the warnings.
* The unit-test coverage in Scala is not good. Follow-up tests will increase coverage, but to keep this PR reasonably sized I did not add more code.

### New Dependencies
This patch adds GRPC, Protobuf, and an updated Guava to the leaf build of Spark Connect. These dependencies are not available to the top-level Spark build and are not available to other Spark projects. To avoid dependency issues, the Maven and SBT builds shade GRPC, Protobuf, and Guava explicitly.

### Follow Ups
Due to the size of the PR, smaller clean-up changes will be submitted as follow-up PRs.

Closes #37710 from grundprinzip/spark-connect-grpc-shaded.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
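For reference, the `SparkPlugin`/`DriverPlugin` API this builds on has the following shape. This is a minimal sketch assuming a driver-only service; the class name and the service started in `init` are placeholders, not the actual Spark Connect internals:

```scala
import java.util.{Collections, Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Hypothetical plugin; the real entry point is
// org.apache.spark.sql.sparkconnect.service.SparkConnectPlugin.
class MyGrpcServicePlugin extends SparkPlugin {

  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // A long-running service (e.g. a GRPC endpoint) started here has full
      // access to the driver-side SparkContext, while its dependencies can
      // stay shaded inside the plugin jar.
      Collections.emptyMap()
    }

    override def shutdown(): Unit = {
      // Stop the service when the driver shuts down.
    }
  }

  // No executor-side component is needed for a driver-only service.
  override def executorPlugin(): ExecutorPlugin = null
}
```

Such a plugin would be enabled by passing its fully qualified class name via `spark.plugins`, exactly like the option shown above.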
### What changes were proposed in this pull request?
This PR aims to upgrade the RocksDB JNI library from 7.5.3 to 7.6.0.

### Why are the changes needed?
This version brings performance improvements (related to reads) and some bug fixes; see [the release notes](https://github.com/facebook/rocksdb/releases/tag/v7.6.0).

(Two release-note screenshots omitted.)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #37985 from panbingkun/upgrade_rocksdbjni.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ples self-contained (FINAL)

### What changes were proposed in this pull request?
This is part of the PySpark docstrings improvement series (#37592, #37662, #37686, #37786, #37797, #37850). In this PR I mainly covered missing parts in the docstrings, adding some more examples where needed. I have also made all examples self-explanatory by providing the DataFrame creation command where it was missing, for clarity to the user. This should complete "my take" on the `functions.py` docstrings and example improvements.

### Why are the changes needed?
To improve the PySpark documentation.

### Does this PR introduce _any_ user-facing change?
Yes, documentation.

### How was this patch tested?

```
PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
./python/run-tests --testnames pyspark.sql.functions
bundle exec jekyll build
```

Closes #37988 from khalidmammadov/docstrings_funcs_part_8.

Authored-by: Khalid Mammadov <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
pull bot pushed a commit that referenced this pull request on Jul 3, 2025:
…pressions in `buildAggExprList`

### What changes were proposed in this pull request?
Trim aliases before matching Sort/Having/Filter expressions with a semantically equal expression from the Aggregate below in `buildAggExprList`.

### Why are the changes needed?
For a query like:

```
SELECT course, year, GROUPING(course) FROM courseSales GROUP BY CUBE(course, year) ORDER BY GROUPING(course)
```

the plan after `ResolveReferences` and before `ResolveAggregateFunctions` looks like:

```
!Sort [cast((shiftright(tempresolvedcolumn(spark_grouping_id#18L, spark_grouping_id, false), 1) & 1) as tinyint) AS grouping(course)#22 ASC NULLS FIRST], true
+- Aggregate [course#19, year#20, spark_grouping_id#18L], [course#19, year#20, cast((shiftright(spark_grouping_id#18L, 1) & 1) as tinyint) AS grouping(course)#21 AS grouping(course)#15]
....
```

Because the aggregate list has an `Alias(Alias(cast((shiftright(spark_grouping_id#18L, 1) & 1) as tinyint))` expression, the `SortOrder` expression won't get matched as semantically equal, and this results in adding an unnecessary `Project`. By stripping inner aliases from the aggregate list (they are going to get removed anyway in `CleanupAliases`), we can match the `SortOrder` expression and resolve it as `grouping(course)#15`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51339 from mihailotim-db/mihailotim-db/fix_inner_aliases_semi_structured.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
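The core of the fix can be illustrated with a toy expression tree. This is a self-contained sketch: the case classes below model, rather than reuse, Catalyst's classes, and `trimAliases` is a hypothetical helper name:

```scala
// Toy model, not Spark's actual Catalyst classes.
sealed trait Expr
case class Alias(child: Expr, name: String) extends Expr
case class Grouping(col: String) extends Expr

// Strip nested aliases so a structural comparison can succeed,
// mirroring what CleanupAliases would do later anyway.
def trimAliases(e: Expr): Expr = e match {
  case Alias(child, _) => trimAliases(child)
  case other           => other
}

// The aggregate list carries a doubly aliased expression, while the
// SortOrder refers to the bare grouping expression.
val aggExpr  = Alias(Alias(Grouping("course"), "grouping(course)#21"), "grouping(course)#15")
val sortExpr = Grouping("course")

assert(trimAliases(aggExpr) == sortExpr) // matches only after trimming
```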
See Commits and Changes for more details.

Created by pull[bot]. Can you help keep this open source service alive? 💖 Please sponsor : )