[SPARK-40448][CONNECT] Spark Connect build as Driver Plugin with Shaded Dependencies #37710
Conversation
Initial version that processes the proto messages and converts them into logical plans.
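A purely illustrative sketch of that idea, using stand-in types rather than the real Connect protos or Catalyst classes: walk the proto relation tree and emit the corresponding logical plan nodes.

```scala
// Purely illustrative: ProtoRelation stands in for the generated proto
// messages and LogicalPlanSketch for Catalyst's LogicalPlan; the real
// translation lives in the connect module's planner.
sealed trait ProtoRelation
case class ProtoRead(table: String) extends ProtoRelation
case class ProtoFilter(condition: String, input: ProtoRelation) extends ProtoRelation

sealed trait LogicalPlanSketch
case class ReadTable(table: String) extends LogicalPlanSketch
case class FilterNode(condition: String, child: LogicalPlanSketch) extends LogicalPlanSketch

// Recursively walk the proto tree and emit the corresponding plan nodes.
def transform(rel: ProtoRelation): LogicalPlanSketch = rel match {
  case ProtoRead(table)         => ReadTable(table)
  case ProtoFilter(cond, input) => FilterNode(cond, transform(input))
}
```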
Can one of the admins verify this patch?
Merged to master. I will follow up and actively work on cleanup and follow-up tasks from tomorrow.
Sorry for being late to the party; I haven't had time to look into the code change yet, but my quick glance suggests that the generated files were not regenerated after the proto files changed during review. They look out of sync.
E.g., the generated files still mention databricks, whereas the proto files don't have any notion of databricks.
@grundprinzip Could you please check this and submit a follow-up PR to fix it if I'm not mistaken? Thanks in advance!
Ack, I will regenerate the protos and update.
There's an outstanding comment: #37710 (comment). I am working on this.
```xml
<!-- Add protobuf-maven-plugin and provide ScalaPB as a code generation plugin -->
<plugin>
  <groupId>org.xolstice.maven.plugins</groupId>
  <artifactId>protobuf-maven-plugin</artifactId>
```
@HyukjinKwon @grundprinzip Reporting another issue:
Compiling the connect module on CentOS release 6.3, where the default glibc version is 2.12, fails as follows:
```
[ERROR] PROTOC FAILED: /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
```
I have already filed a JIRA, SPARK-40593. I think at least we should explicitly point out this compilation dependency somewhere.
Thanks @LuciferYang for checking this out.
FYI, I made a PR #38109 to address #37710 (comment)
…nenet

### What changes were proposed in this pull request?

This PR proposes:
1. Fix the code style in `SparkConnectPlugin.scala` and `SparkBuild.scala` to be consistent with others.
2. Rename `data_frame` to `dataframe` to be consistent with existing PySpark codes.

This PR is a sort of followup of #37710

### Why are the changes needed?

To follow the existing codebase and style.

### Does this PR introduce _any_ user-facing change?

No, the codes are not released yet. The only notable change would be renaming `data_frame` to `dataframe` to be consistent.

### How was this patch tested?

CI in this PR should validate the changes.

Closes #38121 from HyukjinKwon/minor-cleanup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ed Dependencies

This is a draft change of the current state of the Spark Connect prototype, implemented as a driver plugin to separate the classpaths and shade dependent libraries. The benefit of a driver plugin is that the plugin has access to everything inside Spark but can itself be a leaf build. This allows us to shade its own minimal set of dependencies without bringing new dependencies into Spark.

Spark Connect is implemented as a driver plugin for Apache Spark. This means that it can be optionally loaded by specifying the following option during startup.

```
--conf spark.plugins=org.apache.spark.sql.sparkconnect.service.SparkConnectPlugin
```

The way the plugin is integrated into the build system does not require any additional changes, and the properly configured jar files are produced as part of the build.

Spark Connect relies on both gRPC and Guava. Since the version of Guava is incompatible with the version used in Spark, the plugin shades this dependency directly as part of the build. In the case of Maven this is very easy; in the case of SBT it's a bit trickier. For SBT we hook into the `copyDeps` stage of the build and, instead of copying the artifact dependency, we copy the shaded dependency.

https://issues.apache.org/jira/browse/SPARK-39375

Experimental API for Spark Connect

This patch was tested by adding two new test surface areas. On the driver plugin side, this patch adds rudimentary test infrastructure that makes sure that proto plans can be translated into Spark logical plans. Secondly, there are Python integration tests that test the DataFrame-to-proto conversion and the end-to-end execution of Spark Connect queries. To avoid issues with other parts of the build, only tests that require Spark Connect will be started with the driver plugin.

**Note:**
* The pyspark tests are only supported for regular CPython and are disabled for pypy.
* `mypy` was initially disabled for this patch to reduce friction. It will be enabled again.
* `mypy` is disabled on the Python prototype until follow-up patches fix the warnings.
* The unit-test coverage in Scala is not good. Follow-up tests will increase coverage, but to keep this patch reasonable I did not add more code.

This patch adds gRPC, Protobuf, and an updated Guava to the leaf build of Spark Connect. These dependencies are not available to the top-level Spark build and are not available to other Spark projects. To avoid dependency issues, the Maven and SBT builds shade gRPC, Protobuf, and Guava explicitly.

Due to the size of the PR, smaller clean-up changes will be submitted as follow-up PRs.

Closes apache#37710 from grundprinzip/spark-connect-grpc-shaded.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This PR aims to remove `(any|empty).proto` from RAT exclusion.

### Why are the changes needed?

The `(any|empty).proto` files were never a part of the Apache Spark repository. Those files were only used in the initial `Connect` PR and removed before merging.
- #37710
- Added: 45c7bc5
- Excluded from RAT check: cf6b19a
- Removed: 4971980

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change.

### How was this patch tested?

Pass the CIs or manual check.

```
$ ./dev/check-license
Ignored 0 lines in your exclusion files as comments or empty lines.
RAT checks passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48837 from dongjoon-hyun/SPARK-50304.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
This is a draft change of the current state of the Spark Connect prototype, implemented as a driver plugin to separate the classpaths and shade dependent libraries. The benefit of a driver plugin is that the plugin has access to everything inside Spark but can itself be a leaf build. This allows us to shade its own minimal set of dependencies without bringing new dependencies into Spark.
How is Spark Connect Integrated?
Spark Connect is implemented as a driver plugin for Apache Spark. This means that it can be optionally loaded by specifying the following option during startup.
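```
--conf spark.plugins=org.apache.spark.sql.sparkconnect.service.SparkConnectPlugin
```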
The way the plugin is integrated into the build system does not require any additional changes, and the properly configured jar files are produced as part of the build.
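For orientation, here is a minimal sketch of what a driver-plugin entry point can look like using Spark's public plugin API (`org.apache.spark.api.plugin`); the class and service names below are stand-ins, not the actual Spark Connect implementation.

```scala
import java.util.{Collections, Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Stand-in for the real gRPC endpoint; the actual service lives in
// connect/src/main/scala/org/apache/spark/sql/connect/service/.
object ConnectServiceStub {
  def start(): Unit = println("starting Spark Connect gRPC endpoint")
  def stop(): Unit = println("stopping Spark Connect gRPC endpoint")
}

class ConnectPluginSketch extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      ConnectServiceStub.start() // bring the service up when the driver starts
      Collections.emptyMap[String, String]()
    }
    override def shutdown(): Unit = ConnectServiceStub.stop()
  }

  // Spark Connect runs entirely on the driver, so no executor-side hook.
  override def executorPlugin(): ExecutorPlugin = null
}
```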
How are GRPC and Guava dependencies handled?
Spark Connect relies on both gRPC and Guava. Since the version of Guava is incompatible with the version used in Spark, the plugin shades this dependency directly as part of the build. In the case of Maven this is very easy; in the case of SBT it's a bit trickier. For SBT we hook into the `copyDeps` stage of the build and, instead of copying the artifact dependency, we copy the shaded dependency (a sketch of this hook follows below).
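A `build.sbt`-style sketch of the idea; the task name and destination layout are hypothetical and do not match Spark's actual `SparkBuild.scala` hook.

```scala
// Hypothetical task: copy the shaded jar where the plain artifact would
// normally be copied, so downstream steps pick up the shaded dependency.
lazy val copyShadedDeps = taskKey[File]("Copy the shaded jar in place of the plain artifact")

copyShadedDeps := {
  val shadedJar = (Compile / packageBin).value // assume this jar was shaded earlier in the build
  val destDir = target.value / "deps"
  IO.createDirectory(destDir)
  val dest = destDir / shadedJar.getName
  IO.copyFile(shadedJar, dest) // shaded artifact replaces the original dependency
  streams.value.log.info(s"Copied shaded dependency to $dest")
  dest
}
```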
Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-39375
Does this PR introduce any user-facing change?
Experimental API for Spark Connect
How was this patch tested?
This patch was tested by adding two new test surface areas. On the driver plugin side, this patch adds rudimentary test infrastructure that makes sure that proto plans can be translated into Spark logical plans. Secondly, there are Python integration tests that test the DataFrame-to-proto conversion and the end-to-end execution of Spark Connect queries.
To avoid issues with other parts of the build, only tests that require Spark Connect will be started with the driver plugin.
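A hedged sketch of the shape such a plan-translation test might take, reusing the `ProtoRelation`/`LogicalPlanSketch` stubs and `transform` from the earlier sketch; the suite and test names are made up for illustration.

```scala
import org.scalatest.funsuite.AnyFunSuite

// Illustrative only: builds a proto plan, runs it through the stub
// translator, and compares the result to the expected plan nodes.
class ProtoToPlanSuiteSketch extends AnyFunSuite {
  test("a proto read translates to a logical read") {
    assert(transform(ProtoRead("t")) == ReadTable("t"))
  }

  test("a proto filter wraps its translated child") {
    val plan = transform(ProtoFilter("id > 1", ProtoRead("t")))
    assert(plan == FilterNode("id > 1", ReadTable("t")))
  }
}
```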
Note:
- `mypy` was initially disabled for this patch to reduce friction. It will be enabled again.
Caveats
- `mypy` is disabled on the Python prototype until follow-up patches fix the warnings.
New Dependencies
This patch adds gRPC, Protobuf, and an updated Guava to the leaf build of Spark Connect. These dependencies are not available to the top-level Spark build and are not available to other Spark projects. To avoid dependency issues, the Maven and SBT builds shade gRPC, Protobuf, and Guava explicitly.
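As an illustration of what such shading amounts to, here is a hedged `build.sbt`-style sketch using sbt-assembly's package relocation; the relocation prefixes below are assumptions for illustration, not the prefixes Spark actually uses.

```scala
// Relocate the conflicting packages into a private namespace so the shaded
// copies cannot clash with the versions Spark itself ships.
import sbtassembly.AssemblyPlugin.autoImport._

assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**"   -> "org.sparkproject.connect.guava.@1").inAll,
  ShadeRule.rename("io.grpc.**"             -> "org.sparkproject.connect.grpc.@1").inAll,
  ShadeRule.rename("com.google.protobuf.**" -> "org.sparkproject.connect.protobuf.@1").inAll
)
```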
Follow Ups
Due to the size of the PR, smaller clean-up changes will be submitted as follow-up PRs.