Conversation

@grundprinzip (Contributor) commented Aug 29, 2022

What changes were proposed in this pull request?

This is a draft change of the current state of the Spark Connect prototype, implemented as a driver plugin to separate the classpaths and shade dependent libraries. The benefit of a driver plugin is that it has access to everything inside Spark but can itself be a leaf build. This allows the plugin to shade its own minimal set of dependencies without bringing new dependencies into Spark.

How is Spark Connect Integrated?

Spark Connect is implemented as a driver plugin for Apache Spark. This means that it can be optionally loaded by specifying the following option during startup.

```
--conf spark.plugins=org.apache.spark.sql.sparkconnect.service.SparkConnectPlugin
```

The way the plugin is integrated into the build system requires no additional changes, and the properly configured JAR files are produced as part of the build.
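For context, here is a minimal sketch of Spark's public driver-plugin API; the class itself and the service-startup comments are illustrative placeholders, not the actual Spark Connect implementation.

```scala
import java.util.Collections
import java.util.{Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Sketch: a driver plugin runs inside the driver JVM and can see everything
// in Spark, while its own jar can shade its own dependencies (GRPC, Guava).
class ExampleConnectPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // Hypothetical placeholder: start the GRPC service that accepts
      // Spark Connect plans here.
      Collections.emptyMap()
    }
    override def shutdown(): Unit = {
      // Hypothetical placeholder: stop the GRPC service here.
    }
  }
  // Only the driver side is needed, so no executor plugin is provided.
  override def executorPlugin(): ExecutorPlugin = null
}
```

Passing such a class name via `spark.plugins` is all that is needed to load it at startup.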

How are GRPC and Guava dependencies handled?

Spark Connect relies on both GRPC and Guava. Since the required Guava version is incompatible with the version used in Spark, the plugin shades this dependency directly as part of the build. In the case of Maven this is straightforward; in the case of SBT it is a bit trickier. For SBT we hook into the copyDeps stage of the build and, instead of copying the artifact dependency, copy the shaded dependency (a sketch of such shading rules follows below).
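As a hedged illustration, this is roughly what such shading rules could look like with the sbt-assembly plugin; the relocation prefixes are assumptions for this sketch, not the exact rules used in this PR.

```scala
// build.sbt (sketch, assuming the sbt-assembly plugin is enabled)
assembly / assemblyShadeRules := Seq(
  // Relocate Guava and GRPC classes so they cannot clash with the
  // versions already on Spark's classpath.
  ShadeRule.rename("com.google.common.**" -> "org.sparkproject.connect.guava.@1").inAll,
  ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.grpc.@1").inAll
)
```

Whatever the exact rules, the effect is the same: the bytecode in the plugin jar references relocated package names, so the Guava on Spark's own classpath remains untouched.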

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-39375

Does this PR introduce any user-facing change?

Yes. This PR adds an experimental API for Spark Connect.

How was this patch tested?

This patch was tested by adding two new test surface areas. On the driver plugin side, it adds rudimentary test infrastructure that makes sure Proto plans can be translated into Spark logical plans. Second, there are Python integration tests that cover the DataFrame-to-Proto conversion and the end-to-end execution of Spark Connect queries.

To avoid issues with other parts of the build, only tests that require Spark Connect will be started with the driver plugin.

Note:

  • The PySpark tests are only supported on regular CPython and are disabled for PyPy.
  • mypy was initially disabled for this patch to reduce friction; it will be enabled again.

Caveats

  • mypy is disabled on the Python prototype until follow-up patches fix the warnings.
  • The unit-test coverage in Scala is not good. Follow-up tests will increase coverage, but to keep this patch reasonably sized I did not add more code.

New Dependencies

This patch adds GRPC, Protobuf, and an updated Guava to the leaf build of Spark Connect. These dependencies are not available to the top-level Spark build and are not available to other Spark projects. To avoid dependency issues, the Maven and SBT builds shade GRPC, Protobuf, and Guava explicitly.

Follow Ups

Due to the size of the PR, smaller clean-up changes will be submitted as follow up PRs.

@grundprinzip grundprinzip marked this pull request as draft August 29, 2022 16:34
@grundprinzip grundprinzip marked this pull request as ready for review August 29, 2022 16:47
@HyukjinKwon HyukjinKwon marked this pull request as draft August 30, 2022 05:02
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

Merged to master.

I will follow up and actively work on the cleanup and follow-up tasks from tomorrow.

@HeartSaVioR (Contributor) left a comment


Sorry for being late to the party; I still haven't had time to look into the code change, but at a quick glance the files generated from proto may not reflect the proto files as changed by review comments. They look out of sync.
e.g. the generated files still mention databricks, whereas the proto files don't have any notion of databricks.

@grundprinzip Could you please check this and submit a follow-up PR to fix it if I'm not mistaken? Thanks in advance!

@grundprinzip (Contributor, Author)

Ack, I will regenerate the protos and update.

@HyukjinKwon (Member)

There's an outstanding comment: #37710 (comment). I am working on this.

```xml
<!-- Add protobuf-maven-plugin and provide ScalaPB as a code generation plugin -->
<plugin>
  <groupId>org.xolstice.maven.plugins</groupId>
  <artifactId>protobuf-maven-plugin</artifactId>
```
@LuciferYang (Contributor) left a comment


@HyukjinKwon @grundprinzip Reporting another issue:

When compiling the connect module on CentOS release 6.3, where the default glibc version is 2.12, compilation fails as follows:

```
[ERROR] PROTOC FAILED: /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
```

I already filed a JIRA, SPARK-40593. I think at least we should explicitly point out this compilation dependency somewhere.

Member


Thanks @LuciferYang for checking this out.

@HyukjinKwon (Member)

FYI, I made a PR #38109 to address #37710 (comment)

dongjoon-hyun pushed a commit that referenced this pull request Oct 6, 2022
…nenet

### What changes were proposed in this pull request?

This PR proposes:
1. Fix the code style in `SparkConnectPlugin.scala` and `SparkBuild.scala` to be consistent with others.
2. Rename `data_frame` to `dataframe` to be consistent with existing PySpark codes.

This PR is a sort of followup of #37710

### Why are the changes needed?

To follow the existing codebase and style.

### Does this PR introduce _any_ user-facing change?

No, the code is not released yet. The only notable change would be renaming `data_frame` to `dataframe` to be consistent.

### How was this patch tested?

CI in this PR should validate the changes.

Closes #38121 from HyukjinKwon/minor-cleanup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
DeZepTup pushed a commit to DeZepTup/spark-custom that referenced this pull request Oct 31, 2022
…ed Dependencies

Closes apache#37710 from grundprinzip/spark-connect-grpc-shaded.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Nov 14, 2024
### What changes were proposed in this pull request?

This PR aims to remove `(any|empty).proto` from RAT exclusion.

### Why are the changes needed?

`(any|empty).proto` files were never a part of Apache Spark repository. Those files were only used in the initial `Connect` PR and removed before merging.
- #37710
  - Added: 45c7bc5
  - Excluded from RAT check: cf6b19a
  - Removed: 4971980

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change.

### How was this patch tested?

Pass the CIs or manual check.

```
$ ./dev/check-license
Ignored 0 lines in your exclusion files as comments or empty lines.
RAT checks passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48837 from dongjoon-hyun/SPARK-50304.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Nov 14, 2024
(cherry picked from commit 33378a6)
Signed-off-by: Dongjoon Hyun <[email protected]>