Conversation

@grundprinzip (Contributor) commented Aug 29, 2022

What changes were proposed in this pull request?

This is a draft change of the current state of the Spark Connect prototype, implemented as a driver plugin to separate the classpaths and shade dependent libraries. The benefit of a driver plugin is that it has access to everything inside Spark but can itself be a leaf build. This allows the plugin to shade its own minimal set of dependencies without bringing new dependencies into Spark.

How is Spark Connect Integrated?

Spark Connect is implemented as a driver plugin for Apache Spark. This means that it can be optionally loaded by specifying the following option during startup.

```
--conf spark.plugins=org.apache.spark.sql.sparkconnect.service.SparkConnectPlugin
```

The way the plugin is integrated into the build system requires no additional changes, and the properly configured JAR files are produced as part of the build.
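For context, here is a minimal sketch of Spark's public driver-plugin API; the class itself and the service-startup comments are illustrative placeholders, not the actual Spark Connect implementation.

```scala
import java.util.Collections
import java.util.{Map => JMap}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Sketch: a driver plugin runs inside the driver JVM and can see everything
// in Spark, while its own jar can shade its own dependencies (GRPC, Guava).
class ExampleConnectPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // Hypothetical placeholder: start the GRPC service that accepts
      // Spark Connect plans here.
      Collections.emptyMap()
    }
    override def shutdown(): Unit = {
      // Hypothetical placeholder: stop the GRPC service here.
    }
  }
  // Only the driver side is needed, so no executor plugin is provided.
  override def executorPlugin(): ExecutorPlugin = null
}
```

Passing such a class name via `spark.plugins` is all that is needed to load it at startup.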

How are GRPC and Guava dependencies handled?

Spark Connect relies on both GRPC and Guava. Since the required Guava version is incompatible with the version used in Spark, the plugin shades this dependency directly as part of the build. In the case of Maven this is straightforward; in the case of SBT it is a bit trickier. For SBT we hook into the copyDeps stage of the build and, instead of copying the artifact dependency, copy the shaded dependency (a sketch of such shading rules follows below).
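As a hedged illustration, this is roughly what such shading rules could look like with the sbt-assembly plugin; the relocation prefixes are assumptions for this sketch, not the exact rules used in this PR.

```scala
// build.sbt (sketch, assuming the sbt-assembly plugin is enabled)
assembly / assemblyShadeRules := Seq(
  // Relocate Guava and GRPC classes so they cannot clash with the
  // versions already on Spark's classpath.
  ShadeRule.rename("com.google.common.**" -> "org.sparkproject.connect.guava.@1").inAll,
  ShadeRule.rename("io.grpc.**" -> "org.sparkproject.connect.grpc.@1").inAll
)
```

Whatever the exact rules, the effect is the same: the bytecode in the plugin jar references relocated package names, so the Guava on Spark's own classpath remains untouched.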

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-39375

Does this PR introduce any user-facing change?

Yes. This PR adds an experimental API for Spark Connect.

How was this patch tested?

This patch was tested by adding two new test surface areas. On the driver plugin side, it adds rudimentary test infrastructure that makes sure Proto plans can be translated into Spark logical plans. Second, there are Python integration tests that cover the DataFrame-to-Proto conversion and the end-to-end execution of Spark Connect queries.

To avoid issues with other parts of the build, only tests that require Spark Connect will be started with the driver plugin.

Note:

  • The PySpark tests are only supported on regular CPython and are disabled for PyPy.
  • mypy was initially disabled for this patch to reduce friction; it will be enabled again.

Caveats

  • mypy is disabled on the Python prototype until follow-up patches fix the warnings.
  • The unit-test coverage in Scala is not good. Follow-up tests will increase coverage, but to keep this patch reasonably sized I did not add more code.

New Dependencies

This patch adds GRPC, Protobuf, and an updated Guava to the leaf build of Spark Connect. These dependencies are not available to the top-level Spark build and are not available to other Spark projects. To avoid dependency issues, the Maven and SBT builds shade GRPC, Protobuf, and Guava explicitly.

Follow Ups

Due to the size of the PR, smaller clean-up changes will be submitted as follow up PRs.

@grundprinzip grundprinzip marked this pull request as draft August 29, 2022 16:34
@grundprinzip grundprinzip marked this pull request as ready for review August 29, 2022 16:47
@HyukjinKwon HyukjinKwon marked this pull request as draft August 30, 2022 05:02
@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon (Member)

Merged to master.

I will follow up and actively work on the cleanup and follow-up tasks from tomorrow.

@HeartSaVioR (Contributor) left a comment


Sorry for being late to the party; I still haven't had time to look into the code change, but at a quick glance the files generated from proto may not reflect the proto files as changed by review comments. They look out of sync.
e.g. the generated files still mention databricks, whereas the proto files don't have any notion of databricks.

@grundprinzip Could you please check this and submit a follow-up PR to fix it if I'm not mistaken? Thanks in advance!

@grundprinzip (Contributor, Author)

Ack, I will regenerate the protos and update.

@HyukjinKwon (Member)

There's an outstanding comment: #37710 (comment). I am working on this.

```xml
<!-- Add protobuf-maven-plugin and provide ScalaPB as a code generation plugin -->
<plugin>
  <groupId>org.xolstice.maven.plugins</groupId>
  <artifactId>protobuf-maven-plugin</artifactId>
```
@LuciferYang (Contributor) left a comment


@HyukjinKwon @grundprinzip Reporting another issue:

When compiling the connect module on CentOS release 6.3, where the default glibc version is 2.12, compilation fails as follows:

```
[ERROR] PROTOC FAILED: /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.18' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.14' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
/home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.5' not found (required by /home/disk0/spark-source/connect/target/protoc-plugins/protoc-3.21.1-linux-x86_64.exe)
```

I already filed a JIRA, SPARK-40593. I think at least we should explicitly point out this compilation dependency somewhere.

Member


Thanks @LuciferYang for checking this out.

@HyukjinKwon (Member)

FYI, I made a PR #38109 to address #37710 (comment)

dongjoon-hyun pushed a commit that referenced this pull request Oct 6, 2022
…nenet

### What changes were proposed in this pull request?

This PR proposes:
1. Fix the code style in `SparkConnectPlugin.scala` and `SparkBuild.scala` to be consistent with others.
2. Rename `data_frame` to `dataframe` to be consistent with existing PySpark codes.

This PR is a sort of followup of #37710

### Why are the changes needed?

To follow the existing codebase and style.

### Does this PR introduce _any_ user-facing change?

No, the code is not released yet. The only notable change would be renaming `data_frame` to `dataframe` to be consistent.

### How was this patch tested?

CI in this PR should validate the changes.

Closes #38121 from HyukjinKwon/minor-cleanup.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
DeZepTup pushed a commit to DeZepTup/spark-custom that referenced this pull request Oct 31, 2022
…ed Dependencies

Closes apache#37710 from grundprinzip/spark-connect-grpc-shaded.

Lead-authored-by: Martin Grund <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Nov 14, 2024
### What changes were proposed in this pull request?

This PR aims to remove `(any|empty).proto` from RAT exclusion.

### Why are the changes needed?

`(any|empty).proto` files were never a part of Apache Spark repository. Those files were only used in the initial `Connect` PR and removed before merging.
- #37710
  - Added: 45c7bc5
  - Excluded from RAT check: cf6b19a
  - Removed: 4971980

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change.

### How was this patch tested?

Pass the CIs or manual check.

```
$ ./dev/check-license
Ignored 0 lines in your exclusion files as comments or empty lines.
RAT checks passed.
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48837 from dongjoon-hyun/SPARK-50304.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Nov 14, 2024
(cherry picked from commit 33378a6)
Signed-off-by: Dongjoon Hyun <[email protected]>