Skip to content

Conversation

@jacobsimpson
Copy link

Which issue does this PR close?

Closes #1706.

Rationale for this change

I'd like to include the Flight service in the reflection service in one of my projects.

What changes are included in this PR?

  • Added generation of the necessary binary data to the protobuf compile steps.
  • Added constants representing the generated binary data in the compiled library.
  • Rustdoc for the new constants.
  • Added tonic-reflection to the dev dependencies so cargo test --doc and cargo test --doc --features "flight-sql-experimental" work.

Are there any user-facing changes?

New constants should be visible to library consumers. Should be backwards compatible.

Testing

Using a simple test service initialised using the reflection code in the rustdocs, grpcurl says:

$ grpcurl --plaintext localhost:50051 list
arrow.flight.protocol.FlightService
grpc.reflection.v1.ServerReflection```

Addresses apache#1706, adding a
constant to the gRPC generated code so it is possible to include the
Flight and Flight SQL APIs in a `tonic-reflection` service.

Testing:

```
grpcurl --plaintext localhost:8082 list
```
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Jan 22, 2025
Addresses apache#1706, adding a
constant to the gRPC generated code so it is possible to include the
Flight and Flight SQL APIs in a `tonic-reflection` service.

Testing:

```
grpcurl --plaintext localhost:8082 list
```
tracing-subscriber = { version = "0.3.1", default-features = false, features = ["ansi", "env-filter", "fmt"] }
tokio = { version = "1.0", default-features = false, features = ["macros", "rt", "rt-multi-thread"] }
tokio-stream = { version = "0.1", features = ["net"] }
tonic-reflection = "0.12.3"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should put the addition in this PR behind a feature flag and mark this dependency as optional?

Copy link
Contributor

@alamb alamb Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please

Update: I see this is only a dev dependency -- so I think it is ok not to feature flag it

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jacobsimpson

When I tried to re-create the checked in binrary (on a mac), it seems to be different

How did you generate the binary in this PR?

$ rm arrow-flight/src/sql/flight_sql_descriptor.bin
$ ./arrow-flight/regen.sh
...
$ git status
On branch add-file-descriptor-set
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   arrow-flight/src/sql/flight_sql_descriptor.bin

(I think this is the root cause of the CI failure as well: https://github.com/apache/arrow-rs/actions/runs/12917873056/job/36070514891?pr=7009)

@jacobsimpson
Copy link
Author

I used ./regen.sh to generate it. I'll have access to a Mac tomorrow and I can try this procedure again and look at the file differences.

In the mean time, following that test procedure locally I see:

$ uname -a
Linux silverfox 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ cd arrow-rs/arrow-flight
$  git status
On branch add-file-descriptor-set
Your branch is up to date with 'origin/add-file-descriptor-set'.

nothing to commit, working tree clean
$ rm src/sql/flight_sql_descriptor.bin
$ ./regen.sh
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.05s
     Running `/home/jsimpson/src/arrow-rs/target/debug/gen`
cargo:rerun-if-changed=../format/Flight.proto
cargo:rerun-if-changed=../format
cargo:rerun-if-changed=../format/FlightSql.proto
cargo:rerun-if-changed=../format
$  git status
On branch add-file-descriptor-set
Your branch is up to date with 'origin/add-file-descriptor-set'.

nothing to commit, working tree clean

@jacobsimpson
Copy link
Author

I used ./regen.sh to generate it. I'll have access to a Mac tomorrow and I can try this procedure again and look at the file differences.

There might be a challenge here. From what I can tell, the file descriptor set is itself a serialized proto message.

protoc --decode_raw < src/flight_descriptor.bin

And serialized proto messages are not intended to be cannonical. The docs have a list of example conditions that can cause the serialization to change, however it is intentionally incomplete.

I'll continue to look into this, however at the moment it appears that we can't depend on one machine being able to produce the same file descriptor sets as another machine.

@alamb
Copy link
Contributor

alamb commented Jan 24, 2025

I'll continue to look into this, however at the moment it appears that we can't depend on one machine being able to produce the same file descriptor sets as another machine.

Can the descriptor be used by a machine on which it wasn't created 🤔

Maybe we would have to always regenerate this file on build (rather than have it checked in)

That comes with its own problems (e.g. increasing the reuqired depenednecies to build arrow-flight)

@jacobsimpson
Copy link
Author

I'll continue to look into this, however at the moment it appears that we can't depend on one machine being able to produce the same file descriptor sets as another machine.

Can the descriptor be used by a machine on which it wasn't created 🤔

It is my understanding that it is not the machine where it was created that determines compatibility, it is the protoc compiler. There descriptor.proto can change even in patch versions. I think, as long as the file descriptors are generated at the same time as the other generated code, it should all work together correctly.

Compatibility alone is not enough to pass the build though. The way I understand it, no two machines/protoc-versions/whatever-additional-unspecified-conditions are guaranteed to produce the same proto serializations. And the diff step means we depend on them being the same, not just compatible.

Maybe we would have to always regenerate this file on build (rather than have it checked in)

That comes with its own problems (e.g. increasing the reuqired depenednecies to build arrow-flight)

According to the CONTRIBUTING documentation, the current solution (where the change author required to run protoc and commit generated files) is there because requiring the toolchain be present for all source code consumers was difficult. So, having the build always generate the proto files is a possibility, however it looks like it would be a behavior regression (so to speak).

Two other possibilities:
2. Change the Github check so those two files are excluded from the 'up-to-date' check. Basically, have faith that if the change author correctly regenerated the other proto files, then the file descriptor set files were also correctly regenerated. I think there would be a lot of trust in this option. However, it would be easy. Something like (untested): git diff --exit-code -- . ':(exclude)*.bin' here
3. Change/add a Github action to regenerate the file descriptor set files (or even all proto files) and add the generated files to the PR. If it's possible to give that behavior to a Github action, I think it would make for an awkward contributor experience, the remote branch would have changes even when the author is the only one working on it.

@alamb
Copy link
Contributor

alamb commented Jan 29, 2025

A third possibility might be to have instructions to run regen.sh in a docker container / etc that had the same protoc compiler version as the github runner

@alamb
Copy link
Contributor

alamb commented Jan 29, 2025

  1. Change the Github check so those two files are excluded from the 'up-to-date' check. Basically, have faith that if the change author correctly regenerated the other proto files, then the file descriptor set files were also correctly regenerated. I think there would be a lot of trust in this option. However, it would be easy. Something like (untested): git diff --exit-code -- . ':(exclude)*.bin' here

This sounds like a pragmatic solution to me -- is there some way we can verify that the checked in descriptor matches the rebuilt one (not byte-for-byte identical but some sort of semantic check 🤔 )

@jacobsimpson
Copy link
Author

  1. Change the Github check so those two files are excluded from the 'up-to-date' check. Basically, have faith that if the change author correctly regenerated the other proto files, then the file descriptor set files were also correctly regenerated. I think there would be a lot of trust in this option. However, it would be easy. Something like (untested): git diff --exit-code -- . ':(exclude)*.bin' here

This sounds like a pragmatic solution to me -- is there some way we can verify that the checked in descriptor matches the rebuilt one (not byte-for-byte identical but some sort of semantic check 🤔 )

I'll do some research/experimentation.

@tustvold tustvold marked this pull request as draft February 20, 2025 10:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add gRPC reflection support to arrow flight

3 participants