
Conversation

@Xuanwo
Member

@Xuanwo Xuanwo commented May 9, 2022

Signed-off-by: Xuanwo [email protected]

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

This PR allows databend-query to use HDFS as a storage backend.

Part of #5215

Changelog

  • New Feature

@vercel

vercel bot commented May 9, 2022

The latest updates on your projects.

1 Ignored Deployment
Name       Status       Updated
databend   ⬜️ Ignored   May 9, 2022 at 4:25AM (UTC)

@mergify
Contributor

mergify bot commented May 9, 2022

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label May 9, 2022
@Xuanwo
Member Author

Xuanwo commented May 9, 2022

@dantengsky Maybe I need to borrow some work from your PR to get Java set up correctly?

@BohuTANG BohuTANG requested a review from dantengsky May 9, 2022 04:09
@BohuTANG
Member

BohuTANG commented May 9, 2022

This PR only makes HDFS a normal storage backend like AWS S3; it's not related to @dantengsky's work on Hive, right?

@Xuanwo
Member Author

Xuanwo commented May 9, 2022

> This PR only makes HDFS a normal storage backend like AWS S3; it's not related to @dantengsky's work on Hive, right?

Yes. Hive's integration should happen in another PR.

Signed-off-by: Xuanwo <[email protected]>

// hdfs storage backend config
#[clap(flatten)]
pub hdfs: HdfsConfig,
Member

If we don't add the HdfsConfig item to *.toml, won't the config deserialization crash?

Member

@BohuTANG BohuTANG May 9, 2022

That's a separate problem: we don't want unused config items listed explicitly:

[storage]
# fs|s3
type = "s3"

[storage.fs]  # <- unused when type = "s3"

[storage.s3]
bucket = "databend"
endpoint_url = "https://s3.amazonaws.com"
access_key_id = "<your-key-id>"
secret_access_key = "<your-access-key>"

[storage.azblob]  # <- unused when type = "s3"

Is it possible to configure only the item we actually use, like [storage.s3] here?
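A minimal config along the lines BohuTANG is asking for might look like the sketch below. This is illustrative only; the values are copied from the example above and only the active backend keeps a section:

```toml
# Only the backend selected by `type` needs a section; the unused
# [storage.fs] / [storage.azblob] / [storage.hdfs] sections are
# simply omitted.
[storage]
type = "s3"

[storage.s3]
bucket = "databend"
endpoint_url = "https://s3.amazonaws.com"
access_key_id = "<your-key-id>"
secret_access_key = "<your-access-key>"
```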

Member Author

> If we don't add the HdfsConfig item to *.toml, won't the config deserialization crash?

If HdfsConfig is not added in *.toml, we will use the default value instead.

I tested this behavior locally: databend-query is able to start without adding any hdfs-related stuff.
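The fallback behavior described above can be sketched roughly as follows. This is a standalone illustration, not Databend's actual config code; the `name_node` field and the `resolve` helper are made up for the example:

```rust
// Hypothetical sketch of "missing config section falls back to defaults".
// Databend's real HdfsConfig has different fields; `name_node` is invented.
#[derive(Debug, Default, PartialEq)]
struct HdfsConfig {
    name_node: String,
}

// If the [storage.hdfs] section is absent from the TOML, deserialization
// yields no value, and we substitute Default instead of crashing.
fn resolve(section: Option<HdfsConfig>) -> HdfsConfig {
    section.unwrap_or_default()
}

fn main() {
    // No [storage.hdfs] section present: startup proceeds with defaults.
    let cfg = resolve(None);
    assert_eq!(cfg, HdfsConfig::default());
    println!("fallback config: {:?}", cfg);
}
```

The key design point is that every backend section derives `Default`, so only the section selected by `type` needs to appear in the file.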

Member

That's great, that resolves my concern, thank you.
I will remove all the unused config items from the documentation.

@dantengsky
Member

> @dantengsky Maybe I need to borrow some work from your PR to get Java set up correctly?

Yeah, currently the hive PR's ut/it are not integrated with the GitHub workflows yet; it's just a local hadoop + hive cluster for testing. There are no build-time dependencies on the JDK/jar files, so a docker image seems able to cover it.

But for this PR, a docker image may not be enough. Hope I got it right:

  • to enable the storage-hdfs feature at compile time,
    we need a JDK (for libjvm.so and some JVM header files)
  • to enable the storage-hdfs feature at runtime (or for ut/it),
    an HDFS cluster (of the specified version) is needed

@Xuanwo
Member Author

Xuanwo commented May 9, 2022

> > @dantengsky Maybe I need to borrow some work from your PR to get Java set up correctly?
>
> Yeah, currently the hive PR's ut/it are not integrated with the GitHub workflows yet; it's just a local hadoop + hive cluster for testing. There are no build-time dependencies on the JDK/jar files, so a docker image seems able to cover it.
>
> But for this PR, a docker image may not be enough. Hope I got it right:
>
> • to enable the storage-hdfs feature at compile time,
>   we need a JDK (for libjvm.so and some JVM header files)
> • to enable the storage-hdfs feature at runtime (or for ut/it),
>   an HDFS cluster (of the specified version) is needed

Thanks for the advice! Maybe we can make databend-query compilable in this PR and test it in the next PR.

Member

@BohuTANG BohuTANG left a comment

👍

@BohuTANG BohuTANG merged commit 4fd1b64 into databendlabs:main May 9, 2022
@Xuanwo Xuanwo deleted the hdfs branch May 9, 2022 05:57
Labels

pr-feature this PR introduces a new feature to the codebase

4 participants