@Xuanwo
Member

@Xuanwo Xuanwo commented Jul 12, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

This PR closes #6103 and #6104.

This PR adds endpoint URL support for COPY. Users can specify the endpoint URL as follows:

COPY INTO mytable
    FROM 's3://mybucket/data.csv'
    CONNECTION = (
        ENDPOINT_URL = 'http://127.0.0.1:9900'
    )
    FILE_FORMAT = (
        type = 'CSV'
        field_delimiter = ','
        record_delimiter = '\n'
        skip_header = 1
    )
    size_limit=10;

This PR also introduces support for all storage backends in Databend. We can now copy data from azblob, hdfs, fs, and so on:

From azblob:

COPY INTO mytable
    FROM 'azblob://mybucket/data.csv'
    CONNECTION = (
        ENDPOINT_URL = 'http://127.0.0.1:9900'
    )
    FILE_FORMAT = (
        type = 'CSV'
        field_delimiter = ','
        record_delimiter = '\n'
        skip_header = 1
    )
    size_limit=10;
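
As a rough illustration of how the URI scheme in the FROM clause could select a storage backend, here is a minimal Python sketch. It is not Databend's actual parser: the `parse_location` helper and the `SUPPORTED_SCHEMES` set are hypothetical, and the scheme list is taken only from the backends named in this description.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Hypothetical list of schemes, limited to the backends named in this PR description.
SUPPORTED_SCHEMES = {"s3", "azblob", "hdfs", "fs"}


def parse_location(location: str) -> Tuple[str, str, str]:
    """Split a COPY location into (scheme, bucket or container, path)."""
    parts = urlsplit(location)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported storage scheme: {parts.scheme!r}")
    # For 's3://mybucket/data.csv', netloc is the bucket and path is the object key.
    return parts.scheme, parts.netloc, parts.path


if __name__ == "__main__":
    print(parse_location("s3://mybucket/data.csv"))
    print(parse_location("azblob://mybucket/data.csv"))
```

The same location string works for every backend; only the scheme changes, which is what lets one COPY statement target different services.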

To prevent users from accessing the internal network, we added a config option for Databend:

[storage]
allow_insecure = false

Users can only use an endpoint that starts with https:// unless allow_insecure has been enabled during deployment.
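
The rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the code in this PR: the `check_endpoint` function is hypothetical, and the case-insensitive prefix match is an assumption.

```python
def check_endpoint(endpoint_url: str, allow_insecure: bool = False) -> None:
    """Reject non-HTTPS endpoints unless allow_insecure is enabled.

    Hypothetical helper: mirrors the rule described above, not Databend's code.
    """
    if allow_insecure:
        # Deployment opted in to insecure endpoints, so nothing to check.
        return
    # Assumption: the scheme is compared case-insensitively.
    if not endpoint_url.lower().startswith("https://"):
        raise PermissionError("copy from insecure storage is not allowed")


if __name__ == "__main__":
    check_endpoint("https://s3.example.com")  # accepted
    check_endpoint("http://127.0.0.1:9900", allow_insecure=True)  # accepted
    try:
        check_endpoint("http://127.0.0.1:9900")
    except PermissionError as err:
        print(f"rejected: {err}")
```

Note that a scheme check alone does not stop an `https://` URL from pointing at an internal address; it only rules out plain-HTTP endpoints.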

Also, in this PR, we unify all connection-related options into CONNECTION:

COPY INTO mytable
    FROM 'azblob://mybucket/data.csv'
    CONNECTION = (
        ENDPOINT_URL = 'http://127.0.0.1:9900'
        ACCESS_KEY_ID = 'access_key_id'
        SECRET_ACCESS_KEY = 'secret_access_key'
    )

CREDENTIALS and ENCRYPTION are still supported for backward compatibility.
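
One way to picture the backward compatibility is to fold the legacy option groups into a single CONNECTION-style mapping. The Python sketch below is an assumption-laden illustration: `unify_connection` is a hypothetical helper, it does not rename any legacy keys, and the precedence (explicit CONNECTION values win over legacy ones) is a guess, not something this PR states.

```python
from typing import Dict, Optional

Options = Dict[str, str]


def unify_connection(
    connection: Optional[Options] = None,
    credentials: Optional[Options] = None,
    encryption: Optional[Options] = None,
) -> Options:
    """Merge legacy CREDENTIALS/ENCRYPTION options into one CONNECTION-style dict."""
    merged: Options = {}
    # Assumption: legacy groups are applied first so CONNECTION overrides them.
    for group in (credentials, encryption, connection):
        for key, value in (group or {}).items():
            # Option names are normalized to upper case, as written in the examples.
            merged[key.upper()] = value
    return merged


if __name__ == "__main__":
    print(
        unify_connection(
            connection={"endpoint_url": "http://127.0.0.1:9900"},
            credentials={"access_key_id": "access_key_id"},
        )
    )
```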

Remaining Work

@mergify mergify bot added the pr-feature this PR introduces a new feature to the codebase label Jul 12, 2022
@Xuanwo Xuanwo linked an issue Jul 12, 2022 that may be closed by this pull request
Signed-off-by: Xuanwo <[email protected]>
@Xuanwo Xuanwo requested review from junnplus and sundy-li July 12, 2022 08:22
@Xuanwo
Member Author

Xuanwo commented Jul 12, 2022

Stateful tests need an update; I will fix them later.

Signed-off-by: Xuanwo <[email protected]>
@BohuTANG
Member

We need a stateful test for the new COPY style :)

@Xuanwo Xuanwo changed the title feat: Add endpoint_url support fot COPY feat: Allow COPY FROM/INTO across different services Jul 12, 2022
@Xuanwo Xuanwo changed the title feat: Allow COPY FROM/INTO across different services feat: Allow COPY FROM/INTO different storage services Jul 12, 2022
Xuanwo added 3 commits July 12, 2022 18:44
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
@Xuanwo Xuanwo requested a review from BohuTANG July 12, 2022 13:52
@BohuTANG
Member

The Dev MacOS / test_stateless_cluster_macos failure is not related to this PR

@Xuanwo
Member Author

Xuanwo commented Jul 13, 2022

@BohuTANG raised some security concerns. Here are my test results:

What happens if I start databend-query with a MinIO instance running at 127.0.0.1:9900?

Copy via http://127.0.0.1:9900

MySQL [(none)]> copy into x from 's3://testbucket/' connection = (endpoint_url='http://127.0.0.1:9900' access_key_id='minioadmin' secret_access_key='minioadmin')  PATTERN = 'ontime_200.csv$' FILE_FORMAT = (type = CSV field_delimiter = ','  record_delimiter = '');
ERROR 1105 (HY000): Code: 3903, displayText = copy from insecure storage is not allowed.

Copy via https://127.0.0.1:9900

MySQL [(none)]> copy into x from 's3://testbucket/' connection = (endpoint_url='https://127.0.0.1:9900' access_key_id='xxxx' secret_access_key='yyyy')  PATTERN = 'ontime_200.csv$' FILE_FORMAT = (type = CSV field_delimiter = ','  record_delimiter = '');
ERROR 1105 (HY000): Code: 4000, displayText = other error (backend error: (context: {"bucket": "testbucket"}, source: sending request: https://127.0.0.1:9900/testbucket: hyper::Error(Connect, Ssl(Error { code: ErrorCode(1), cause: Some(Ssl(ErrorStack([Error { code: 336130315, library: "SSL routines", function: "ssl3_get_record", reason: "wrong version number", file: "ssl/record/ssl3_record.c", line: 331 }]))) }, X509VerifyResult { code: 0, error: "ok" })))).

Improvements that could be done in the future:

  • Omit the detailed error message for connection failures.
  • Treat IP address as insecure.
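
The second improvement could look roughly like the Python sketch below. It is a hypothetical illustration of the idea only, not planned Databend code: `is_ip_endpoint` is an invented name, and it simply flags any endpoint whose host is a literal IPv4 or IPv6 address.

```python
import ipaddress
from urllib.parse import urlsplit


def is_ip_endpoint(endpoint_url: str) -> bool:
    """Return True when the endpoint host is a literal IP address."""
    host = urlsplit(endpoint_url).hostname
    if host is None:
        return False
    try:
        # ip_address accepts both IPv4 and IPv6 literals and raises ValueError otherwise.
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for url in ("https://127.0.0.1:9900", "https://[::1]:9900", "https://s3.example.com"):
        print(url, is_ip_endpoint(url))
```

With such a check in place, the `https://127.0.0.1:9900` request in the second test would be rejected up front instead of reaching the TLS handshake.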

@BohuTANG BohuTANG merged commit c5310ed into databendlabs:main Jul 13, 2022
@Xuanwo Xuanwo deleted the endpoint_url branch July 13, 2022 03:58
@BohuTANG
Member

This is the enhanced version of COPY. We are eliminating data silos in the cloud, so let's document it. cc @soyeric128

@soyeric128
Contributor

I have a few questions about this new feature:

@Xuanwo
Member Author

Xuanwo commented Jul 14, 2022

I have a few questions about this new feature:

I replied in #6620

Development

Successfully merging this pull request may close these issues.

Add endpoint_url so we can copy from external s3 services Support other stage storage types

5 participants