Skip to content

Conversation

corwinjoy
Copy link
Contributor

@corwinjoy corwinjoy commented Jun 10, 2025

Which issue does this PR close?

What changes are included in this PR?

This PR adds support for encryption in DataFusion’s Parquet implementation. The changes introduce new configuration options for file encryption and decryption properties, update various components (including proto conversion, file reading/writing, and tests), and add an end-to-end encrypted Parquet example.

Are these changes tested?

Tests and examples have been added to demonstrate and test functionality versus Parquet modular encryption. These could use feedback since there may be additional DataFusion usage cases that should be covered.

Are there any user-facing changes?

Additional options have been added to allow encryption/decryption configuration. We are soliciting additional feedback on how to handle key columns in a way that best fits the existing API.

Catalog of changes via copilot

Show a summary per file
File Description
docs/source/user-guide/configs.md Added documentation for new encryption and decryption properties.
datafusion/sqllogictest/test_files/information_schema.slt Updated SQL logic tests to include encryption/decryption settings.
datafusion/proto/src/logical_plan/file_formats.rs Added default None values for encryption properties in proto.
datafusion/proto-common/src/from_proto/mod.rs Updated conversion to set encryption properties to None.
datafusion/datasource-parquet/src/source.rs Passed file decryption properties from table options.
datafusion/datasource-parquet/src/opener.rs Disabled page index when encryption is enabled.
datafusion/datasource-parquet/src/file_format.rs Propagated decryption properties in schema and statistics fetching.
datafusion/core/tests/parquet/mod.rs Added the encryption test module reference.
datafusion/core/tests/parquet/encryption.rs Added tests for round-trip encryption.
datafusion/core/tests/parquet/custom_reader.rs Updated custom reader to pass None for decryption properties.
datafusion/core/src/datasource/file_format/parquet.rs Added extra parameters for decryption in metadata/statistics fetching.
datafusion/core/src/dataframe/parquet.rs Added roundtrip test for encrypted Parquet files.
datafusion/common/src/file_options/parquet_writer.rs Updated writer options to support encryption properties.
datafusion/common/Cargo.toml Added the hex dependency.
datafusion-examples/examples/parquet_encrypted.rs Introduced an example that reads/writes encrypted Parquet files.
datafusion-examples/README.md Updated examples list to include the encrypted Parquet demo.
Cargo.toml Enabled encryption feature for the parquet crate.

corwinjoy and others added 30 commits May 29, 2025 19:04
2. Fixed unused header warning in config.rs.
3. Fix test case in encryption.rs to call conversion to ConfigFileDecryption properties correctly.
Add an example to read and write encrypted parquet files.
@corwinjoy corwinjoy force-pushed the parquet_encryption branch from a3f0721 to 6a77842 Compare June 25, 2025 22:12
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jun 26, 2025
store: &dyn ObjectStore,
meta: &ObjectMeta,
size_hint: Option<usize>,
decryption_properties: Option<&FileDecryptionProperties>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a breaking change (and same for fetch_statistics). Should we introduce a new method with this parameter added to avoid breaking the API? It is documented as "subject to change" though so maybe this is ok...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this probably needs to change anyway to handle having an encryption feature.

.unwrap(),
column_specific_options,
key_value_metadata: Default::default(),
crypto: Default::default(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we aren't going to support conversion to protobuf yet, we should probably raise an error in impl TryFrom<&TableParquetOptions> for protobuf::TableParquetOptions if any crypto options are set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we? I see that key_value_metadata is always set to empty here despite being defined in datafusion/proto-common/src/generated/prost.rs. So, it's not clear to me that we should throw an error here.

Use ConfigFileDecryptionProperties in ParquetReadOptions

Signed-off-by: Corwin Joy <[email protected]>
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 26, 2025
@alamb
Copy link
Contributor

alamb commented Jun 27, 2025

I merged up to resolve a conflict

/// See ConfigFileEncryptionProperties and ConfigFileDecryptionProperties in datafusion/common/src/config.rs
/// These can be set via 'format.crypto', for example:
/// ```sql
/// OPTIONS (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @corwinjoy -- this looks wonderful to me

@alamb
Copy link
Contributor

alamb commented Jun 27, 2025

https://github.com/apache/datafusion/actions/runs/15936215612/job/44956521967?pr=16351

That is a new failure for me

     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-19519efa72c14141)
External error: 1 errors in file /Users/runner/work/datafusion/datafusion/datafusion/sqllogictest/test_files/union_by_name.slt

1. query is expected to fail, but actually succeed:
[SQL] select x, y, z from t3 union all by name select z, y, x, 'd' as zz from t3;
at /Users/runner/work/datafusion/datafusion/datafusion/sqllogictest/test_files/union_by_name.slt:343

I'll trigger it again and see if I can reproduce

@alamb alamb merged commit d66d6b9 into apache:main Jun 28, 2025
30 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 28, 2025

Thanks again @corwinjoy / @adamreeve and everyone else. This is great

@corwinjoy
Copy link
Contributor Author

Thanks @alamb much appreciated for the review and helpful feedback! We hope to have a followup PR soon with a config to make encryption optional.

@alamb
Copy link
Contributor

alamb commented Jul 1, 2025

Thank you @corwinjoy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation proto Related to proto crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support integration with Parquet modular encryption
4 participants