Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
66 commits
Select commit Hold shift + click to select a range
e668b99
Initial commit to form PR for datafusion encryption support
corwinjoy May 30, 2025
d38dba4
Add tests for encryption configuration
corwinjoy May 30, 2025
5a2b456
Apply cargo fmt
corwinjoy May 30, 2025
c972676
Add a roundtrip encryption test to the parquet tests.
corwinjoy May 30, 2025
ec3f828
cargo fmt
corwinjoy May 30, 2025
3538a27
Update test to add decryption parameter to called functions.
corwinjoy May 31, 2025
a754992
Try to get DataFrame.write_parquet to work with encryption. Doesn't q…
corwinjoy Jun 3, 2025
e430672
Update datafusion/datasource-parquet/src/opener.rs
corwinjoy Jun 4, 2025
7fcba70
Update datafusion/datasource-parquet/src/source.rs
corwinjoy Jun 4, 2025
d6b1fca
Fix write test in parquet.rs
corwinjoy Jun 4, 2025
3353186
Simplify encryption test. Remove unused imports.
corwinjoy Jun 4, 2025
e4bc0e3
Run cargo fmt.
corwinjoy Jun 4, 2025
f52e79c
Further streamline roundtrip test.
corwinjoy Jun 4, 2025
5615ac8
Change From methods for FileEncryptionProperties and FileDecryptionPr…
corwinjoy Jun 4, 2025
61bc78e
Change encryption config to directly hold column keys using custom co…
corwinjoy Jun 5, 2025
a81855f
Fix generated field names in visit for encryptor and decryptor to use…
corwinjoy Jun 5, 2025
4cf12b3
1. Disable parallel writes with enccryption.
corwinjoy Jun 6, 2025
f29bec3
cargo fmt
corwinjoy Jun 6, 2025
86fe04b
Update datafusion/common/src/file_options/parquet_writer.rs
corwinjoy Jun 6, 2025
d4ea63f
fix variables shown in information schema test.
corwinjoy Jun 6, 2025
0fcc4a5
Merge remote-tracking branch 'origin/parquet_encryption' into parquet…
corwinjoy Jun 6, 2025
86db3a5
Backout bad suggestion from copilot
corwinjoy Jun 6, 2025
b34441a
Remove unused serde reference
corwinjoy Jun 6, 2025
668d728
cargo fmt
corwinjoy Jun 6, 2025
ec1e8da
change file_format.rs to use global encryption options in struct.
corwinjoy Jun 9, 2025
e233408
Turn off page_index for encrypted example. Get encrypted example work…
corwinjoy Jun 9, 2025
9ffaae4
Tidy up example output.
corwinjoy Jun 9, 2025
8e244e9
Add missing license. Run taplo format
corwinjoy Jun 9, 2025
2871d51
Update configs.md by running dev/update_config_docs.sh
corwinjoy Jun 9, 2025
c405167
Cargo fmt + clippy changes.
corwinjoy Jun 9, 2025
506801e
Add filter test for encrypted files.
corwinjoy Jun 9, 2025
3058a90
Cargo clippy changes.
corwinjoy Jun 10, 2025
e7e521a
Merge remote-tracking branch 'origin/main' into parquet_encryption
corwinjoy Jun 10, 2025
bbeecfe
Fix link in README.md
corwinjoy Jun 10, 2025
4ceb072
Add issue tag for parallel writes.
corwinjoy Jun 10, 2025
c998378
Move file encryption and decryption properties out of global options
adamreeve Jun 16, 2025
7780b33
Use config_namespace_with_hashmap for column encryption/decryption props
adamreeve Jun 16, 2025
219d0b3
Merge pull request #5 from adamreeve/crypto_config_namespace
corwinjoy Jun 18, 2025
3af85bc
Remove outdated docs on crypto settings.
corwinjoy Jun 25, 2025
3255293
1. Add docs for using encryption configuration.
corwinjoy Jun 25, 2025
6a77842
Merge branch 'main' of github.com:corwinjoy/datafusion into parquet_e…
corwinjoy Jun 25, 2025
9972ecb
Update code to add missing ParquetOpener parameter due to merge from …
corwinjoy Jun 25, 2025
068e65d
Add CLI documentation for Parquet options and provide an encryption e…
corwinjoy Jun 26, 2025
ecd5f93
Use ConfigFileDecryptionProperties in ParquetReadOptions
adamreeve Jun 26, 2025
d79fdf2
Merge pull request #6 from adamreeve/parquet_encryption_fix
corwinjoy Jun 26, 2025
f3e6945
Implement default for ConfigFileEncryptionProperties
corwinjoy Jun 26, 2025
a7005a3
Add sqllogictest for parquet with encryption
corwinjoy Jun 26, 2025
48166f4
Apply prettier changes from CI
corwinjoy Jun 26, 2025
29821d4
Begin adding config for encryption.
corwinjoy Jun 27, 2025
6add82c
Merge remote-tracking branch 'origin/main' into config_encryption
corwinjoy Jun 30, 2025
0129130
Fix merge errors.
corwinjoy Jun 30, 2025
511c94b
Fix config dependency in datafusion/core/Cargo.toml
corwinjoy Jul 1, 2025
efd38d1
ci reformat
corwinjoy Jul 1, 2025
69f7c57
Fix missing Cargo dependency.
corwinjoy Jul 1, 2025
81375dc
Update Cargo.toml
corwinjoy Jul 2, 2025
7fcfd4d
Merge branch 'main' into config_encryption
corwinjoy Jul 2, 2025
5f0f1b0
Revert bad suggestion from copilot
corwinjoy Jul 2, 2025
6de2af0
taplo format Cargo.toml
corwinjoy Jul 2, 2025
464e74d
1. Add CI test for parquet_encryption. 2. Document new configuration
corwinjoy Jul 3, 2025
b0cad10
Refactor cfg parquet_encryption feature into a few functions defined …
corwinjoy Jul 4, 2025
321596f
Remove #allow(unusued) that is no longer needed
corwinjoy Jul 4, 2025
411d0a5
cargo fmt
corwinjoy Jul 4, 2025
3414141
Fix error when cfg parquet_encryption is disabled.
corwinjoy Jul 4, 2025
842aa89
cargo fmt
corwinjoy Jul 4, 2025
60d37ac
Remove unused mut.
corwinjoy Jul 10, 2025
cca1cd5
Merge branch 'main' into config_encryption
corwinjoy Jul 10, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .github/workflows/rust.yml
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,8 @@ jobs:
run: cargo check --profile ci --no-default-features -p datafusion --features=string_expressions
- name: Check datafusion (unicode_expressions)
run: cargo check --profile ci --no-default-features -p datafusion --features=unicode_expressions
- name: Check parquet encryption (parquet_encryption)
run: cargo check --profile ci --no-default-features -p datafusion --features=parquet_encryption

# Check datafusion-functions crate features
#
Expand Down
2 changes: 2 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,7 @@ env_logger = "0.11"
futures = "0.3"
half = { version = "2.6.0", default-features = false }
hashbrown = { version = "0.14.5", features = ["raw"] }
hex = { version = "0.4.3" }
indexmap = "2.10.0"
itertools = "0.14"
log = "^0.4"
Expand Down
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,7 @@ Default features:
- `datetime_expressions`: date and time functions such as `to_timestamp`
- `encoding_expressions`: `encode` and `decode` functions
- `parquet`: support for reading the [Apache Parquet] format
- `parquet_encryption`: support for using [Parquet Modular Encryption]
- `regex_expressions`: regular expression functions, such as `regexp_match`
- `unicode_expressions`: Include unicode aware functions such as `character_length`
- `unparser`: enables support to reverse LogicalPlans back into SQL
Expand All @@ -134,6 +135,7 @@ Optional features:

[apache avro]: https://avro.apache.org/
[apache parquet]: https://parquet.apache.org/
[parquet modular encryption]: https://parquet.apache.org/docs/file-format/data-pages/encryption/

## DataFusion API Evolution and Deprecation Guidelines

Expand Down
8 changes: 7 additions & 1 deletion datafusion/common/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -40,9 +40,15 @@ name = "datafusion_common"
[features]
avro = ["apache-avro"]
backtrace = []
parquet_encryption = [
"parquet",
"parquet/encryption",
"dep:hex",
]
pyarrow = ["pyo3", "arrow/pyarrow", "parquet"]
force_hash_collisions = []
recursive_protection = ["dep:recursive"]
parquet = ["dep:parquet"]

[dependencies]
ahash = { workspace = true }
Expand All @@ -58,7 +64,7 @@ base64 = "0.22.1"
chrono = { workspace = true }
half = { workspace = true }
hashbrown = { workspace = true }
hex = "0.4.3"
hex = { workspace = true, optional = true }
indexmap = { workspace = true }
libc = "0.2.174"
log = { workspace = true }
Expand Down
18 changes: 8 additions & 10 deletions datafusion/common/src/config.rs
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@

use arrow_ipc::CompressionType;

#[cfg(feature = "parquet_encryption")]
use crate::encryption::{FileDecryptionProperties, FileEncryptionProperties};
use crate::error::_config_err;
use crate::parsers::CompressionTypeVariant;
use crate::utils::get_available_parallelism;
Expand All @@ -29,12 +31,8 @@ use std::error::Error;
use std::fmt::{self, Display};
use std::str::FromStr;

#[cfg(feature = "parquet")]
#[cfg(feature = "parquet_encryption")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can move all the crypto code into its own module, so we only have to #cfg the module rather than so many individual lines 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to do this entirely, but I have refactored most of this into methods in encryption.rs, which cleans this up more.

use hex;
#[cfg(feature = "parquet")]
use parquet::encryption::decrypt::FileDecryptionProperties;
#[cfg(feature = "parquet")]
use parquet::encryption::encrypt::FileEncryptionProperties;

/// A macro that wraps a configuration struct and automatically derives
/// [`Default`] and [`ConfigField`] for it, allowing it to be used
Expand Down Expand Up @@ -2148,7 +2146,7 @@ impl ConfigField for ConfigFileEncryptionProperties {
}
}

#[cfg(feature = "parquet")]
#[cfg(feature = "parquet_encryption")]
impl From<ConfigFileEncryptionProperties> for FileEncryptionProperties {
fn from(val: ConfigFileEncryptionProperties) -> Self {
let mut fep = FileEncryptionProperties::builder(
Expand Down Expand Up @@ -2194,7 +2192,7 @@ impl From<ConfigFileEncryptionProperties> for FileEncryptionProperties {
}
}

#[cfg(feature = "parquet")]
#[cfg(feature = "parquet_encryption")]
impl From<&FileEncryptionProperties> for ConfigFileEncryptionProperties {
fn from(f: &FileEncryptionProperties) -> Self {
let (column_names_vec, column_keys_vec, column_metas_vec) = f.column_keys();
Expand Down Expand Up @@ -2308,7 +2306,7 @@ impl ConfigField for ConfigFileDecryptionProperties {
}
}

#[cfg(feature = "parquet")]
#[cfg(feature = "parquet_encryption")]
impl From<ConfigFileDecryptionProperties> for FileDecryptionProperties {
fn from(val: ConfigFileDecryptionProperties) -> Self {
let mut column_names: Vec<&str> = Vec::new();
Expand Down Expand Up @@ -2342,7 +2340,7 @@ impl From<ConfigFileDecryptionProperties> for FileDecryptionProperties {
}
}

#[cfg(feature = "parquet")]
#[cfg(feature = "parquet_encryption")]
impl From<&FileDecryptionProperties> for ConfigFileDecryptionProperties {
fn from(f: &FileDecryptionProperties) -> Self {
let (column_names_vec, column_keys_vec) = f.column_keys();
Expand Down Expand Up @@ -2688,7 +2686,7 @@ mod tests {
);
}

#[cfg(feature = "parquet")]
#[cfg(feature = "parquet_encryption")]
#[test]
fn parquet_table_encryption() {
use crate::config::{
Expand Down
76 changes: 76 additions & 0 deletions datafusion/common/src/encryption.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

// Support optional features for encryption in Parquet files.
//! This module provides types and functions related to encryption in Parquet files.
#[cfg(feature = "parquet_encryption")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than having so may cfg lines in this file, you could have the module implemented in two separate files, one for when encryption is enabled and one for when it's disabled.

Eg. see https://github.com/apache/arrow-rs/blob/ff3a2f2c59f0355f8afedb3e9258e1d6307f21ae/parquet/src/column/mod.rs#L121-L125

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. That's an interesting idea, but I will stick with the current approach for now.

  1. The file and functions are relatively simple, at under 100 lines long.
  2. Having two separate files, which may or may not be used, is somewhat confusing.
  3. I don't like the idea of having to maintain the same API in two separate places/files. Here, the two versions are displayed side by side, making it straightforward to update them simultaneously.

pub use parquet::encryption::decrypt::FileDecryptionProperties;
#[cfg(feature = "parquet_encryption")]
pub use parquet::encryption::encrypt::FileEncryptionProperties;

#[cfg(not(feature = "parquet_encryption"))]
pub struct FileDecryptionProperties;
#[cfg(not(feature = "parquet_encryption"))]
pub struct FileEncryptionProperties;

#[cfg(feature = "parquet")]
use crate::config::ParquetEncryptionOptions;
pub use crate::config::{ConfigFileDecryptionProperties, ConfigFileEncryptionProperties};
#[cfg(feature = "parquet")]
use parquet::file::properties::WriterPropertiesBuilder;

#[cfg(feature = "parquet")]
pub fn add_crypto_to_writer_properties(
#[allow(unused)] crypto: &ParquetEncryptionOptions,
#[allow(unused_mut)] mut builder: WriterPropertiesBuilder,
) -> WriterPropertiesBuilder {
#[cfg(feature = "parquet_encryption")]
if let Some(file_encryption_properties) = &crypto.file_encryption {
builder = builder
.with_file_encryption_properties(file_encryption_properties.clone().into());
}
builder
}

#[cfg(feature = "parquet_encryption")]
pub fn map_encryption_to_config_encryption(
encryption: Option<&FileEncryptionProperties>,
) -> Option<ConfigFileEncryptionProperties> {
encryption.map(|fe| fe.into())
}

#[cfg(not(feature = "parquet_encryption"))]
pub fn map_encryption_to_config_encryption(
_encryption: Option<&FileEncryptionProperties>,
) -> Option<ConfigFileEncryptionProperties> {
None
}

#[cfg(feature = "parquet_encryption")]
pub fn map_config_decryption_to_decryption(
decryption: Option<&ConfigFileDecryptionProperties>,
) -> Option<FileDecryptionProperties> {
decryption.map(|fd| fd.clone().into())
}

#[cfg(not(feature = "parquet_encryption"))]
pub fn map_config_decryption_to_decryption(
_decryption: Option<&ConfigFileDecryptionProperties>,
) -> Option<FileDecryptionProperties> {
None
}
21 changes: 9 additions & 12 deletions datafusion/common/src/file_options/parquet_writer.rs
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ use crate::{

use arrow::datatypes::Schema;
// TODO: handle once deprecated
use crate::encryption::add_crypto_to_writer_properties;
#[allow(deprecated)]
use parquet::{
arrow::ARROW_SCHEMA_META_KEY,
Expand Down Expand Up @@ -100,11 +101,7 @@ impl TryFrom<&TableParquetOptions> for WriterPropertiesBuilder {

let mut builder = global.into_writer_properties_builder()?;

if let Some(file_encryption_properties) = &crypto.file_encryption {
builder = builder.with_file_encryption_properties(
file_encryption_properties.clone().into(),
);
}
builder = add_crypto_to_writer_properties(crypto, builder);

// check that the arrow schema is present in the kv_metadata, if configured to do so
if !global.skip_arrow_metadata
Expand Down Expand Up @@ -456,12 +453,10 @@ mod tests {
};
use std::collections::HashMap;

use crate::config::{
ConfigFileEncryptionProperties, ParquetColumnOptions, ParquetEncryptionOptions,
ParquetOptions,
};

use super::*;
use crate::config::{ParquetColumnOptions, ParquetEncryptionOptions, ParquetOptions};
#[cfg(feature = "parquet_encryption")]
use crate::encryption::map_encryption_to_config_encryption;

const COL_NAME: &str = "configured";

Expand Down Expand Up @@ -590,8 +585,10 @@ mod tests {
HashMap::from([(COL_NAME.into(), configured_col_props)])
};

let fep: Option<ConfigFileEncryptionProperties> =
props.file_encryption_properties().map(|fe| fe.into());
#[cfg(feature = "parquet_encryption")]
let fep = map_encryption_to_config_encryption(props.file_encryption_properties());
#[cfg(not(feature = "parquet_encryption"))]
let fep = None;

#[allow(deprecated)] // max_statistics_size
TableParquetOptions {
Expand Down
1 change: 1 addition & 0 deletions datafusion/common/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@ pub mod config;
pub mod cse;
pub mod diagnostic;
pub mod display;
pub mod encryption;
pub mod error;
pub mod file_options;
pub mod format;
Expand Down
9 changes: 9 additions & 0 deletions datafusion/core/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -61,13 +61,21 @@ default = [
"unicode_expressions",
"compression",
"parquet",
"parquet_encryption",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this really be a default feature? It's still marked experimental in arrow-rs.

Copy link
Contributor Author

@corwinjoy corwinjoy Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I set it to the default so that users can have this feature enabled as per the updated docs and available in the CLI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a question for @alamb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally prefer if it was not part of the default features but given it is already on by default on main, I think it is fine to leave it on by default in this PR

We can discuss disabling the feature by default as a follow on ticket / PR perhaps

Here are some notes I wrote about how to make it a non default feature:

  1. Don't add it to the default features
  2. Update the docs to mention the config setting requires the parquet_encryption feature --
    /// Options for configuring Parquet modular encryption
  3. Enable it in datafusion-cli (add parquet_encryption in https://github.com/apache/datafusion/blob/ca16255e725bf6676a8ddd4b9948496d99b5bd88/datafusion-cli/Cargo.toml#L45-L44)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would make for a good follow-up PR. Thanks for the helpful notes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made an issue for this for further discussion: #16777

"recursive_protection",
]
encoding_expressions = ["datafusion-functions/encoding_expressions"]
# Used for testing ONLY: causes all values to hash to the same value (test for collisions)
force_hash_collisions = ["datafusion-physical-plan/force_hash_collisions", "datafusion-common/force_hash_collisions"]
math_expressions = ["datafusion-functions/math_expressions"]
parquet = ["datafusion-common/parquet", "dep:parquet", "datafusion-datasource-parquet"]
parquet_encryption = [
"dep:parquet",
"parquet/encryption",
"datafusion-common/parquet_encryption",
"datafusion-datasource-parquet/parquet_encryption",
"dep:hex",
]
pyarrow = ["datafusion-common/pyarrow", "parquet"]
regex_expressions = [
"datafusion-functions/regex_expressions",
Expand Down Expand Up @@ -127,6 +135,7 @@ datafusion-session = { workspace = true }
datafusion-sql = { workspace = true }
flate2 = { version = "1.1.2", optional = true }
futures = { workspace = true }
hex = { workspace = true, optional = true }
itertools = { workspace = true }
log = { workspace = true }
object_store = { workspace = true }
Expand Down
1 change: 1 addition & 0 deletions datafusion/core/src/dataframe/parquet.rs
Original file line number Diff line number Diff line change
Expand Up @@ -247,6 +247,7 @@ mod tests {
Ok(())
}

#[cfg(feature = "parquet_encryption")]
#[tokio::test]
async fn roundtrip_parquet_with_encryption() -> Result<()> {
use parquet::encryption::decrypt::FileDecryptionProperties;
Expand Down
1 change: 1 addition & 0 deletions datafusion/core/src/test/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ use crate::test_util::{aggr_test_schema, arrow_test_data};
use arrow::array::{self, Array, ArrayRef, Decimal128Builder, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
#[cfg(feature = "compression")]
use datafusion_common::DataFusionError;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used by compression. Got warning via

cargo check --profile ci --no-default-features -p datafusion --features=parquet_encryption

Figured I may as well fix it while I am here.

use datafusion_datasource::source::DataSourceExec;

Expand Down
1 change: 1 addition & 0 deletions datafusion/core/tests/parquet/encryption.rs
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,7 @@ pub fn write_batches(
Ok(num_rows)
}

#[cfg(feature = "parquet_encryption")]
#[tokio::test]
async fn round_trip_encryption() {
let ctx: SessionContext = SessionContext::new();
Expand Down
8 changes: 8 additions & 0 deletions datafusion/datasource-parquet/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ datafusion-physical-plan = { workspace = true }
datafusion-pruning = { workspace = true }
datafusion-session = { workspace = true }
futures = { workspace = true }
hex = { workspace = true, optional = true }
itertools = { workspace = true }
log = { workspace = true }
object_store = { workspace = true }
Expand All @@ -65,3 +66,10 @@ workspace = true
[lib]
name = "datafusion_datasource_parquet"
path = "src/mod.rs"

[features]
parquet_encryption = [
"parquet/encryption",
"datafusion-common/parquet_encryption",
"dep:hex",
]
Loading