Skip to content

Commit 6965fd3

Browse files
corwinjoyadamreeveCopilot
authored
feat: Add a configuration to make parquet encryption optional (#16649)
* Initial commit to form PR for datafusion encryption support * Add tests for encryption configuration * Apply cargo fmt * Add a roundtrip encryption test to the parquet tests. * cargo fmt * Update test to add decryption parameter to called functions. * Try to get DataFrame.write_parquet to work with encryption. Doesn't quite, column encryption is broken. * Update datafusion/datasource-parquet/src/opener.rs Co-authored-by: Adam Reeve <[email protected]> * Update datafusion/datasource-parquet/src/source.rs Co-authored-by: Adam Reeve <[email protected]> * Fix write test in parquet.rs * Simplify encryption test. Remove unused imports. * Run cargo fmt. * Further streamline roundtrip test. * Change From methods for FileEncryptionProperties and FileDecryptionProperties to use references. * Change encryption config to directly hold column keys using custom config fields. * Fix generated field names in visit for encryptor and decryptor to use "." instead of "::" * 1. Disable parallel writes with enccryption. 2. Fixed unused header warning in config.rs. 3. Fix test case in encryption.rs to call conversion to ConfigFileDecryption properties correctly. * cargo fmt * Update datafusion/common/src/file_options/parquet_writer.rs Co-authored-by: Copilot <[email protected]> * fix variables shown in information schema test. * Backout bad suggestion from copilot * Remove unused serde reference Add an example to read and write encrypted parquet files. * cargo fmt * change file_format.rs to use global encryption options in struct. * Turn off page_index for encrypted example. Get encrypted example working with filter. * Tidy up example output. * Add missing license. Run taplo format * Update configs.md by running dev/update_config_docs.sh * Cargo fmt + clippy changes. * Add filter test for encrypted files. * Cargo clippy changes. * Fix link in README.md * Add issue tag for parallel writes. * Move file encryption and decryption properties out of global options * Use config_namespace_with_hashmap for column encryption/decryption props * Remove outdated docs on crypto settings. Signed-off-by: Corwin Joy <[email protected]> * 1. Add docs for using encryption configuration. 2. Add example SQL for using encryption from CLI. 3. Fix removed variables in test for configuration information. 4. Clippy and cargo fmt. Signed-off-by: Corwin Joy <[email protected]> * Update code to add missing ParquetOpener parameter due to merge from main Signed-off-by: Corwin Joy <[email protected]> * Add CLI documentation for Parquet options and provide an encryption example Signed-off-by: Corwin Joy <[email protected]> * Use ConfigFileDecryptionProperties in ParquetReadOptions Signed-off-by: Adam Reeve <[email protected]> * Implement default for ConfigFileEncryptionProperties Signed-off-by: Corwin Joy <[email protected]> * Add sqllogictest for parquet with encryption Signed-off-by: Corwin Joy <[email protected]> * Apply prettier changes from CI Signed-off-by: Corwin Joy <[email protected]> * Begin adding config for encryption. * Fix merge errors. * Fix config dependency in datafusion/core/Cargo.toml * ci reformat * Fix missing Cargo dependency. * Update Cargo.toml Co-authored-by: Copilot <[email protected]> * Revert bad suggestion from copilot Signed-off-by: Corwin Joy <[email protected]> * taplo format Cargo.toml Signed-off-by: Corwin Joy <[email protected]> * 1. Add CI test for parquet_encryption. 2. Document new configuration Signed-off-by: Corwin Joy <[email protected]> * Refactor cfg parquet_encryption feature into a few functions defined in encryption.rs Signed-off-by: Corwin Joy <[email protected]> * Remove #allow(unusued) that is no longer needed Signed-off-by: Corwin Joy <[email protected]> * cargo fmt Signed-off-by: Corwin Joy <[email protected]> * Fix error when cfg parquet_encryption is disabled. Signed-off-by: Corwin Joy <[email protected]> * cargo fmt Signed-off-by: Corwin Joy <[email protected]> * Remove unused mut. Signed-off-by: Corwin Joy <[email protected]> --------- Signed-off-by: Corwin Joy <[email protected]> Signed-off-by: Adam Reeve <[email protected]> Co-authored-by: Adam Reeve <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: Adam Reeve <[email protected]>
1 parent 04b006c commit 6965fd3

File tree

17 files changed

+155
-51
lines changed

17 files changed

+155
-51
lines changed

.github/workflows/rust.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -218,6 +218,8 @@ jobs:
218218
run: cargo check --profile ci --no-default-features -p datafusion --features=string_expressions
219219
- name: Check datafusion (unicode_expressions)
220220
run: cargo check --profile ci --no-default-features -p datafusion --features=unicode_expressions
221+
- name: Check parquet encryption (parquet_encryption)
222+
run: cargo check --profile ci --no-default-features -p datafusion --features=parquet_encryption
221223

222224
# Check datafusion-functions crate features
223225
#

Cargo.lock

Lines changed: 2 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -150,6 +150,7 @@ env_logger = "0.11"
150150
futures = "0.3"
151151
half = { version = "2.6.0", default-features = false }
152152
hashbrown = { version = "0.14.5", features = ["raw"] }
153+
hex = { version = "0.4.3" }
153154
indexmap = "2.10.0"
154155
itertools = "0.14"
155156
log = "^0.4"

README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -120,6 +120,7 @@ Default features:
120120
- `datetime_expressions`: date and time functions such as `to_timestamp`
121121
- `encoding_expressions`: `encode` and `decode` functions
122122
- `parquet`: support for reading the [Apache Parquet] format
123+
- `parquet_encryption`: support for using [Parquet Modular Encryption]
123124
- `regex_expressions`: regular expression functions, such as `regexp_match`
124125
- `unicode_expressions`: Include unicode aware functions such as `character_length`
125126
- `unparser`: enables support to reverse LogicalPlans back into SQL
@@ -134,6 +135,7 @@ Optional features:
134135

135136
[apache avro]: https://avro.apache.org/
136137
[apache parquet]: https://parquet.apache.org/
138+
[parquet modular encryption]: https://parquet.apache.org/docs/file-format/data-pages/encryption/
137139

138140
## DataFusion API Evolution and Deprecation Guidelines
139141

datafusion/common/Cargo.toml

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -40,9 +40,15 @@ name = "datafusion_common"
4040
[features]
4141
avro = ["apache-avro"]
4242
backtrace = []
43+
parquet_encryption = [
44+
"parquet",
45+
"parquet/encryption",
46+
"dep:hex",
47+
]
4348
pyarrow = ["pyo3", "arrow/pyarrow", "parquet"]
4449
force_hash_collisions = []
4550
recursive_protection = ["dep:recursive"]
51+
parquet = ["dep:parquet"]
4652

4753
[dependencies]
4854
ahash = { workspace = true }
@@ -58,7 +64,7 @@ base64 = "0.22.1"
5864
chrono = { workspace = true }
5965
half = { workspace = true }
6066
hashbrown = { workspace = true }
61-
hex = "0.4.3"
67+
hex = { workspace = true, optional = true }
6268
indexmap = { workspace = true }
6369
libc = "0.2.174"
6470
log = { workspace = true }

datafusion/common/src/config.rs

Lines changed: 8 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,8 @@
1919
2020
use arrow_ipc::CompressionType;
2121

22+
#[cfg(feature = "parquet_encryption")]
23+
use crate::encryption::{FileDecryptionProperties, FileEncryptionProperties};
2224
use crate::error::_config_err;
2325
use crate::parsers::CompressionTypeVariant;
2426
use crate::utils::get_available_parallelism;
@@ -29,12 +31,8 @@ use std::error::Error;
2931
use std::fmt::{self, Display};
3032
use std::str::FromStr;
3133

32-
#[cfg(feature = "parquet")]
34+
#[cfg(feature = "parquet_encryption")]
3335
use hex;
34-
#[cfg(feature = "parquet")]
35-
use parquet::encryption::decrypt::FileDecryptionProperties;
36-
#[cfg(feature = "parquet")]
37-
use parquet::encryption::encrypt::FileEncryptionProperties;
3836

3937
/// A macro that wraps a configuration struct and automatically derives
4038
/// [`Default`] and [`ConfigField`] for it, allowing it to be used
@@ -2148,7 +2146,7 @@ impl ConfigField for ConfigFileEncryptionProperties {
21482146
}
21492147
}
21502148

2151-
#[cfg(feature = "parquet")]
2149+
#[cfg(feature = "parquet_encryption")]
21522150
impl From<ConfigFileEncryptionProperties> for FileEncryptionProperties {
21532151
fn from(val: ConfigFileEncryptionProperties) -> Self {
21542152
let mut fep = FileEncryptionProperties::builder(
@@ -2194,7 +2192,7 @@ impl From<ConfigFileEncryptionProperties> for FileEncryptionProperties {
21942192
}
21952193
}
21962194

2197-
#[cfg(feature = "parquet")]
2195+
#[cfg(feature = "parquet_encryption")]
21982196
impl From<&FileEncryptionProperties> for ConfigFileEncryptionProperties {
21992197
fn from(f: &FileEncryptionProperties) -> Self {
22002198
let (column_names_vec, column_keys_vec, column_metas_vec) = f.column_keys();
@@ -2308,7 +2306,7 @@ impl ConfigField for ConfigFileDecryptionProperties {
23082306
}
23092307
}
23102308

2311-
#[cfg(feature = "parquet")]
2309+
#[cfg(feature = "parquet_encryption")]
23122310
impl From<ConfigFileDecryptionProperties> for FileDecryptionProperties {
23132311
fn from(val: ConfigFileDecryptionProperties) -> Self {
23142312
let mut column_names: Vec<&str> = Vec::new();
@@ -2342,7 +2340,7 @@ impl From<ConfigFileDecryptionProperties> for FileDecryptionProperties {
23422340
}
23432341
}
23442342

2345-
#[cfg(feature = "parquet")]
2343+
#[cfg(feature = "parquet_encryption")]
23462344
impl From<&FileDecryptionProperties> for ConfigFileDecryptionProperties {
23472345
fn from(f: &FileDecryptionProperties) -> Self {
23482346
let (column_names_vec, column_keys_vec) = f.column_keys();
@@ -2688,7 +2686,7 @@ mod tests {
26882686
);
26892687
}
26902688

2691-
#[cfg(feature = "parquet")]
2689+
#[cfg(feature = "parquet_encryption")]
26922690
#[test]
26932691
fn parquet_table_encryption() {
26942692
use crate::config::{
Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,76 @@
1+
// Licensed to the Apache Software Foundation (ASF) under one
2+
// or more contributor license agreements. See the NOTICE file
3+
// distributed with this work for additional information
4+
// regarding copyright ownership. The ASF licenses this file
5+
// to you under the Apache License, Version 2.0 (the
6+
// "License"); you may not use this file except in compliance
7+
// with the License. You may obtain a copy of the License at
8+
//
9+
// http://www.apache.org/licenses/LICENSE-2.0
10+
//
11+
// Unless required by applicable law or agreed to in writing,
12+
// software distributed under the License is distributed on an
13+
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
// KIND, either express or implied. See the License for the
15+
// specific language governing permissions and limitations
16+
// under the License.
17+
18+
// Support optional features for encryption in Parquet files.
19+
//! This module provides types and functions related to encryption in Parquet files.
20+
21+
#[cfg(feature = "parquet_encryption")]
22+
pub use parquet::encryption::decrypt::FileDecryptionProperties;
23+
#[cfg(feature = "parquet_encryption")]
24+
pub use parquet::encryption::encrypt::FileEncryptionProperties;
25+
26+
#[cfg(not(feature = "parquet_encryption"))]
27+
pub struct FileDecryptionProperties;
28+
#[cfg(not(feature = "parquet_encryption"))]
29+
pub struct FileEncryptionProperties;
30+
31+
#[cfg(feature = "parquet")]
32+
use crate::config::ParquetEncryptionOptions;
33+
pub use crate::config::{ConfigFileDecryptionProperties, ConfigFileEncryptionProperties};
34+
#[cfg(feature = "parquet")]
35+
use parquet::file::properties::WriterPropertiesBuilder;
36+
37+
#[cfg(feature = "parquet")]
38+
pub fn add_crypto_to_writer_properties(
39+
#[allow(unused)] crypto: &ParquetEncryptionOptions,
40+
#[allow(unused_mut)] mut builder: WriterPropertiesBuilder,
41+
) -> WriterPropertiesBuilder {
42+
#[cfg(feature = "parquet_encryption")]
43+
if let Some(file_encryption_properties) = &crypto.file_encryption {
44+
builder = builder
45+
.with_file_encryption_properties(file_encryption_properties.clone().into());
46+
}
47+
builder
48+
}
49+
50+
#[cfg(feature = "parquet_encryption")]
51+
pub fn map_encryption_to_config_encryption(
52+
encryption: Option<&FileEncryptionProperties>,
53+
) -> Option<ConfigFileEncryptionProperties> {
54+
encryption.map(|fe| fe.into())
55+
}
56+
57+
#[cfg(not(feature = "parquet_encryption"))]
58+
pub fn map_encryption_to_config_encryption(
59+
_encryption: Option<&FileEncryptionProperties>,
60+
) -> Option<ConfigFileEncryptionProperties> {
61+
None
62+
}
63+
64+
#[cfg(feature = "parquet_encryption")]
65+
pub fn map_config_decryption_to_decryption(
66+
decryption: Option<&ConfigFileDecryptionProperties>,
67+
) -> Option<FileDecryptionProperties> {
68+
decryption.map(|fd| fd.clone().into())
69+
}
70+
71+
#[cfg(not(feature = "parquet_encryption"))]
72+
pub fn map_config_decryption_to_decryption(
73+
_decryption: Option<&ConfigFileDecryptionProperties>,
74+
) -> Option<FileDecryptionProperties> {
75+
None
76+
}

datafusion/common/src/file_options/parquet_writer.rs

Lines changed: 9 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,7 @@ use crate::{
2727

2828
use arrow::datatypes::Schema;
2929
// TODO: handle once deprecated
30+
use crate::encryption::add_crypto_to_writer_properties;
3031
#[allow(deprecated)]
3132
use parquet::{
3233
arrow::ARROW_SCHEMA_META_KEY,
@@ -100,11 +101,7 @@ impl TryFrom<&TableParquetOptions> for WriterPropertiesBuilder {
100101

101102
let mut builder = global.into_writer_properties_builder()?;
102103

103-
if let Some(file_encryption_properties) = &crypto.file_encryption {
104-
builder = builder.with_file_encryption_properties(
105-
file_encryption_properties.clone().into(),
106-
);
107-
}
104+
builder = add_crypto_to_writer_properties(crypto, builder);
108105

109106
// check that the arrow schema is present in the kv_metadata, if configured to do so
110107
if !global.skip_arrow_metadata
@@ -456,12 +453,10 @@ mod tests {
456453
};
457454
use std::collections::HashMap;
458455

459-
use crate::config::{
460-
ConfigFileEncryptionProperties, ParquetColumnOptions, ParquetEncryptionOptions,
461-
ParquetOptions,
462-
};
463-
464456
use super::*;
457+
use crate::config::{ParquetColumnOptions, ParquetEncryptionOptions, ParquetOptions};
458+
#[cfg(feature = "parquet_encryption")]
459+
use crate::encryption::map_encryption_to_config_encryption;
465460

466461
const COL_NAME: &str = "configured";
467462

@@ -590,8 +585,10 @@ mod tests {
590585
HashMap::from([(COL_NAME.into(), configured_col_props)])
591586
};
592587

593-
let fep: Option<ConfigFileEncryptionProperties> =
594-
props.file_encryption_properties().map(|fe| fe.into());
588+
#[cfg(feature = "parquet_encryption")]
589+
let fep = map_encryption_to_config_encryption(props.file_encryption_properties());
590+
#[cfg(not(feature = "parquet_encryption"))]
591+
let fep = None;
595592

596593
#[allow(deprecated)] // max_statistics_size
597594
TableParquetOptions {

datafusion/common/src/lib.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,7 @@ pub mod config;
4141
pub mod cse;
4242
pub mod diagnostic;
4343
pub mod display;
44+
pub mod encryption;
4445
pub mod error;
4546
pub mod file_options;
4647
pub mod format;

datafusion/core/Cargo.toml

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -61,13 +61,21 @@ default = [
6161
"unicode_expressions",
6262
"compression",
6363
"parquet",
64+
"parquet_encryption",
6465
"recursive_protection",
6566
]
6667
encoding_expressions = ["datafusion-functions/encoding_expressions"]
6768
# Used for testing ONLY: causes all values to hash to the same value (test for collisions)
6869
force_hash_collisions = ["datafusion-physical-plan/force_hash_collisions", "datafusion-common/force_hash_collisions"]
6970
math_expressions = ["datafusion-functions/math_expressions"]
7071
parquet = ["datafusion-common/parquet", "dep:parquet", "datafusion-datasource-parquet"]
72+
parquet_encryption = [
73+
"dep:parquet",
74+
"parquet/encryption",
75+
"datafusion-common/parquet_encryption",
76+
"datafusion-datasource-parquet/parquet_encryption",
77+
"dep:hex",
78+
]
7179
pyarrow = ["datafusion-common/pyarrow", "parquet"]
7280
regex_expressions = [
7381
"datafusion-functions/regex_expressions",
@@ -127,6 +135,7 @@ datafusion-session = { workspace = true }
127135
datafusion-sql = { workspace = true }
128136
flate2 = { version = "1.1.2", optional = true }
129137
futures = { workspace = true }
138+
hex = { workspace = true, optional = true }
130139
itertools = { workspace = true }
131140
log = { workspace = true }
132141
object_store = { workspace = true }

0 commit comments

Comments
 (0)