# Environment

**Delta-rs version**: master

**Binding**: rust

**Environment**: local test
# Bug

**What happened**:
When writing a file with large binary columns, the Delta log JSON for the commit is very large due to a large statistics object.
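For illustration, a hypothetical, abbreviated sketch of the per-file stats string embedded in the commit's add action (field names per the Delta protocol; values elided here):

```json
{
  "numRecords": 10,
  "minValues": { "long_string": "000…", "long_binary": "…" },
  "maxValues": { "long_string": "999…", "long_binary": "…" },
  "nullCount": { "long_string": 0, "long_binary": 0 }
}
```

In the failing case, `minValues` and `maxValues` carry the full-length column values (640 bytes each in the test below) instead of truncated prefixes.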
**What you expected to happen**:
The statistics for these columns were expected to be truncated, following a PR to arrow-rs (apache/arrow-rs#4389).
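For context, that PR truncates the values written to the parquet column index, controlled by a writer property. A minimal sketch of the knob, with the default length (64) spelled out explicitly:

```rust
use parquet::file::properties::WriterProperties;

// Column-index values longer than this are truncated on write; passing
// None disables truncation. The default is
// DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, i.e. Some(64).
let props = WriterProperties::builder()
    .set_column_index_truncate_length(Some(64))
    .build();
```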
**How to reproduce it**:

A test case (to put in `stats.rs`):
```rust
// The `use` statements below are an assumption about what stats.rs needs;
// adjust paths to the module you place the test in. `RecordBatchWriter`,
// `ColumnValueStat`, and the writer trait providing write/flush are
// assumed to already be in scope.
use std::sync::Arc;

use arrow::array::StructArray;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

#[tokio::test]
async fn test_delta_stats_truncation() -> Result<(), crate::DeltaTableError> {
    let temp_dir = tempfile::tempdir().unwrap();
    let table_path = temp_dir.path().to_owned();

    // Two non-nullable columns whose values will far exceed the parquet
    // column-index truncation length.
    let schema_fields = vec![
        crate::schema::SchemaField::new(
            "long_string".to_owned(),
            crate::SchemaDataType::primitive("string".to_owned()),
            false,
            Default::default(),
        ),
        crate::schema::SchemaField::new(
            "long_binary".to_owned(),
            crate::SchemaDataType::primitive("binary".to_owned()),
            false,
            Default::default(),
        ),
    ];
    let table = crate::operations::create::CreateBuilder::new()
        .with_table_name("temp")
        .with_location(table_path.to_str().unwrap())
        .with_columns(schema_fields.clone())
        .await?;

    let mut writer = RecordBatchWriter::for_table(&table).unwrap();
    writer = writer.with_writer_properties(
        WriterProperties::builder()
            .set_compression(Compression::SNAPPY)
            .set_max_row_group_size(128)
            .build(),
    );

    let fields = arrow::datatypes::Fields::from(
        schema_fields
            .into_iter()
            .map(|f| arrow::datatypes::Field::try_from(&f).unwrap())
            .collect::<Vec<_>>(),
    );

    // Make each value 10x longer than the truncation length so untruncated
    // statistics are unmistakable.
    let long_field_len =
        10 * parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap();
    const ROW_COUNT: usize = 10;
    let mut string_builder = arrow::array::StringBuilder::new();
    let mut binary_builder = arrow::array::BinaryBuilder::new();
    for i in 0..ROW_COUNT {
        let long_string = i.to_string().repeat(long_field_len);
        string_builder.append_value(&long_string);
        let long_binary = vec![i as u8; long_field_len];
        binary_builder.append_value(&long_binary);
    }
    let arrays: Vec<Arc<dyn arrow::array::Array>> = vec![
        Arc::new(string_builder.finish()),
        Arc::new(binary_builder.finish()),
    ];
    let file_contents: arrow::record_batch::RecordBatch =
        StructArray::new(fields, arrays, None).into();

    writer.write(file_contents).await?;
    let mut actions = writer.flush().await?;
    assert_eq!(actions.len(), 1);
    let action = actions.remove(0);

    // The min values recorded in the commit stats should be truncated to
    // the parquet default length, but currently are not.
    let stats = action.get_stats()?.expect("stats");
    match stats.min_values.get("long_string").unwrap() {
        ColumnValueStat::Value(serde_json::Value::String(s)) => {
            assert_eq!(
                s.len(),
                parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap()
            );
        }
        x => panic!("invalid stats format: {x:?}"),
    }
    match stats.min_values.get("long_binary").unwrap() {
        ColumnValueStat::Value(serde_json::Value::String(s)) => {
            assert_eq!(
                s.len(),
                parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap()
            );
        }
        x => panic!("invalid stats format: {x:?}"),
    }
    Ok(())
}
```

**More details**:
I think this is because the underlying parquet writer truncates the values it uses for the column index, but not the column metadata statistics. I'm going to open a companion issue on arrow-rs to track that. (EDIT: apache/arrow-rs#5037)
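One way to confirm where the untruncated values live is to read the parquet footer directly. A minimal diagnostic sketch (the file name is hypothetical) that prints the length of the min/max bytes stored in the column chunk metadata, which, unlike the column index, is not truncated on write:

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};

fn dump_stats_lengths() -> Result<(), Box<dyn std::error::Error>> {
    // Open one of the data files written by the test above.
    let file = std::fs::File::open("part-00000.snappy.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    for rg in reader.metadata().row_groups() {
        for col in rg.columns() {
            if let Some(stats) = col.statistics() {
                if stats.has_min_max_set() {
                    // For the test data these come back at full length
                    // rather than the 64-byte truncated prefix.
                    println!(
                        "{}: min {} bytes, max {} bytes",
                        col.column_descr().name(),
                        stats.min_bytes().len(),
                        stats.max_bytes().len(),
                    );
                }
            }
        }
    }
    Ok(())
}
```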