
Delta Stats for binary columns are not truncated #1805

@emcake

Description

Environment

Delta-rs version: master

Binding: rust

Environment: local test


Bug

What happened:

When writing a file with large binary columns, the Delta log JSON for the commit is very large, because the statistics object carries the full, untruncated min/max values.
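
For scale, a minimal sketch of how to see the bloat (not from the original report; the helper name and the idea of measuring the commit files are assumptions): list the commit files under the table's `_delta_log` directory after running the repro below and print their sizes.

    use std::path::Path;

    // Hypothetical helper: report the size of each commit JSON under _delta_log so the
    // effect of the untruncated stats is visible. `table_path` is the table root used
    // by the repro test below.
    fn commit_json_sizes(table_path: &Path) -> std::io::Result<()> {
        for entry in std::fs::read_dir(table_path.join("_delta_log"))? {
            let entry = entry?;
            let path = entry.path();
            if path.extension().and_then(|e| e.to_str()) == Some("json") {
                println!("{}: {} bytes", path.display(), entry.metadata()?.len());
            }
        }
        Ok(())
    }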

What you expected to happen:

These columns are expected to receive truncated statistics, thanks to an earlier PR to arrow-rs (apache/arrow-rs#4389).
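
For reference, a minimal sketch of the writer-side knob involved (assumed usage; the function is only illustrative): parquet's `WriterProperties` expose the column index truncation length, which that PR applies when building the column index. The expectation was that the statistics surfaced into the Delta log would be capped the same way.

    use parquet::basic::Compression;
    use parquet::file::properties::{WriterProperties, DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH};

    // Illustrative only: the truncation length defaults to Some(64) and controls how
    // long the byte-array values written into the column index may be.
    fn example_writer_props() -> WriterProperties {
        WriterProperties::builder()
            .set_compression(Compression::SNAPPY)
            .set_column_index_truncate_length(DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH)
            .build()
    }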

How to reproduce it:

A test case (to put in stats.rs):

    #[tokio::test]
    async fn test_delta_stats_truncation() -> Result<(), crate::DeltaTableError> {
        let temp_dir = tempfile::tempdir().unwrap();
        let table_path = temp_dir.path().to_owned();

        let schema_fields = vec![
            crate::schema::SchemaField::new(
                "long_string".to_owned(),
                crate::SchemaDataType::primitive("string".to_owned()),
                false,
                Default::default(),
            ),
            crate::schema::SchemaField::new(
                "long_binary".to_owned(),
                crate::SchemaDataType::primitive("binary".to_owned()),
                false,
                Default::default(),
            ),
        ];

        let table = crate::operations::create::CreateBuilder::new()
            .with_table_name("temp")
            .with_location(table_path.to_str().unwrap())
            .with_columns(schema_fields.clone())
            .await?;
        let mut writer = RecordBatchWriter::for_table(&table).unwrap();
        writer = writer.with_writer_properties(
            WriterProperties::builder()
                .set_compression(Compression::SNAPPY)
                .set_max_row_group_size(128)
                .build(),
        );

        let fields = arrow::datatypes::Fields::from(
            schema_fields
                .into_iter()
                .map(|f| arrow::datatypes::Field::try_from(&f).unwrap())
                .collect::<Vec<_>>(),
        );

        // Make each value 10x the default truncation length (64 bytes).
        let long_field_len =
            10 * parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap();

        const ROW_COUNT: usize = 10;

        let mut string_builder = arrow::array::StringBuilder::new();
        let mut binary_builder = arrow::array::BinaryBuilder::new();

        for i in 0..ROW_COUNT {
            let long_string = std::iter::repeat(i.to_string())
                .take(long_field_len)
                .collect::<Vec<_>>()
                .join("");
            string_builder.append_value(&long_string);

            let long_binary = std::iter::repeat(i as u8)
                .take(long_field_len)
                .collect::<Vec<_>>();
            binary_builder.append_value(&long_binary);
        }

        let arrays: Vec<Arc<dyn arrow::array::Array>> = vec![
            Arc::new(string_builder.finish()),
            Arc::new(binary_builder.finish()),
        ];

        let file_contents: arrow::record_batch::RecordBatch =
            StructArray::new(fields, arrays, None).into();

        writer.write(file_contents).await?;

        let mut actions = writer.flush().await?;

        assert!(actions.len() == 1);

        let action = actions.remove(0);

        // The add action's stats should carry the (expected-to-be-truncated) min/max values.
        let stats = action.get_stats()?.expect("stats");

        match stats.min_values.get("long_string").unwrap() {
            ColumnValueStat::Value(serde_json::Value::String(s)) => {
                assert_eq!(
                    s.len(),
                    parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap()
                );
            }
            x => panic!("invalid stats format: {x:?}"),
        }

        match stats.min_values.get("long_binary").unwrap() {
            ColumnValueStat::Value(serde_json::Value::String(s)) => {
                assert_eq!(
                    s.len(),
                    parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap()
                );
            }
            x => panic!("invalid stats format: {x:?}"),
        }

        Ok(())
    }

More details:

I think this is because the underlying parquet writer truncates the values it uses for the column index, but not the column-chunk metadata statistics. I'm going to open a companion issue against arrow-rs to track that. (EDIT: apache/arrow-rs#5037)
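
To see where the two diverge, here is a hedged sketch (the helper name and `parquet_path` are assumptions, not from the issue) that opens one of the parquet files written by the repro and prints the byte length of the column-chunk min/max statistics, which are what delta-rs copies into the Delta log stats; for the long string/binary columns these stay at full length even though the column index values are truncated.

    use parquet::file::reader::{FileReader, SerializedFileReader};

    // Hypothetical helper: dump the length of the column-chunk min/max statistics for
    // every column in every row group of a parquet file.
    fn print_column_chunk_stat_lengths(parquet_path: &str) -> Result<(), Box<dyn std::error::Error>> {
        let file = std::fs::File::open(parquet_path)?;
        let reader = SerializedFileReader::new(file)?;
        for rg in reader.metadata().row_groups() {
            for col in rg.columns() {
                if let Some(stats) = col.statistics() {
                    if stats.has_min_max_set() {
                        println!(
                            "{}: min {} bytes, max {} bytes",
                            col.column_path().string(),
                            stats.min_bytes().len(),
                            stats.max_bytes().len()
                        );
                    }
                }
            }
        }
        Ok(())
    }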
