
Delta Stats for binary columns are not truncated #1805

@emcake

Description

Environment

Delta-rs version: master

Binding: rust

Environment: local test


Bug

What happened:

When writing a file with large binary columns, the Delta log JSON for the commit is very large, because the statistics object carries the full, untruncated min/max values.
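
For scale, a minimal sketch of how to see the bloat (not from the original report; the helper name and the idea of measuring the commit files are assumptions): list the commit files under the table's `_delta_log` directory after running the repro below and print their sizes.

    use std::path::Path;

    // Hypothetical helper: report the size of each commit JSON under _delta_log so the
    // effect of the untruncated stats is visible. `table_path` is the table root used
    // by the repro test below.
    fn commit_json_sizes(table_path: &Path) -> std::io::Result<()> {
        for entry in std::fs::read_dir(table_path.join("_delta_log"))? {
            let entry = entry?;
            let path = entry.path();
            if path.extension().and_then(|e| e.to_str()) == Some("json") {
                println!("{}: {} bytes", path.display(), entry.metadata()?.len());
            }
        }
        Ok(())
    }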

What you expected to happen:

These columns are expected to receive truncated statistics, thanks to an earlier PR to arrow-rs (apache/arrow-rs#4389).
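
For reference, a minimal sketch of the writer-side knob involved (assumed usage; the function is only illustrative): parquet's `WriterProperties` expose the column index truncation length, which that PR applies when building the column index. The expectation was that the statistics surfaced into the Delta log would be capped the same way.

    use parquet::basic::Compression;
    use parquet::file::properties::{WriterProperties, DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH};

    // Illustrative only: the truncation length defaults to Some(64) and controls how
    // long the byte-array values written into the column index may be.
    fn example_writer_props() -> WriterProperties {
        WriterProperties::builder()
            .set_compression(Compression::SNAPPY)
            .set_column_index_truncate_length(DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH)
            .build()
    }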

How to reproduce it:

A test case (to put in stats.rs):

    #[tokio::test]
    async fn test_delta_stats_truncation() -> Result<(), crate::DeltaTableError> {
        let temp_dir = tempfile::tempdir().unwrap();
        let table_path = temp_dir.path().to_owned();

        let schema_fields = vec![
            crate::schema::SchemaField::new(
                "long_string".to_owned(),
                crate::SchemaDataType::primitive("string".to_owned()),
                false,
                Default::default(),
            ),
            crate::schema::SchemaField::new(
                "long_binary".to_owned(),
                crate::SchemaDataType::primitive("binary".to_owned()),
                false,
                Default::default(),
            ),
        ];

        let table = crate::operations::create::CreateBuilder::new()
            .with_table_name("temp")
            .with_location(table_path.to_str().unwrap())
            .with_columns(schema_fields.clone())
            .await?;
        let mut writer = RecordBatchWriter::for_table(&table).unwrap();
        writer = writer.with_writer_properties(
            WriterProperties::builder()
                .set_compression(Compression::SNAPPY)
                .set_max_row_group_size(128)
                .build(),
        );

        let fields = arrow::datatypes::Fields::from(
            schema_fields
                .into_iter()
                .map(|f| arrow::datatypes::Field::try_from(&f).unwrap())
                .collect::<Vec<_>>(),
        );

        // Make each value 10x the default truncation length (64 bytes).
        let long_field_len =
            10 * parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap();

        const ROW_COUNT: usize = 10;

        let mut string_builder = arrow::array::StringBuilder::new();
        let mut binary_builder = arrow::array::BinaryBuilder::new();

        for i in 0..ROW_COUNT {
            let long_string = std::iter::repeat(i.to_string())
                .take(long_field_len)
                .collect::<Vec<_>>()
                .join("");
            string_builder.append_value(&long_string);

            let long_binary = std::iter::repeat(i as u8)
                .take(long_field_len)
                .collect::<Vec<_>>();
            binary_builder.append_value(&long_binary);
        }

        let arrays: Vec<Arc<dyn arrow::array::Array>> = vec![
            Arc::new(string_builder.finish()),
            Arc::new(binary_builder.finish()),
        ];

        let file_contents: arrow::record_batch::RecordBatch =
            StructArray::new(fields, arrays, None).into();

        writer.write(file_contents).await?;

        let mut actions = writer.flush().await?;

        assert!(actions.len() == 1);

        let action = actions.remove(0);

        // The add action's stats should carry the (expected-to-be-truncated) min/max values.
        let stats = action.get_stats()?.expect("stats");

        match stats.min_values.get("long_string").unwrap() {
            ColumnValueStat::Value(serde_json::Value::String(s)) => {
                assert_eq!(
                    s.len(),
                    parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap()
                );
            }
            x => panic!("invalid stats format: {x:?}"),
        }

        match stats.min_values.get("long_binary").unwrap() {
            ColumnValueStat::Value(serde_json::Value::String(s)) => {
                assert_eq!(
                    s.len(),
                    parquet::file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH.unwrap()
                );
            }
            x => panic!("invalid stats format: {x:?}"),
        }

        Ok(())
    }

More details:

I think this is because the underlying parquet writer truncates the values it uses for the column index, but not the column-chunk metadata statistics. I'm going to open a companion issue against arrow-rs to track that. (EDIT: apache/arrow-rs#5037)
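
To see where the two diverge, here is a hedged sketch (the helper name and `parquet_path` are assumptions, not from the issue) that opens one of the parquet files written by the repro and prints the byte length of the column-chunk min/max statistics, which are what delta-rs copies into the Delta log stats; for the long string/binary columns these stay at full length even though the column index values are truncated.

    use parquet::file::reader::{FileReader, SerializedFileReader};

    // Hypothetical helper: dump the length of the column-chunk min/max statistics for
    // every column in every row group of a parquet file.
    fn print_column_chunk_stat_lengths(parquet_path: &str) -> Result<(), Box<dyn std::error::Error>> {
        let file = std::fs::File::open(parquet_path)?;
        let reader = SerializedFileReader::new(file)?;
        for rg in reader.metadata().row_groups() {
            for col in rg.columns() {
                if let Some(stats) = col.statistics() {
                    if stats.has_min_max_set() {
                        println!(
                            "{}: min {} bytes, max {} bytes",
                            col.column_path().string(),
                            stats.min_bytes().len(),
                            stats.max_bytes().len()
                        );
                    }
                }
            }
        }
        Ok(())
    }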
