Skip to content

Conversation

@jayzhan211
Copy link
Contributor

Which issue does this PR close?

Closes #.
We will need this for #7893

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @jayzhan211

};

for i in valid_indices {
let mut values_hashes = vec![0u64; array.len()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you create a new vec for each iteration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create each values_hashes for each column. And combine those hashed element in column to one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values_hashes can be created upfront, outside of the loop.

@alamb alamb requested a review from Dandandan December 15, 2023 19:14
let column_values = values[i].clone();
// create hashes for each column
create_hashes(&[column_values], random_state, &mut values_hashes)?;
let hash = &mut hashes_buffer[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not correct, as i here to a column in the struct and hashes_buffer to a hash of one value?

Copy link
Contributor Author

@jayzhan211 jayzhan211 Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan. Not fully got your point. I do consider each column as one value (iterate each row in one column with get one final hash hashes_buffer[i]).

i is the column index in the struct array and is also the hashes_buffer index.
For example, column1, null, column3, column4. We iterate 0, 2, and 3. And hash each of them to hashes_buffer[0], hashes_buffer[2], and hashes_buffer[3] respectively. A column-wise hashes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I mistake how the null buffer works in StructArray. It seems to be row-wise, but I consider it as column-wise...

Copy link
Contributor Author

@jayzhan211 jayzhan211 Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks pretty much better now :)

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
let valid_indices: Vec<usize> = if let Some(nulls) = nulls {
nulls.valid_indices().collect()
} else {
(0..num_columns).collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't yet seem correct, valid_indices contains the rows that are valid, not the columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hash each row now, so we iterate the valid row and hash each column at that row to one hash value.

};

let mut values_hashes = vec![0u64; array.len()];
create_hashes(array.columns(), random_state, &mut values_hashes)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to use hashes_buffer here directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, we should be able to remove the code related to valid_indices, as this already takes care of nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think so.

values_hashes: [9258723240401068087, 9258723240401068087, 8502738074356456021, 8502738074356456021, 4222447303697976283, 9753707356376286577]
hashes_buffer: [9258723240401091360, 9258723240401091360, 0, 8502738074356479294, 0, 0]

We can see that create_hashes does not consider if the row is valid or not. Therefore, we need to iterate valid_indices once .

}
DataType::Struct(_) => {
let array = as_struct_array(array)?;
hash_struct_array(array, random_state, hashes_buffer)?;
Copy link
Contributor

@Dandandan Dandandan Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work?

Suggested change
hash_struct_array(array, random_state, hashes_buffer)?;
create_hashes(array.columns(), random_state, hashes_buffer)?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we consider nulls, this doesn't work

Signed-off-by: jayzhan211 <[email protected]>
@alamb
Copy link
Contributor

alamb commented Dec 20, 2023

@Dandandan do you think this PR is ready to merge?

@jayzhan211
Copy link
Contributor Author

@alamb @Dandandan Any update on this?

@alamb alamb merged commit 1179a76 into apache:main Jan 3, 2024
@alamb
Copy link
Contributor

alamb commented Jan 3, 2024

Thank you @jayzhan211 @waynexia and @Dandandan

@jayzhan211
Copy link
Contributor Author

I found the current implementation failed this

    #[test]
    // Tests actual values of hashes, which are different if forcing collisions
    #[cfg(not(feature = "force_hash_collisions"))]
    fn create_hashes_for_struct_arrays_more_column_than_row() {
        let struct_array = StructArray::from(
            vec![
                (
                    Arc::new(Field::new("bool", DataType::Boolean, false)),
                    Arc::new(BooleanArray::from(vec![
                        false, false
                    ])) as ArrayRef,
                ),
                (
                    Arc::new(Field::new("i32-1", DataType::Int32, false)),
                    Arc::new(Int32Array::from(vec![10, 10])) as ArrayRef,
                ),
                (
                    Arc::new(Field::new("i32-2", DataType::Int32, false)),
                    Arc::new(Int32Array::from(vec![10, 10])) as ArrayRef,
                ),
                (
                    Arc::new(Field::new("i32-3", DataType::Int32, false)),
                    Arc::new(Int32Array::from(vec![10, 10])) as ArrayRef,
                ),
            ],
        );

        assert!(struct_array.is_valid(0));
        assert!(struct_array.is_valid(1));

        let array = Arc::new(struct_array) as ArrayRef;

        let random_state = RandomState::with_seeds(0, 0, 0, 0);
        let mut hashes = vec![0; array.len()];
        create_hashes(&[array], &random_state, &mut hashes).unwrap();
        assert_eq!(hashes[0], hashes[1]);
    }

Let me checkout the reason

thinkharderdev pushed a commit to coralogix/arrow-datafusion that referenced this pull request Feb 26, 2024
* hash struct

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* row-wise hash

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* create hashes once

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* add comment

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants