-
Couldn't load subscription status.
- Fork 1.7k
Minor: Introduce utils::hash for StructArray #8552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, thanks @jayzhan211
datafusion/common/src/hash_utils.rs
Outdated
| }; | ||
|
|
||
| for i in valid_indices { | ||
| let mut values_hashes = vec![0u64; array.len()]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you create a new vec for each iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create each values_hashes for each column. And combine those hashed element in column to one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values_hashes can be created upfront, outside of the loop.
| let column_values = values[i].clone(); | ||
| // create hashes for each column | ||
| create_hashes(&[column_values], random_state, &mut values_hashes)?; | ||
| let hash = &mut hashes_buffer[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not correct, as i here to a column in the struct and hashes_buffer to a hash of one value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dandandan. Not fully got your point. I do consider each column as one value (iterate each row in one column with get one final hash hashes_buffer[i]).
i is the column index in the struct array and is also the hashes_buffer index.
For example, column1, null, column3, column4. We iterate 0, 2, and 3. And hash each of them to hashes_buffer[0], hashes_buffer[2], and hashes_buffer[3] respectively. A column-wise hashes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I mistake how the null buffer works in StructArray. It seems to be row-wise, but I consider it as column-wise...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks pretty much better now :)
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
3d7b5d3 to
fc36232
Compare
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
| let valid_indices: Vec<usize> = if let Some(nulls) = nulls { | ||
| nulls.valid_indices().collect() | ||
| } else { | ||
| (0..num_columns).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't yet seem correct, valid_indices contains the rows that are valid, not the columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hash each row now, so we iterate the valid row and hash each column at that row to one hash value.
| }; | ||
|
|
||
| let mut values_hashes = vec![0u64; array.len()]; | ||
| create_hashes(array.columns(), random_state, &mut values_hashes)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to use hashes_buffer here directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that, we should be able to remove the code related to valid_indices, as this already takes care of nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think so.
values_hashes: [9258723240401068087, 9258723240401068087, 8502738074356456021, 8502738074356456021, 4222447303697976283, 9753707356376286577]
hashes_buffer: [9258723240401091360, 9258723240401091360, 0, 8502738074356479294, 0, 0]
We can see that create_hashes does not consider if the row is valid or not. Therefore, we need to iterate valid_indices once .
| } | ||
| DataType::Struct(_) => { | ||
| let array = as_struct_array(array)?; | ||
| hash_struct_array(array, random_state, hashes_buffer)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should work?
| hash_struct_array(array, random_state, hashes_buffer)?; | |
| create_hashes(array.columns(), random_state, hashes_buffer)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we consider nulls, this doesn't work
Signed-off-by: jayzhan211 <[email protected]>
|
@Dandandan do you think this PR is ready to merge? |
|
@alamb @Dandandan Any update on this? |
|
Thank you @jayzhan211 @waynexia and @Dandandan |
|
I found the current implementation failed this Let me checkout the reason |
* hash struct Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * row-wise hash Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * create hashes once Signed-off-by: jayzhan211 <[email protected]> * cleanup Signed-off-by: jayzhan211 <[email protected]> * add comment Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]>
Which issue does this PR close?
Closes #.
We will need this for #7893
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?