-
Notifications
You must be signed in to change notification settings - Fork 381
Implement Sorted Writes #871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
This PR solves for:
Decisions taken:
What is not in the scope of this PR?
|
|
@Fokko This PR is ready for review |
| ) | ||
|
|
||
|
|
||
| def get_sort_indices_arrow_table(arrow_table: pa.Table, sort_seq: List[Tuple[str, PyArrowSortOptions]]) -> List[int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to clarify on the separate implementation for sort_indices other than the one provided by pyarrow.
This is because pyarrow sort_indices or Sort Options only supports one order for null placement across keys.
More details here:
- https://arrow.apache.org/docs/python/generated/pyarrow.compute.sort_indices.html
- https://arrow.apache.org/docs/python/generated/pyarrow.compute.SortOptions.html
While, the iceberg spec doesn't discriminate of having different null ordering across keys: https://iceberg.apache.org/spec/#sort-orders
This function specifically helps to implement the above functionality by sorting across keys and utilizing the stable nature of the sort_indices algo from pyarrow.
We can raise another issue to improve the performance of this function.
In future, if pyarrow sort_indices does support different null ordering across, we can mark this function as obsolete and keep the implementation clean in the iceberg table append and overwrite methods.
Fixes: #271