Skip to content

Conversation

@vinjai
Copy link
Contributor

@vinjai vinjai commented Jun 29, 2024

Fixes: #271

@vinjai vinjai marked this pull request as ready for review July 1, 2024 03:11
@vinjai vinjai marked this pull request as draft July 1, 2024 03:12
@vinjai vinjai marked this pull request as ready for review July 5, 2024 00:07
@vinjai
Copy link
Contributor Author

vinjai commented Jul 5, 2024

This PR solves for:

  1. Writing sorted datasets to a partitioned or non-partitioned iceberg table.
  2. Generating manifests with correct sort-order-id.
  3. Integration tests to make sure sorted datasets are generated similar to spark sorting.

Decisions taken:

  • If a sort transformation is not supported in PyIceberg, we will raise a warning related to the same and move ahead by writing the unsorted data with unsorted sort-order-id.

What is not in the scope of this PR?

  • Performance improvement of the new sort function. (We will raise a separate issue for the same.)

@vinjai
Copy link
Contributor Author

vinjai commented Jul 5, 2024

@Fokko This PR is ready for review

)


def get_sort_indices_arrow_table(arrow_table: pa.Table, sort_seq: List[Tuple[str, PyArrowSortOptions]]) -> List[int]:
Copy link
Contributor Author

@vinjai vinjai Jul 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to clarify on the separate implementation for sort_indices other than the one provided by pyarrow.
This is because pyarrow sort_indices or Sort Options only supports one order for null placement across keys.
More details here:

While, the iceberg spec doesn't discriminate of having different null ordering across keys: https://iceberg.apache.org/spec/#sort-orders

This function specifically helps to implement the above functionality by sorting across keys and utilizing the stable nature of the sort_indices algo from pyarrow.


We can raise another issue to improve the performance of this function.


In future, if pyarrow sort_indices does support different null ordering across, we can mark this function as obsolete and keep the implementation clean in the iceberg table append and overwrite methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support writing to a table with sort-order

2 participants