Skip to content

Conversation

@castedice
Copy link
Contributor

This PR is to address an issue that prevented the to_arrow method from handling binaries larger than 2GB when used as mentioned in #344.

With this change in place, all binary types must be defined via pa.large_binary when defining a pyarrow schema.

I considered leaving it as pa.binary and casting it in pyiceberg, but since defining it as pa.large_binary is essential for pyarrow to handle 2GB of data, I implemented it this way.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@castedice Thanks for raising this. I think this is fine since Polars does it as well:

python3
Python 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3, 4, 5, 6], "bar": [b"a", b"b", b"c", b"d", b"e", b"f"]}
... )
>>> df.to_arrow()
pyarrow.Table
foo: int64
bar: large_binary
----
foo: [[1,2,3,4,5,6]]
bar: [[61,62,63,64,65,66]]

@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Feb 10, 2024
@Fokko Fokko merged commit a576fc9 into apache:main Feb 10, 2024
@castedice
Copy link
Contributor Author

Thanks for review
This change will require a few changes to the documentation.
After checking the documentation, I'll create an additional PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants