Arrow: Support Large Binary when using `to_arrow` #409

castedice · 2024-02-10T12:17:59Z

This PR addresses an issue that prevented the `to_arrow` method from handling binary columns larger than 2GB, as reported in #344.

With this change in place, all binary types must be declared as `pa.large_binary` when defining a pyarrow schema.

I considered leaving it as `pa.binary` and casting it in pyiceberg, but since declaring the column as `pa.large_binary` is essential for pyarrow to handle more than 2GB of data, I implemented it this way.

Fokko

@castedice Thanks for raising this. I think this is fine since Polars does it as well:

python3
Python 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3, 4, 5, 6], "bar": [b"a", b"b", b"c", b"d", b"e", b"f"]}
... )
>>> df.to_arrow()
pyarrow.Table
foo: int64
bar: large_binary
----
foo: [[1,2,3,4,5,6]]
bar: [[61,62,63,64,65,66]]

castedice · 2024-02-10T16:06:43Z

Thanks for the review.
This change will require a few updates to the documentation.
After checking the documentation, I'll create an additional PR.

castedice and others added 2 commits February 10, 2024 12:10

Arrow: Support Large Binary

08f7926

Merge with binary

72df7c0

Fokko approved these changes Feb 10, 2024

Fokko added this to the PyIceberg 0.6.0 release milestone Feb 10, 2024

Fokko merged commit a576fc9 into apache:main Feb 10, 2024

Fokko mentioned this pull request Feb 16, 2024

Cannot load a binary column of many rows via the to_arrow method. #344

Closed

sungwy mentioned this pull request Jun 3, 2024

Upcasting and Downcasting inconsistencies with PyArrow Schema #791

Closed
