Question
I'm using the new pyiceberg write functionality. I wonder if there is any way to make it faster in my scenario:
I have around 1 TiB of Parquet files (zstd, compression level 3) that I want to ingest into Iceberg.
Table sizes are roughly power-law distributed: the largest table is about 25% of the total size, and there are roughly 100 tables.
Since Iceberg wants to repartition the data, I don't see a way to have it use my existing Parquet files without rewriting them.
Is it possible to use multiple cores for writing the Parquet files? I don't think that's something PyArrow supports natively, but it might be possible to run multiple PyArrow writers in parallel?
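To make that concrete, here is a rough, untested sketch of what I mean by "multiple PyArrow writers": one worker process per table, each reading its own Parquet files and appending via pyiceberg. The catalog name, namespace, and the table-to-file mapping are placeholders for my setup; I don't know whether this is the intended way to use the API.

```python
# Sketch only: parallelize across tables with one process per table.
# Catalog name, namespace, and the table -> parquet-files mapping are placeholders.
from concurrent.futures import ProcessPoolExecutor

import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog


def ingest_table(table_name: str, parquet_paths: list[str]) -> str:
    # Load the catalog inside the worker so nothing unpicklable crosses process boundaries.
    catalog = load_catalog("default")  # placeholder catalog name
    table = catalog.load_table(f"my_namespace.{table_name}")  # placeholder namespace
    for path in parquet_paths:
        # read_table decompresses the zstd Parquet file into memory;
        # append then rewrites it according to the Iceberg table's spec.
        arrow_table = pq.read_table(path)
        table.append(arrow_table)
    return table_name


if __name__ == "__main__":
    # Placeholder mapping of table name -> source Parquet files.
    tables = {
        "events": ["s3://bucket/events/part-0.parquet"],
        "users": ["s3://bucket/users/part-0.parquet"],
    }
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(ingest_table, name, paths) for name, paths in tables.items()]
        for fut in futures:
            print(f"finished {fut.result()}")
```

One concern with this approach: appending file by file creates one Iceberg snapshot per file, and the largest table (25% of the data) still runs in a single process, so I'm not sure how much this actually helps.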