Faster ingestion from Parquet #346

@jonashaag


Question

I'm using the new pyiceberg write functionality. I wonder if there is any way to make it faster in my scenario:

I have around 1 TiB of Parquet files (zstd-compressed at level 3) that I want to ingest into Iceberg.

Table sizes are roughly power-law distributed: the largest table accounts for about 25 % of the total size, and there are ~100 tables.

Since Iceberg wants to repartition the data, I don't see a way to have it use my Parquet files as-is without rewriting them.
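
For context, this is roughly how I'm ingesting a single file today. It's a minimal sketch assuming the Arrow-based `Table.append` write path; the catalog name, table identifier, and file path are placeholders:

```python
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Placeholder catalog and table identifiers.
catalog = load_catalog("default")
table = catalog.load_table("db.my_table")

# Current approach: decode one source file into an Arrow table and let
# pyiceberg re-encode and write it. This re-encoding is the step I'd
# like to speed up.
table.append(pq.read_table("source/part-0001.parquet"))
```
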

Is it possible to use multiple cores for writing the Parquet files? I don't think PyArrow parallelizes a single writer natively, but it might be possible to run multiple PyArrow writers in parallel, roughly along the lines of the sketch below?
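
For illustration, this is what I mean by "multiple PyArrow writers": a sketch that fans the per-file rewrite out over a process pool, one independent writer per worker. Directory names are hypothetical, and the zstd settings just mirror my source data:

```python
import concurrent.futures as cf
from pathlib import Path

import pyarrow.parquet as pq

# Hypothetical directories, for illustration only.
SOURCE_DIR = Path("source_parquet")
TARGET_DIR = Path("rewritten_parquet")


def rewrite_one(src: Path) -> Path:
    """Run one independent PyArrow writer: read a source file and re-encode it."""
    dst = TARGET_DIR / src.name
    pq.write_table(pq.read_table(src), dst, compression="zstd", compression_level=3)
    return dst


if __name__ == "__main__":
    TARGET_DIR.mkdir(exist_ok=True)
    sources = sorted(SOURCE_DIR.glob("*.parquet"))
    # One worker process per core; each process owns its own writer, so the
    # (mostly single-threaded) Parquet encoding runs in parallel.
    with cf.ProcessPoolExecutor() as pool:
        for written in pool.map(rewrite_one, sources):
            print("wrote", written)
```

This only covers the Parquet re-encoding itself; the resulting files would still need to be committed to the Iceberg table afterwards.
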
