Faster ingestion from Parquet #346

@jonashaag


Question

I'm using the new pyiceberg write functionality. I wonder if there is any way to make it faster in my scenario:

I have around 1 TiB of Parquet files (zstd-compressed at level 3) that I want to ingest into Iceberg.

Table sizes are roughly power-law distributed: the largest table accounts for about 25 % of the total size, and there are ~100 tables.

Since Iceberg wants to repartition the data, I don't see a way to have it use my Parquet files as-is without rewriting them.
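
For context, this is roughly how I'm ingesting a single file today. It's a minimal sketch assuming the Arrow-based `Table.append` write path; the catalog name, table identifier, and file path are placeholders:

```python
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

# Placeholder catalog and table identifiers.
catalog = load_catalog("default")
table = catalog.load_table("db.my_table")

# Current approach: decode one source file into an Arrow table and let
# pyiceberg re-encode and write it. This re-encoding is the step I'd
# like to speed up.
table.append(pq.read_table("source/part-0001.parquet"))
```
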

Is it possible to use multiple cores for writing the Parquet files? I don't think PyArrow parallelizes a single writer natively, but it might be possible to run multiple PyArrow writers in parallel, roughly along the lines of the sketch below?
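
For illustration, this is what I mean by "multiple PyArrow writers": a sketch that fans the per-file rewrite out over a process pool, one independent writer per worker. Directory names are hypothetical, and the zstd settings just mirror my source data:

```python
import concurrent.futures as cf
from pathlib import Path

import pyarrow.parquet as pq

# Hypothetical directories, for illustration only.
SOURCE_DIR = Path("source_parquet")
TARGET_DIR = Path("rewritten_parquet")


def rewrite_one(src: Path) -> Path:
    """Run one independent PyArrow writer: read a source file and re-encode it."""
    dst = TARGET_DIR / src.name
    pq.write_table(pq.read_table(src), dst, compression="zstd", compression_level=3)
    return dst


if __name__ == "__main__":
    TARGET_DIR.mkdir(exist_ok=True)
    sources = sorted(SOURCE_DIR.glob("*.parquet"))
    # One worker process per core; each process owns its own writer, so the
    # (mostly single-threaded) Parquet encoding runs in parallel.
    with cf.ProcessPoolExecutor() as pool:
        for written in pool.map(rewrite_one, sources):
            print("wrote", written)
```

This only covers the Parquet re-encoding itself; the resulting files would still need to be committed to the Iceberg table afterwards.
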
