
Stop encoding schema with each batch in shuffle writer #1186

@andygrove

Description


What is the problem the feature request solves?

We use Arrow IPC to write shuffle output. We create a new writer for each batch, which means that we serialize the schema for each batch.

```rust
let mut arrow_writer = StreamWriter::try_new(zstd::Encoder::new(output, 1)?, &batch.schema())?;
arrow_writer.write(batch)?;
arrow_writer.finish()?;
```

The schema is guaranteed to be the same for every batch because the input is a DataFusion ExecutionPlan, so we should be able to use a single writer for all batches and avoid the cost of serializing the schema each time.

Based on a benchmark in #1180, I am seeing a 4x speedup in encoding time by re-using the writer.
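To illustrate the idea, here is a minimal sketch of the hoisted-writer pattern using a hypothetical `ReusableWriter` over `std::io::Write` as a simplified stand-in for Arrow's `StreamWriter`: the schema is encoded once at construction time, and per-batch writes append only batch data. This is not the actual Arrow IPC encoding, just an illustration of amortizing the schema cost across batches.

```rust
use std::io::Write;

// Hypothetical stand-in for Arrow's StreamWriter. The schema is
// encoded exactly once in the constructor; each subsequent write
// appends only batch data, so the schema cost is amortized.
struct ReusableWriter<W: Write> {
    inner: W,
}

impl<W: Write> ReusableWriter<W> {
    // Encode the schema once, up front.
    fn try_new(mut inner: W, schema: &str) -> std::io::Result<Self> {
        inner.write_all(schema.as_bytes())?;
        Ok(Self { inner })
    }

    // Per-batch writes carry no schema overhead.
    fn write(&mut self, batch: &[u8]) -> std::io::Result<()> {
        self.inner.write_all(batch)
    }

    // Consume the writer and return the underlying sink.
    fn finish(self) -> std::io::Result<W> {
        Ok(self.inner)
    }
}

fn main() -> std::io::Result<()> {
    // Create the writer once, outside the per-batch loop.
    let mut w = ReusableWriter::try_new(Vec::new(), "schema:")?;
    for batch in ["b1", "b2", "b3"] {
        w.write(batch.as_bytes())?; // schema is not re-encoded here
    }
    let out = w.finish()?;
    // The schema bytes appear once, regardless of batch count.
    println!("{}", String::from_utf8(out).unwrap());
    Ok(())
}
```

The key design point is the same as in the proposal: hoist writer construction (and therefore schema encoding) out of the per-batch loop and call `finish()` once at the end.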

Describe the potential solution

No response

Additional context

No response
