-
Notifications
You must be signed in to change notification settings - Fork 253
Closed
Description
What is the problem the feature request solves?
We use Arrow IPC to write shuffle output. We create a new writer for each batch and this means that we seralize the schema for each batch.
let mut arrow_writer = StreamWriter::try_new(zstd::Encoder::new(output, 1)?, &batch.schema())?;
arrow_writer.write(batch)?;
arrow_writer.finish()?;The schema is guaranteed to be the same for every batch because the input is a DataFusion ExecutionPlan so we should be able to use a single writer for all batches and avoid the cost of serializing the schema each time.
Based on one benchmarks in #1180 I am seeing a 4x speedup in encoding time by re-using the writer.
Describe the potential solution
No response
Additional context
No response