
Way to share SchemaDescriptorPtr across ParquetMetadata objects #5999

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In low-latency, Parquet-based query applications, it is important to be able to cache and reuse the ParquetMetaData from Parquet files, supplying it via ArrowReaderBuilder::new_with_metadata instead of re-reading and re-parsing it from the Parquet footer each time the data is read.

For many such systems (including InfluxDB 3.0), many of the files have the same schema, so storing a separate copy of the same schema information for each Parquet file is wasteful.

Describe the solution you'd like
I would like a way to share a SchemaDescPtr across ParquetMetaData objects. The schema is already wrapped in an Arc, so it is likely possible to avoid storing the same schema over and over again:

https://docs.rs/parquet/latest/src/parquet/file/metadata.rs.html#197
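
To illustrate why the Arc alone does not give sharing, here is a minimal sketch using only the standard library. `SchemaDescriptor`, `parse_schema`, and `share_if_equal` are hypothetical stand-ins, not parquet crate APIs. It assumes each parsed footer allocates its own descriptor, so two files with identical schemas end up with equal contents behind different pointers until one pointer is swapped for the other.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the parquet crate's schema descriptor.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct SchemaDescriptor {
    pub column_names: Vec<String>,
}

/// Simulates parsing a footer: every call allocates a fresh descriptor,
/// even when the contents match a previously parsed file.
pub fn parse_schema(columns: &[&str]) -> Arc<SchemaDescriptor> {
    Arc::new(SchemaDescriptor {
        column_names: columns.iter().map(|c| c.to_string()).collect(),
    })
}

/// Returns a clone of `existing` when `candidate` is structurally equal,
/// which drops the duplicate allocation; otherwise returns `candidate`.
pub fn share_if_equal(
    existing: &Arc<SchemaDescriptor>,
    candidate: Arc<SchemaDescriptor>,
) -> Arc<SchemaDescriptor> {
    if **existing == *candidate {
        Arc::clone(existing)
    } else {
        candidate
    }
}
```

After `share_if_equal`, `Arc::ptr_eq` on the two pointers holds and only one descriptor stays allocated for both files.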

Describe alternatives you've considered

Perhaps we could add an API like with_schema to ParquetMetaData:

impl ParquetMetaData {
    ...
    /// Set the internal schema pointers to `schema_descr`
    pub fn with_schema(self, schema_descr: SchemaDescPtr) -> Self {
        ..
    }
    ...
}

It could be used like this:

let mut metadata: ParquetMetaData = ...; // load metadata from a parquet file
// Check if we already have the same schema loaded
if let Some(existing_schema) = find_existing_schema(&catalog, &metadata) {
    // if so, use the existing schema
    metadata = metadata.with_schema(existing_schema);
}
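
One way `find_existing_schema` and the catalog could look is sketched below. Everything here is an assumption for illustration: `SchemaCatalog`, `intern`, `schema_of`, and the stand-in `SchemaDescriptor` are hypothetical, and the lookup takes a schema rather than the whole metadata object. The idea is to keep one canonical pointer per distinct schema in a `HashSet` and hand out clones of it.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for the parquet crate's schema descriptor.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct SchemaDescriptor {
    pub column_names: Vec<String>,
}

/// Hypothetical helper that builds a freshly allocated schema.
pub fn schema_of(columns: &[&str]) -> Arc<SchemaDescriptor> {
    Arc::new(SchemaDescriptor {
        column_names: columns.iter().map(|c| c.to_string()).collect(),
    })
}

/// Hypothetical catalog holding one shared pointer per distinct schema.
#[derive(Default)]
pub struct SchemaCatalog {
    schemas: HashSet<Arc<SchemaDescriptor>>,
}

impl SchemaCatalog {
    /// Returns the shared pointer for a structurally equal schema, if any.
    pub fn find_existing_schema(
        &self,
        schema: &SchemaDescriptor,
    ) -> Option<Arc<SchemaDescriptor>> {
        // `Arc<T>` hashes and compares like `T`, so lookup by `&T` works.
        self.schemas.get(schema).cloned()
    }

    /// Returns the canonical pointer for `schema`, registering it if new.
    pub fn intern(&mut self, schema: Arc<SchemaDescriptor>) -> Arc<SchemaDescriptor> {
        if let Some(existing) = self.find_existing_schema(&schema) {
            return existing;
        }
        self.schemas.insert(Arc::clone(&schema));
        schema
    }

    /// Number of distinct schemas currently held.
    pub fn distinct_schemas(&self) -> usize {
        self.schemas.len()
    }
}
```

Hashing the full schema on every lookup has a cost, so a real cache might key on a cheaper fingerprint instead; that choice is outside what this sketch covers.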

Additional context

This infrastructure is a natural follow-on to #1729 to track the memory used.

This API would likely be tricky to implement, given that there are several references to the schema in ParquetMetaData child fields (e.g. https://docs.rs/parquet/latest/src/parquet/file/metadata.rs.html#299).
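
The difficulty can be sketched with stand-in types. The layout below is an assumption for illustration only, not the parquet crate's actual definitions: it supposes each row group holds its own pointer to the schema and each column chunk holds a pointer to its column descriptor. Under that assumption, `with_schema` has to repoint every level, otherwise the old allocation stays alive through the child fields.

```rust
use std::sync::Arc;

// Hypothetical stand-ins; the field layout is assumed for illustration.
#[derive(Debug, PartialEq)]
pub struct ColumnDescriptor {
    pub name: String,
}

#[derive(Debug, PartialEq)]
pub struct SchemaDescriptor {
    pub columns: Vec<Arc<ColumnDescriptor>>,
}

pub struct ColumnChunkMeta {
    pub column_descr: Arc<ColumnDescriptor>,
}

pub struct RowGroupMeta {
    pub schema_descr: Arc<SchemaDescriptor>,
    pub columns: Vec<ColumnChunkMeta>,
}

pub struct FileMeta {
    pub schema_descr: Arc<SchemaDescriptor>,
    pub row_groups: Vec<RowGroupMeta>,
}

/// Simulates parsing one file: all pointers refer to fresh allocations.
pub fn parse_file(columns: &[&str], num_row_groups: usize) -> FileMeta {
    let schema = Arc::new(SchemaDescriptor {
        columns: columns
            .iter()
            .map(|c| Arc::new(ColumnDescriptor { name: c.to_string() }))
            .collect(),
    });
    let row_groups = (0..num_row_groups)
        .map(|_| RowGroupMeta {
            schema_descr: Arc::clone(&schema),
            columns: schema
                .columns
                .iter()
                .map(|d| ColumnChunkMeta { column_descr: Arc::clone(d) })
                .collect(),
        })
        .collect();
    FileMeta { schema_descr: schema, row_groups }
}

impl FileMeta {
    /// Repoints the file, row group, and column chunk references at
    /// `shared`. Leaves `self` unchanged if the schemas are not equal.
    pub fn with_schema(mut self, shared: Arc<SchemaDescriptor>) -> Self {
        if *self.schema_descr != *shared {
            return self;
        }
        for rg in &mut self.row_groups {
            rg.schema_descr = Arc::clone(&shared);
            for (chunk, descr) in rg.columns.iter_mut().zip(&shared.columns) {
                chunk.column_descr = Arc::clone(descr);
            }
        }
        self.schema_descr = shared;
        self
    }
}
```

Silently keeping the original schema on a mismatch is a design choice made here to match the proposed `-> Self` signature; returning a `Result` would make the failure visible to callers.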

Labels: enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate)
