[DISCUSSION] Parquet Metadata Improvements

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
As we work on various features of Parquet metadata it is becoming clear that working with the current code organization is challenging.

I wanted to write down some of my thoughts about how it all fits together.

Here are some challenges:

1. The naming is confusing: https://github.com/apache/arrow-rs/issues/6097
2. There is no easy way to write the metadata to bytes outside the context of a Parquet file: https://github.com/apache/arrow-rs/pull/6000
3. It is complicated to understand how to read optional parts of the metadata that are not inlined (e.g. OffsetIndexes): https://github.com/apache/arrow-rs/pull/5887
4. If we ever wanted to speed up metadata decoding (e.g. https://github.com/apache/arrow-rs/issues/5854), it would be hard with the current structure
5. There is not always a 1-1 correspondence between the structures in `file::metadata` and the thrift structures in `format::metadata`

**Describe the solution you'd like**
I would like to propose that we:
1. Continue to clarify the distinction between `file::metadata` and `format::metadata`
2. Improve the API for translating back and forth between the `file::metadata` structures and bytes, and de-emphasize the conversion to and from the thrift structures


Maybe how these pieces relate is clear to others, but it is not clear to me.

Here is how I see the structures involved:

```text
                                ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐               ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    
                                  ┌──────────────┐                         ┌───────────────────────┐ │   
                                │ │ ColumnIndex  │        │               ││    ParquetMetaData    │     
                                  └──────────────┘                         └───────────────────────┘ │   
  ┌──────────────┐              │ ┌────────────────┐      │               │┌───────────────────────┐     
  │   ..0x24..   │  ◀────────▶    │  OffsetIndex   │          ◀────────▶   │    ParquetMetaData    │ │   
  └──────────────┘              │ └────────────────┘      │               │└───────────────────────┘     
                                           ...                                       ...             │   
                                │ ┌──────────────────┐    │               │ ┌──────────────────┐         
bytes                             │  FileMetaData*   │                      │  FileMetaData*   │     │   
(thrift encoded)                │ └──────────────────┘    │               │ └──────────────────┘         
                                 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─                 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘   
                                                                                                         
                                     format::meta structures               file::metadata structures         
                                                                                                         
                                                                                                         
                                                     * Same name, different struct                       
                                                                                                         
```

I would like to focus on improving the API for going back and forth between bytes and the `file::metadata` structures:



```text
                                                  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    
                                                   ┌───────────────────────┐ │   
┌──────────────┐                                  ││    ParquetMetaData    │     
│   ..0x24..   │           ◀────────▶              └───────────────────────┘ │   
└──────────────┘                                  │┌───────────────────────┐     
                                                   │    ParquetMetaData    │ │   
                        Would like to focus       │└───────────────────────┘     
 bytes                  on this API to/from                                  │   
 (thrift encoded)       bytes and the             │ ┌──────────────────┐         
                        file::metadata              │  FileMetaData*   │     │   
                                                  │ └──────────────────┘         
                                                   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘   
                                                                                 
                                                   file::metadata structures     
                                                                                 
```

**Describe alternatives you've considered**
I think we probably need at least two different APIs:

# Reading
1. One that reads from a `[u8]` buffered in memory ([decode_footer](https://docs.rs/parquet/latest/parquet/file/footer/fn.decode_footer.html) and [decode_metadata](https://docs.rs/parquet/latest/parquet/file/footer/fn.decode_metadata.html))
2. One that reads from an `AsyncReader` or something equivalent ([`MetadataLoader`](https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html) may be enough, or may need some more information)
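
To make the in-memory case concrete, here is a minimal std-only sketch of locating the thrift-encoded metadata inside a byte buffer. It relies only on the footer layout from the Parquet format (a 4-byte little-endian metadata length followed by the `PAR1` magic at the end of an unencrypted file). `metadata_bytes` and `FooterError` are hypothetical names for illustration, not part of the `parquet` crate, and actually decoding the returned bytes is out of scope here.

```rust
/// Magic bytes that end an unencrypted Parquet file.
const PARQUET_MAGIC: &[u8] = b"PAR1";
/// Size of the footer tail: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;

/// Hypothetical error type for this sketch; not part of the `parquet` crate.
#[derive(Debug, PartialEq)]
pub enum FooterError {
    /// The buffer is too small to hold the 8-byte footer tail.
    TooShort,
    /// The buffer does not end with the `PAR1` magic.
    BadMagic,
    /// The declared metadata length exceeds the bytes available.
    LengthOutOfRange,
}

/// Hypothetical helper: given a buffer holding the end of a Parquet file,
/// return the slice that contains the thrift-encoded metadata.
pub fn metadata_bytes(buf: &[u8]) -> Result<&[u8], FooterError> {
    if buf.len() < FOOTER_SIZE {
        return Err(FooterError::TooShort);
    }
    // Split off the last 8 bytes: [len: u32 LE][magic: "PAR1"].
    let (body, tail) = buf.split_at(buf.len() - FOOTER_SIZE);
    if &tail[4..] != PARQUET_MAGIC {
        return Err(FooterError::BadMagic);
    }
    let len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as usize;
    if len > body.len() {
        return Err(FooterError::LengthOutOfRange);
    }
    // The metadata sits immediately before the footer tail.
    Ok(&body[body.len() - len..])
}
```

An API shaped like this works on any `&[u8]`, so the same code path could serve a memory-mapped file, a cached footer, or bytes fetched from object storage.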

# Writing 
1. One that writes to `[u8]` (https://github.com/apache/arrow-rs/issues/6002)
2. One that writes to an `AsyncWriter`, perhaps
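
As a rough illustration of the synchronous write side, the sketch below writes already-encoded metadata plus the 8-byte footer tail to anything implementing `std::io::Write`, which covers an in-memory `Vec<u8>` as well as a file. `write_metadata_with_footer` is a hypothetical name for illustration, not an existing `parquet` API, and producing the thrift-encoded bytes themselves is assumed to happen elsewhere.

```rust
use std::io::{self, Write};

/// Hypothetical helper, not part of the `parquet` crate: writes the
/// thrift-encoded metadata in `encoded`, then the footer tail
/// (4-byte little-endian length followed by the `PAR1` magic).
/// Returns the total number of bytes written.
pub fn write_metadata_with_footer<W: Write>(mut sink: W, encoded: &[u8]) -> io::Result<usize> {
    // The footer stores the metadata length as a u32, so reject larger inputs.
    if encoded.len() > u32::MAX as usize {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "metadata does not fit in a u32 length",
        ));
    }
    let len = encoded.len() as u32;
    sink.write_all(encoded)?;
    sink.write_all(&len.to_le_bytes())?;
    sink.write_all(b"PAR1")?;
    Ok(encoded.len() + 8)
}
```

Passing `&mut Vec<u8>` as the sink gives the "write to bytes" case directly; an async variant would need a different trait and is not sketched here.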


**Additional context**
