trueleo
Contributor

@trueleo trueleo commented Nov 28, 2023

Fixes #XXXX.

Description

This PR introduces a new table format that keeps track of all the data files in the data storage. The format is inspired by Apache Iceberg, so it uses a similar naming scheme.

  • The snapshot, which is stored in the stream metadata file, is the main entry point to a table.
  • A snapshot is essentially a list of manifest file URLs, each with primary time statistics used to prune manifests during a query.
  • A manifest file lists all the actual data files along with their file-level statistics.

Currently, one manifest file is generated per top-level partition (i.e., date).
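A minimal sketch of how these pieces could fit together, for readers unfamiliar with the Iceberg-style layout. All type, field, and method names here (`TimeBounds`, `ManifestItem`, `Snapshot`, `DataFile`, `Manifest`, `manifests_for`, `files_for`) are illustrative assumptions and not necessarily the ones used in `server/src/catalog`; timestamps are assumed to be epoch milliseconds, and the particular file-level statistics shown are guesses.

```rust
/// Inclusive time range in epoch milliseconds (assumed representation).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeBounds {
    pub lower: i64,
    pub upper: i64,
}

impl TimeBounds {
    /// True when the two inclusive ranges share at least one instant.
    pub fn overlaps(&self, other: &TimeBounds) -> bool {
        self.lower <= other.upper && other.lower <= self.upper
    }
}

/// One snapshot entry: where a manifest lives and the time range it covers.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ManifestItem {
    pub manifest_path: String,
    pub time_bounds: TimeBounds,
}

/// Entry point to a table; per the description it lives in the stream metadata file.
#[derive(Debug, Clone, Default)]
pub struct Snapshot {
    pub manifest_list: Vec<ManifestItem>,
}

impl Snapshot {
    /// Prune manifests: keep only those whose time range can match the query.
    pub fn manifests_for(&self, query: &TimeBounds) -> Vec<&ManifestItem> {
        self.manifest_list
            .iter()
            .filter(|item| item.time_bounds.overlaps(query))
            .collect()
    }
}

/// A data file with hypothetical file-level statistics.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataFile {
    pub file_path: String,
    pub num_rows: u64,
    pub time_bounds: TimeBounds,
}

/// A manifest lists the data files of one top-level partition (one date).
#[derive(Debug, Clone, Default)]
pub struct Manifest {
    pub files: Vec<DataFile>,
}

impl Manifest {
    /// Second pruning step: keep only files whose statistics overlap the query.
    pub fn files_for(&self, query: &TimeBounds) -> Vec<&DataFile> {
        self.files
            .iter()
            .filter(|file| file.time_bounds.overlaps(query))
            .collect()
    }
}
```

Under these assumptions, a query first calls `manifests_for` on the snapshot to skip whole dates, then loads only the surviving manifests and calls `files_for` to skip individual files.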


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

@nitisht nitisht marked this pull request as ready for review December 1, 2023 17:45
Member

@nitisht nitisht left a comment

This cannot query any data that was ingested by older versions.

Contributor

@theteachr theteachr left a comment

Use documentation comments.

trueleo and others added 5 commits December 6, 2023 15:51
This PR introduces a new table format that keeps track of data files in the data storage.
The format is inspired by Apache Iceberg, so it uses a similar naming scheme.
A snapshot is the main entry point to a table.
A snapshot consists of a list of manifest file URLs and primary time stats for pruning.
A manifest file lists all the actual data files along with their file-level statistics.
Currently, one manifest file is generated per top-level partition (i.e., date).
Update server/src/catalog.rs

Co-authored-by: Nick <[email protected]>
Signed-off-by: Satyam Singh <[email protected]>

Update server/src/catalog/manifest.rs

Co-authored-by: Nick <[email protected]>
Signed-off-by: Satyam Singh <[email protected]>

Update server/src/catalog/column.rs

Co-authored-by: Nick <[email protected]>
Signed-off-by: Satyam Singh <[email protected]>

Update server/src/handlers/http/query.rs

Co-authored-by: Nick <[email protected]>
Signed-off-by: Satyam Singh <[email protected]>

Update server/src/catalog/manifest.rs

Co-authored-by: Nick <[email protected]>
Signed-off-by: Satyam Singh <[email protected]>
@trueleo trueleo changed the title from "Catalog" to "Introduce custom table catalog format" Dec 6, 2023
@nitisht nitisht requested review from nitisht and theteachr December 6, 2023 10:47
Member

@nitisht nitisht left a comment

LGTM & Tested

@nitisht nitisht merged commit 3b98dd8 into parseablehq:main Dec 7, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Dec 7, 2023
@trueleo trueleo deleted the catalog branch December 14, 2023 09:03


3 participants