@trueleo trueleo commented Sep 8, 2022

Description

Instead of pre-listing all the valid prefixes before query execution and letting DataFusion infer the schema, it is better to pass the schema we already have for a given stream and let DataFusion do the heavy lifting. If the stream info does not have a schema for a given stream, we return early with an error telling the user to post events to this log stream first.

Goal

  • Avoid unnecessary network calls, since DataFusion already does the listing underneath and ignores files that are not present.

Solution

Check for the schema in the stream metadata and attach it to the query itself, so that it can be used during query execution.
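
The flow described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `Schema`, `QueryError`, `Query`, and `build_query` are hypothetical names, the metadata store is assumed to be a plain map from stream name to an optional schema, and the DataFusion table registration is left out entirely.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical stand-in for an Arrow schema: ordered (column name, type name) pairs.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<(String, String)>,
}

/// Hypothetical error for a stream that has no schema recorded yet.
#[derive(Debug, PartialEq)]
pub enum QueryError {
    SchemaNotFound(String),
}

impl fmt::Display for QueryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            QueryError::SchemaNotFound(stream) => write!(
                f,
                "no schema found for stream '{}'; post events to this log stream first",
                stream
            ),
        }
    }
}

impl std::error::Error for QueryError {}

/// Hypothetical query object that carries the schema alongside the SQL text,
/// so the execution layer does not need to infer it from stored files.
#[derive(Debug, PartialEq)]
pub struct Query {
    pub sql: String,
    pub stream: String,
    pub schema: Schema,
}

/// Look up the schema in the (assumed) in-memory metadata map and attach it
/// to the query. Returns early with an error when the stream is unknown or
/// has not received any events yet (schema is `None`).
pub fn build_query(
    metadata: &HashMap<String, Option<Schema>>,
    stream: &str,
    sql: &str,
) -> Result<Query, QueryError> {
    let schema = metadata
        .get(stream)
        .and_then(|schema| schema.clone())
        .ok_or_else(|| QueryError::SchemaNotFound(stream.to_string()))?;

    Ok(Query {
        sql: sql.to_string(),
        stream: stream.to_string(),
        schema,
    })
}
```

In this sketch the early return happens before any storage access, which is the point of the change: a stream without a schema never reaches the query engine, and a stream with one hands its schema over directly instead of having it inferred.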


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

@trueleo trueleo requested a review from nitisht September 8, 2022 14:31
@nitisht nitisht merged commit a7abb5d into parseablehq:main Sep 8, 2022
@trueleo trueleo deleted the s3_query_listing branch September 9, 2022 04:07