General virtual columns support + row numbers as a first use-case #8715
Draft
vustef wants to merge 28 commits into apache:main from vustef:feature/parquet-virtual-row-numbers
+750
−40
28 commits
| Commit | Message | Author |
|---|---|---|
| f93d36e | Add support for file row numbers in Parquet readers | jkylling |
| e485c0b | Add Apache license header to row_number.rs | jkylling |
| 2a62009 | Run cargo format | jkylling |
| fb5126f | Change with_row_number_column to take impl Into<String> | jkylling |
| 5350728 | Change Option<String> -> Option<&str> in build_array_reader | jkylling |
| 188f350 | Replace ParquetError::RowGroupMetaDataMissingRowNumber with General | jkylling |
| 37a9d83 | Split test_create_array_reader test into two | jkylling |
| 41e38fe | first_row_number -> first_row_index | jkylling |
| 1a1e6b6 | Simplify RowNumberReader with iterators | jkylling |
| bcad87f | Merge remote-tracking branch 'origin/main' into feature/parquet-reade… | vustef |
| 89c1fd1 | add parquet-testing change from the merge | vustef |
| b0d53d0 | Fix test_arrow_reader_all_columns | vustef |
| 094ae81 | Fix first_row_number | vustef |
| a5858df | Rename to first_row_index consistently, remove Option. | vustef |
| 5e7d9a1 | revert parquet-testing update | vustef |
| 54c22c6 | Fix baselines in file::metadata::tests::test_memory_size | vustef |
| f05d470 | Fix encryption metadata and async tests. Those features and default f… | vustef |
| 11e4f39 | RowNumber extension type | vustef |
| d02c977 | using supplied_schema works | vustef |
| 6fecc17 | Don't modify parsing of parquet schema, virtual columns can only be a… | vustef |
| 1414421 | Reworked with_virtual_columns in options | vustef |
| 07eb467 | switch to ref to slice; cleanup with_row_number_columns; async tests … | vustef |
| af0e0f9 | Bring back optionality to first_row_index, for future consideration w… | vustef |
| 8bccd22 | Reexport | vustef |
| 65679ba | reexport all within virtual_type | vustef |
| 968d461 | pub mod virtual_type skipping experimental schema | vustef |
| 6144967 | Switch back to `virtual_type::*` for now; fix warnings on cargo test | vustef |
| 3af3ad7 | Fix `projected_fields` assertion in async reader | vustef |
    @@ -561,6 +561,7 @@ mod tests {
                schema,
                ProjectionMask::all(),
                file_metadata.key_value_metadata(),
                &[],
            )
            .unwrap();

    @@ -0,0 +1,84 @@
    // Licensed to the Apache Software Foundation (ASF) under one
    // or more contributor license agreements.  See the NOTICE file
    // distributed with this work for additional information
    // regarding copyright ownership.  The ASF licenses this file
    // to you under the Apache License, Version 2.0 (the
    // "License"); you may not use this file except in compliance
    // with the License.  You may obtain a copy of the License at
    //
    //   http://www.apache.org/licenses/LICENSE-2.0
    //
    // Unless required by applicable law or agreed to in writing,
    // software distributed under the License is distributed on an
    // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    // KIND, either express or implied.  See the License for the
    // specific language governing permissions and limitations
    // under the License.

    use crate::arrow::array_reader::ArrayReader;
    use crate::errors::{ParquetError, Result};
    use crate::file::metadata::RowGroupMetaData;
    use arrow_array::{ArrayRef, Int64Array};
    use arrow_schema::DataType;
    use std::any::Any;
    use std::sync::Arc;

    pub(crate) struct RowNumberReader {
        buffered_row_numbers: Vec<i64>,
        remaining_row_numbers: std::iter::Flatten<std::vec::IntoIter<std::ops::Range<i64>>>,
    }

    impl RowNumberReader {
        pub(crate) fn try_new<'a>(
            row_groups: impl Iterator<Item = &'a RowGroupMetaData>,
        ) -> Result<Self> {
            let ranges = row_groups
                .map(|rg| {
                    let first_row_index = rg.first_row_index().ok_or(ParquetError::General(
                        "Row group missing row number".to_string(),
                    ))?;
                    Ok(first_row_index..first_row_index + rg.num_rows())
                })
                .collect::<Result<Vec<_>>>()?;
            Ok(Self {
                buffered_row_numbers: Vec::new(),
                remaining_row_numbers: ranges.into_iter().flatten(),
            })
        }
    }

    impl ArrayReader for RowNumberReader {
        fn read_records(&mut self, batch_size: usize) -> Result<usize> {
            let starting_len = self.buffered_row_numbers.len();
            self.buffered_row_numbers
                .extend((&mut self.remaining_row_numbers).take(batch_size));
            Ok(self.buffered_row_numbers.len() - starting_len)
        }

        fn skip_records(&mut self, num_records: usize) -> Result<usize> {
            // TODO: Use advance_by when it stabilizes to improve performance
            Ok((&mut self.remaining_row_numbers).take(num_records).count())
        }

        fn as_any(&self) -> &dyn Any {
            self
        }

        fn get_data_type(&self) -> &DataType {
            &DataType::Int64
        }

        fn consume_batch(&mut self) -> Result<ArrayRef> {
            Ok(Arc::new(Int64Array::from_iter(
                self.buffered_row_numbers.drain(..),
            )))
        }

        fn get_def_levels(&self) -> Option<&[i16]> {
            None
        }

        fn get_rep_levels(&self) -> Option<&[i16]> {
            None
        }
    }

Review comment on the `// TODO: Use advance_by when it stabilizes to improve performance` line in `skip_records`:

TODO from original PR

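
The iterator technique used by `RowNumberReader` can be tried in isolation. The following is a minimal sketch using only the standard library, not the PR's code: `RowNumberSketch` is a hypothetical name, and plain `(first_row_index, num_rows)` tuples stand in for `RowGroupMetaData`. It shows how flattening one `Range<i64>` per row group yields file-level row numbers, and how reading and skipping both advance the same underlying iterator.

```rust
use std::iter::Flatten;
use std::ops::Range;
use std::vec::IntoIter;

/// Hypothetical stand-in for the reader in the diff; not part of the parquet crate.
pub struct RowNumberSketch {
    /// Row numbers read so far but not yet handed out.
    buffered: Vec<i64>,
    /// Row numbers of all selected row groups, in order, not yet read or skipped.
    remaining: Flatten<IntoIter<Range<i64>>>,
}

impl RowNumberSketch {
    /// Build from `(first_row_index, num_rows)` pairs, one per selected row group.
    /// These tuples are an assumption made for the sketch.
    pub fn new(row_groups: &[(i64, i64)]) -> Self {
        let ranges: Vec<Range<i64>> = row_groups
            .iter()
            .map(|&(first, num_rows)| first..first + num_rows)
            .collect();
        Self {
            buffered: Vec::new(),
            remaining: ranges.into_iter().flatten(),
        }
    }

    /// Buffer up to `batch_size` row numbers and return how many were buffered.
    pub fn read_records(&mut self, batch_size: usize) -> usize {
        let before = self.buffered.len();
        // `by_ref` borrows the iterator so `take` does not consume it.
        self.buffered.extend(self.remaining.by_ref().take(batch_size));
        self.buffered.len() - before
    }

    /// Discard up to `n` row numbers and return how many were discarded.
    pub fn skip_records(&mut self, n: usize) -> usize {
        self.remaining.by_ref().take(n).count()
    }

    /// Return everything buffered so far and leave the buffer empty.
    pub fn consume_batch(&mut self) -> Vec<i64> {
        std::mem::take(&mut self.buffered)
    }
}
```

With row groups `(0, 3)` and `(10, 2)`, reading two records, skipping two, and then reading the rest leaves `0, 1, 11` in the batch: the skip crosses the row-group boundary because the flattened iterator hides it.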
shouldn't remove this comment