-
Notifications
You must be signed in to change notification settings - Fork 1k
Support writing GeospatialStatistics in Parquet writer #8524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @paleolimbot, looks pretty good on a first pass. I just want to make sure that the size statistics are written properly when geo stats are enabled.
parquet/src/column/writer/encoder.rs
Outdated
if let Some(var_bytes) = T::T::variable_length_bytes(slice) { | ||
*self.variable_length_bytes.get_or_insert(0) += var_bytes; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should execute regardless of whether geo stats are enabled. The variable_length_bytes
are ultimately written to the SizeStatistics
which are useful even without min/max statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
parquet/tests/geospatial.rs
Outdated
drop(file_writer); | ||
|
||
// Check that statistics exist in thrift output | ||
thrift_metadata.row_groups[0].columns[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heads up that when the thrift stuff merges this will no longer be a format::FileMetaData
but file::metadata::ParquetMetaData
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I removed these assertions so that they won't break when the thrift stuff merges (although there will be a few logical type constructors that will need to be updated).
Thank you for the review! I will clean this up on Monday and add a few more tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! I think this is ready for review!
#[test] | ||
fn test_roundtrip_statistics_geospatial() { | ||
let path = format!( | ||
"{}/geospatial/geospatial.parquet", | ||
arrow::util::test_util::parquet_test_data(), | ||
); | ||
|
||
test_roundtrip_statistics(&path, 2); | ||
} | ||
|
||
#[test] | ||
fn test_roundtrip_geospatial_with_nan() { | ||
let path = format!( | ||
"{}/geospatial/geospatial-with-nan.parquet", | ||
arrow::util::test_util::parquet_test_data(), | ||
); | ||
|
||
test_roundtrip_statistics(&path, 0); | ||
} | ||
|
||
#[test] | ||
fn test_roundtrip_statistics_crs() { | ||
let path = format!( | ||
"{}/geospatial/crs-default.parquet", | ||
arrow::util::test_util::parquet_test_data(), | ||
); | ||
|
||
test_roundtrip_statistics(&path, 0); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the main tests that I was personally using to ensure that this implementation matched the one in Arrow C++...they rewrite the geometry columns for the test files and ensure the statistics are identical.
/// Create a new [GeoStatsAccumulator] instance | ||
pub fn new_geo_stats_accumulator(descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator> { | ||
ACCUMULATOR_FACTORY | ||
.get_or_init(|| Arc::new(DefaultGeoStatsAccumulatorFactory::default())) | ||
.new_accumulator(descr) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only part that is used outside this file...my attempt to consolidate any implementation detail we have here with respect to whether this crate was or wasn't built with geospatial support.
/// Initialize the global [GeoStatsAccumulatorFactory] | ||
/// | ||
/// This may only be done once before any calls to [new_geo_stats_accumulator]. | ||
/// Clients may use this to implement support for builds of the Parquet crate without | ||
/// geospatial support or to implement support for Geography bounding using external | ||
/// dependencies. | ||
pub fn init_geo_stats_accumulator_factory( | ||
factory: Arc<dyn GeoStatsAccumulatorFactory>, | ||
) -> Result<(), ParquetError> { | ||
if ACCUMULATOR_FACTORY.set(factory).is_err() { | ||
Err(ParquetError::General( | ||
"Global GeoStatsAccumulatorFactory already set".to_string(), | ||
)) | ||
} else { | ||
Ok(()) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can take this out if it's too much...this is what I would need in SedonaDB to write files with geospatial stats for Geometry and Geography types (I am not sure if we can enable the geospatial feature on Parquet two levels of dependency deep). We also have the C++ dependencies there to write stats for Geography...while I'd love to rewrite that in Rust and put it in parquet-geospatial, I don't have time to do that today and the C++ dependency to do that (s2geometry) is kind of insane to build inside of a Rust crate.
#[cfg(feature = "geospatial")] | ||
#[test] | ||
fn test_geometry_accumulator() { | ||
use parquet_geospatial::testing::{wkb_point_xy, wkb_point_xyzm}; | ||
|
||
use crate::geospatial::bounding_box::BoundingBox; | ||
|
||
let mut accumulator = ParquetGeoStatsAccumulator::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These cases are also tested by way of ensuring our rewrite of the test files matches, but this test is more explicit and easier to debug.
|
||
/// Computes [GeospatialStatistics], if any, and resets internal state such that any internal | ||
/// accumulator is prepared to accumulate statistics for the next column chunk. | ||
fn flush_geospatial_statistics(&mut self) -> Option<Box<GeospatialStatistics>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the lowest impact place I could find to insert the GeospatialStatistics calculation. The ColumnMetrics are maybe a better fit but would require passing a reference to something through the various write methods.
/// Explicitly specify the Parquet schema to be used | ||
/// | ||
/// If omitted (the default), the [ArrowSchemaConverter] is used to compute the | ||
/// Parquet [SchemaDescriptor]. This may be used When the [SchemaDescriptor] is | ||
/// already known or must be calculated using custom logic. | ||
pub fn with_parquet_schema(self, schema_descr: SchemaDescriptor) -> Self { | ||
Self { | ||
schema_descr: Some(schema_descr), | ||
..self | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need this to test the Arrow ByteArrayEncoder, but it would also what somebody would need to write Geometry/Geography types generally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second pass. All style, no substance. I'll try to do a deeper dive tomorrow.
let geo_stats_accumulator = if matches!( | ||
descr.logical_type(), | ||
Some(LogicalType::Geometry) | Some(LogicalType::Geography) | ||
) { | ||
Some(new_geo_stats_accumulator(descr)) | ||
} else { | ||
None | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this pattern twice. I'm wondering if new_geo_stats_accumulator
could be try_new_geo_stats_accumulator
and perform the check on logical type internally, returning None
if not Geometry
or Geography
.
Which issue does this PR close?
Rationale for this change
One of the primary reasons the GeoParquet community was excited about first-class Parquet Geometry/Geography support was the built-in column chunk statistics (we had a workaround that involved adding a struct column, but it was difficult for non-spatial readers to use it and very difficult for non-spatial writers to write it). This PR ensures it is possible for arrow-rs to write files that include those statistics.
What changes are included in this PR?
This PR inserts the minimum required change to enable this support.
Are these changes tested?
Yes!
Are there any user-facing changes?
There are several new functions (which include documentation). Previously it was difficult or impossible to actually write Geometry or Geography logical types, and so it is unlikely any previous usage would be affected.