Skip to content

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Oct 1, 2025

Which issue does this PR close?

Rationale for this change

One of the primary reasons the GeoParquet community was excited about first-class Parquet Geometry/Geography support was the built-in column chunk statistics (we had a workaround that involved adding a struct column, but it was difficult for non-spatial readers to use it and very difficult for non-spatial writers to write it). This PR ensures it is possible for arrow-rs to write files that include those statistics.

What changes are included in this PR?

This PR inserts the minimum required change to enable this support.

Are these changes tested?

Yes!

Are there any user-facing changes?

There are several new functions (which include documentation). Previously it was difficult or impossible to actually write Geometry or Geography logical types, and so it is unlikely any previous usage would be affected.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 1, 2025
Copy link
Member Author

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works!

@alamb @etseidl I'm aware this would need some tests/improved documentation at a lower level; however, I'd love some feedback on the approach before I go through and clean this up more thoroughly (whenever time allows!)

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @paleolimbot, looks pretty good on a first pass. I just want to make sure that the size statistics are written properly when geo stats are enabled.

Comment on lines 166 to 168
if let Some(var_bytes) = T::T::variable_length_bytes(slice) {
*self.variable_length_bytes.get_or_insert(0) += var_bytes;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should execute regardless of whether geo stats are enabled. The variable_length_bytes are ultimately written to the SizeStatistics which are useful even without min/max statistics.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

drop(file_writer);

// Check that statistics exist in thrift output
thrift_metadata.row_groups[0].columns[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up that when the thrift stuff merges this will no longer be a format::FileMetaData but file::metadata::ParquetMetaData.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I removed these assertions so that they won't break when the thrift stuff merges (although there will be a few logical type constructors that will need to be updated).

@paleolimbot
Copy link
Member Author

Thank you for the review! I will clean this up on Monday and add a few more tests.

Copy link
Member Author

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I think this is ready for review!

Comment on lines +186 to +214
#[test]
fn test_roundtrip_statistics_geospatial() {
let path = format!(
"{}/geospatial/geospatial.parquet",
arrow::util::test_util::parquet_test_data(),
);

test_roundtrip_statistics(&path, 2);
}

#[test]
fn test_roundtrip_geospatial_with_nan() {
let path = format!(
"{}/geospatial/geospatial-with-nan.parquet",
arrow::util::test_util::parquet_test_data(),
);

test_roundtrip_statistics(&path, 0);
}

#[test]
fn test_roundtrip_statistics_crs() {
let path = format!(
"{}/geospatial/crs-default.parquet",
arrow::util::test_util::parquet_test_data(),
);

test_roundtrip_statistics(&path, 0);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the main tests that I was personally using to ensure that this implementation matched the one in Arrow C++...they rewrite the geometry columns for the test files and ensure the statistics are identical.

Comment on lines 27 to 32
/// Create a new [GeoStatsAccumulator] instance
pub fn new_geo_stats_accumulator(descr: &ColumnDescPtr) -> Box<dyn GeoStatsAccumulator> {
ACCUMULATOR_FACTORY
.get_or_init(|| Arc::new(DefaultGeoStatsAccumulatorFactory::default()))
.new_accumulator(descr)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only part that is used outside this file...my attempt to consolidate any implementation detail we have here with respect to whether this crate was or wasn't built with geospatial support.

Comment on lines 34 to 50
/// Initialize the global [GeoStatsAccumulatorFactory]
///
/// This may only be done once before any calls to [new_geo_stats_accumulator].
/// Clients may use this to implement support for builds of the Parquet crate without
/// geospatial support or to implement support for Geography bounding using external
/// dependencies.
pub fn init_geo_stats_accumulator_factory(
factory: Arc<dyn GeoStatsAccumulatorFactory>,
) -> Result<(), ParquetError> {
if ACCUMULATOR_FACTORY.set(factory).is_err() {
Err(ParquetError::General(
"Global GeoStatsAccumulatorFactory already set".to_string(),
))
} else {
Ok(())
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can take this out if it's too much...this is what I would need in SedonaDB to write files with geospatial stats for Geometry and Geography types (I am not sure if we can enable the geospatial feature on Parquet two levels of dependency deep). We also have the C++ dependencies there to write stats for Geography...while I'd love to rewrite that in Rust and put it in parquet-geospatial, I don't have time to do that today and the C++ dependency to do that (s2geometry) is kind of insane to build inside of a Rust crate.

Comment on lines +277 to +284
#[cfg(feature = "geospatial")]
#[test]
fn test_geometry_accumulator() {
use parquet_geospatial::testing::{wkb_point_xy, wkb_point_xyzm};

use crate::geospatial::bounding_box::BoundingBox;

let mut accumulator = ParquetGeoStatsAccumulator::default();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These cases are also tested by way of ensuring our rewrite of the test files matches, but this test is more explicit and easier to debug.

Comment on lines 126 to 129

/// Computes [GeospatialStatistics], if any, and resets internal state such that any internal
/// accumulator is prepared to accumulate statistics for the next column chunk.
fn flush_geospatial_statistics(&mut self) -> Option<Box<GeospatialStatistics>>;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the lowest impact place I could find to insert the GeospatialStatistics calculation. The ColumnMetrics are maybe a better fit but would require passing a reference to something through the various write methods.

Comment on lines 503 to 513
/// Explicitly specify the Parquet schema to be used
///
/// If omitted (the default), the [ArrowSchemaConverter] is used to compute the
/// Parquet [SchemaDescriptor]. This may be used When the [SchemaDescriptor] is
/// already known or must be calculated using custom logic.
pub fn with_parquet_schema(self, schema_descr: SchemaDescriptor) -> Self {
Self {
schema_descr: Some(schema_descr),
..self
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need this to test the Arrow ByteArrayEncoder, but it would also what somebody would need to write Geometry/Geography types generally.

@paleolimbot paleolimbot marked this pull request as ready for review October 7, 2025 21:30
@paleolimbot paleolimbot requested a review from kylebarron October 7, 2025 21:49
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second pass. All style, no substance. I'll try to do a deeper dive tomorrow.

Comment on lines 453 to 460
let geo_stats_accumulator = if matches!(
descr.logical_type(),
Some(LogicalType::Geometry) | Some(LogicalType::Geography)
) {
Some(new_geo_stats_accumulator(descr))
} else {
None
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this pattern twice. I'm wondering if new_geo_stats_accumulator could be try_new_geo_stats_accumulator and perform the check on logical type internally, returning None if not Geometry or Geography.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support writing GeospatialStatistics in Parquet writer
2 participants