Skip to content

Conversation

@kaushiksrini
Copy link
Contributor

@kaushiksrini kaushiksrini commented Aug 26, 2025

Which issue does this PR close?

Rationale for this change

  • This is part of a draft to support geospatial types (geometry and geography) in Parquet. This has been

What changes are included in this PR?

  • Structs for supporting geospatial statistics information (bbox and geospatial types) derived from thrift classes.
  • Would appreciate feedback on structure and where certain parts should go.

Are these changes tested?

Yes

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@kaushiksrini kaushiksrini marked this pull request as draft August 26, 2025 01:58
@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 26, 2025
/// # Returns
///
/// A new `BoundingBox` instance with the specified coordinates.
pub fn new(xmin: f64, ymin: f64, xmax: f64, ymax: f64, zmin: Option<f64>, zmax: Option<f64>, mmin: Option<f64>, mmax: Option<f64>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this very verbose constructor with 8 arguments, I'd suggest having new() take only the required 4 values for the 2D box. Then have a with_zrange that takes in non-null zmin and zmax and have a with_mrange that takes in non-null mmin and mmax. This also ensures that it's impossible to define a null zmin with non-null zmax.

/// - 4: MultiPoint
/// - 5: MultiLineString
/// - 6: MultiPolygon
/// - 7: GeometryCollection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// ----------------------------------------------------------------------
// Bounding Box

/// Represents a 2D/3D bounding box with optional M-coordinate support.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// # Returns
///
/// A `GeospatialStatistics` instance with no bounding box or type information.
pub fn new_empty() -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe cleaner to remove this and just implement Default?

@kylebarron
Copy link
Member

@alamb may have opinions on whether we should create a new feature flag for this?

@kaushiksrini
Copy link
Contributor Author

Thanks @kylebarron for the feedback - made some changes

Comment on lines 207 to 211
if bbox.zmin.is_some() && bbox.zmax.is_some() {
new_bbox = new_bbox.with_zrange(bbox.zmin.unwrap().into(), bbox.zmax.unwrap().into());
} else if bbox.zmin.is_some() != bbox.zmax.is_some() {
return Err(ParquetError::General(format!("Z-coordinate values mismatch: {:?} and {:?}", bbox.zmin, bbox.zmax)));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you but maybe slightly easier to read if you match (bbox.zmin, bbox.zmax) instead of having these two if clauses.

@kaushiksrini kaushiksrini marked this pull request as ready for review August 27, 2025 02:34
@kaushiksrini kaushiksrini changed the title [draft] add geospatial statistics and bbox types for parquet [geo] add geospatial statistics and bbox types for parquet Aug 27, 2025
@kaushiksrini kaushiksrini marked this pull request as draft August 27, 2025 02:34
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start!

I am still new to this reader, but I wonder if a reasonable endpoint for this PR would be a test that reads the example files in the parquet-testing repository and demonstrates that the Thrift statistics/logical type information can be accessed from the ParquetMetadata as exposed by the user-facing reader class.

I'm happy to iterate with you here on Parquet Spec/basic Rust issues (although Kyle and Andrew are the ones who actually know this particular reader!).

(Background: Kaushik and I met offline at Community over Code this year and he kindly reminded me about this PR!)

///
/// Special cases:
/// - For X values only, xmin may exceed xmax. In this case, a point matches if x >= xmin OR x <= xmax
/// - This wraparound occurs when the bounding box crosses the antimeridian line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// - This wraparound occurs when the bounding box crosses the antimeridian line
/// - This wraparound can occur when the bounding box crosses the antimeridian line

mmax: Option<f64>,
}

impl BoundingBox {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this needs getters for the fields here (or they won't be accessible by code reading the metadata!)

Comment on lines 161 to 162
/// * `bbox` - Optional bounding box defining the spatial extent
/// * `geospatial_types` - Optional list of geometry type identifiers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is important to explicitly specify what a bbox of None means (i.e., no information was provided), and what geospatial_types of None means (i.e., we don't have any information about the geometry types).

new_bbox = match (bbox.zmin, bbox.zmax) {
(Some(zmin), Some(zmax)) => new_bbox.with_zrange(zmin.into(), zmax.into()),
(None, None) => new_bbox,
_ => return Err(ParquetError::General("Z-coordinate values mismatch".to_string())),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there's some previous precedent (I don't know this reader very well), rather than error here, I think keeping the bbox.z/m/min/max values set to None is a better choice (i.e., don't pass on bad statistics but don't abort the entire read because some writer wrote questionable values).

} else {
None
};
let geospatial_types = geo_stats.geospatial_types;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you have to be careful...if I remember correctly, if geo_stats.geospatial_types is empty, you have to set GeospatialStatistics.geospatial_types to None (i.e., the file did not provide information about the statistics).

Comment on lines 287 to 292
#[test]
fn test_geo_statistics_from_thrift() {
let stats = GeospatialStatistics::new(Some(BoundingBox::new(0.0, 0.0, 100.0, 100.0)), Some(vec![1, 2, 3]));
let thrift_stats = to_thrift(Some(&stats));
assert_eq!(from_thrift(thrift_stats).unwrap(), Some(stats));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need tests to cover the branches for mismatched z/m values and for cases where the z/m values are present

Comment on lines 298 to 301
assert_eq!(thrift_bbox.xmin, 0.0f64);
assert_eq!(thrift_bbox.ymin, 0.0f64);
assert_eq!(thrift_bbox.xmax, 100.0f64);
assert_eq!(thrift_bbox.ymax, 100.0f64);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert_eq!(thrift_bbox.xmin, 0.0f64);
assert_eq!(thrift_bbox.ymin, 0.0f64);
assert_eq!(thrift_bbox.xmax, 100.0f64);
assert_eq!(thrift_bbox.ymax, 100.0f64);
assert_eq!(thrift_bbox.xmin, 0.0);
assert_eq!(thrift_bbox.ymin, 0.0);
assert_eq!(thrift_bbox.xmax, 100.0);
assert_eq!(thrift_bbox.ymax, 100.0);

(I've never seen that needed in a test...was it required?)

Comment on lines 74 to 82
/// Minimum X coordinate (longitude or easting)
xmin: f64,
/// Minimum Y coordinate (latitude or northing)
ymin: f64,
/// Maximum X coordinate (longitude or easting)
xmax: f64,
/// Maximum Y coordinate (latitude or northing)
ymax: f64,
/// Minimum Z coordinate (elevation/height), if present
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that in Thrift these are all separate, although they may be more ergonomic to use as:

    x_range: (f64, f64),
    y_range: (f64, f64),
    z_range: Option<(f64, f64)>,
    m_range: Option<(f64, f64)>,

...which would eliminate the need for a caller to do the same type of checking with respect to the case where z/mmin was specified but z/mmax was not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of using tuples here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with either here (they are not pub and the accessors all need to be there anyway). This really is how the object is defined for Thrift.

@kaushiksrini kaushiksrini marked this pull request as ready for review September 19, 2025 16:47
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for cleaning this up!

Would a reasonable next step be to add a test reading one of the example files in parquet-testing/data/geospatial? It looks like there is a function that can get you the path to that directory here:

/// Returns the parquest test data directory, which is by default
/// stored in a git submodule rooted at
/// `arrow/parquest-testing/data`.
///
/// The default can be overridden by the optional environment variable
/// `PARQUET_TEST_DATA`
///
/// panics when the directory can not be found.
///
/// Example:
/// ```
/// let testdata = arrow::util::test_util::parquet_test_data();
/// let filename = format!("{}/binary.parquet", testdata);
/// assert!(std::path::PathBuf::from(filename).exists());
/// ```
pub fn parquet_test_data() -> String {

A good file to start with is probably https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/data/geospatial/geospatial.parquet .

@alamb
Copy link
Contributor

alamb commented Sep 21, 2025

Would a reasonable next step be to add a test reading one of the example files in parquet-testing/data/geospatial? It looks like there is a function that can get you the path to that directory here:

Here is an example of what we did for variant: https://github.com/apache/arrow-rs/blob/main/parquet/tests/variant_integration.rs

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kaushiksrini and @paleolimbot -- this is looking quite close to me

I think it is important to minimize the size of the ParquetMetaData structure before merging this (I left a suggestion on how to do so)

As this would be a public API change, I also think it is important to add at least one "read existing file with statistics tests" as @paleolimbot suggested, to show that we have the round trip data structures working.

cc @etseidl as this adds the geospatial statistics / thrift support which is relevant to the thrift-remodel working that is ongoing


#[cfg(not(feature = "encryption"))]
let base_expected_size = 2312;
let base_expected_size = 2792;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a fairly non trivial increase in size (as the parquet metadata size is already a challenge for many users)

Can we please reduce this size by Boxing the Geo statistics? Something like

instead of

    geo_statistics: Option<geo_statistics::GeospatialStatistics>,

Make it like

    geo_statistics: Option<Box<geo_statistics::GeospatialStatistics>>,

(that would add just 8-16 bytes for another pointer rather than the size of the inlined structure)

}

#[test]
fn test_bounding_box_to_thrift() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another good set of tests for thrift encoding/decoding would be "roundtrip" tests that create a thrift structure -> rust --> thrift and ensure all fields survive the round trip

@kaushiksrini
Copy link
Contributor Author

Thank you for the feedback @alamb and @paleolimbot! I’ve made the suggested changes — please let me know if there’s anything else I should address.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

This seems reasonable from my end (but I'm new here and Andrew may have some comments!)

/// Returns geospatial statistics that are set for this column chunk,
/// or `None` if no geospatial statistics are available.
pub fn geo_statistics(&self) -> Option<&geo_statistics::GeospatialStatistics> {
self.geo_statistics.as_ref().map(|boxed| boxed.as_ref())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.geo_statistics.as_ref().map(|boxed| boxed.as_ref())
self.geo_statistics.as_deref()

size_statistics,
geospatial_statistics: None,
geospatial_statistics: geo_statistics::to_thrift(
self.geo_statistics.as_ref().map(|boxed| boxed.as_ref()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.geo_statistics.as_ref().map(|boxed| boxed.as_ref()),
self.geo_statistics.as_deref(),


/// Sets geospatial statistics for this column chunk.
pub fn set_geo_statistics(mut self, value: geo_statistics::GeospatialStatistics) -> Self {
self.0.geo_statistics = Some(Box::new(value));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the Box only to be an implementation detail? If the user already has a GeospatialStatistics, then this has to be copied in Box::new?

I'm just wondering whether there's any value in having the API take in a Box<GeospatialStatistics> instead of boxing it inside this function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left it as is since I felt the user doesn't necessarily need to know the memory-level implementation details, and there's no performance difference (Box::new moves rather than copies). However, open to suggestions!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point would be if a user already has a Box<GeospatialStatistics> they could pass it in which would be a pointer copy, rather than this which might have to move data from the stack to the heap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I changed it to a take in a heap allocated object

/// Converts our internal `BoundingBox` to the Thrift-generated format.
fn from(b: BoundingBox) -> parquet::BoundingBox {
parquet::BoundingBox {
xmin: b.get_xmin().into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit: since we have access to the private attributes here, it might be slightly more readable to use those, but up to you:

Suggested change
xmin: b.get_xmin().into(),
xmin: b.xmin.into(),


new_bbox = match (bbox.zmin, bbox.zmax) {
(Some(zmin), Some(zmax)) => new_bbox.with_zrange(zmin.into(), zmax.into()),
// If both None or mismatch, set it to None and don't error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below, slight wording suggestion

Suggested change
// If both None or mismatch, set it to None and don't error
// If either None or mismatch, leave it as None and don't error

// specific language governing permissions and limitations
// under the License.

//! This module provides functionality for working with geospatial data in Parquet file as defied in the [spec][parquet-geo-spec].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! This module provides functionality for working with geospatial data in Parquet file as defied in the [spec][parquet-geo-spec].
//! This module provides functionality for working with geospatial data in Parquet file as defined in the [spec][parquet-geo-spec].

/// Optional bounding defining the spatial extent, where None represents a lack of information.
bbox: Option<BoundingBox>,
/// Optional list of geometry type identifiers, where None represents lack of information
geospatial_types: Option<Vec<i32>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the future we may want a more strongly-typed enum than a bare i32, but this is fine for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More strongly-typed enums are possibly the work of the geoarrow-rs and Sedonas of the world (where interested clients can convert the identifiers into nice structures if they need to). I think here we're just trying to provide the bare minimum for other libraries to be able to access the statistics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not a public field, we can also potentially change it in the future too, though I do see it is part of the constructor

In my opinion it is fine to leave like this and we can adjust / update in a future release

Comment on lines 74 to 82
/// Minimum X coordinate (longitude or easting)
xmin: f64,
/// Minimum Y coordinate (latitude or northing)
ymin: f64,
/// Maximum X coordinate (longitude or easting)
xmax: f64,
/// Maximum Y coordinate (latitude or northing)
ymax: f64,
/// Minimum Z coordinate (elevation/height), if present
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of using tuples here

@kaushiksrini
Copy link
Contributor Author

Thanks @kylebarron for the review! Implemented the suggestions

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kaushiksrini and @paleolimbot and @kylebarron

There are a few more comments to resolve on this PR I think but nothing we couldn't do as a follow on PR

Please let me know when it is ready to merge and I'll do so!


/// Sets geospatial statistics for this column chunk.
pub fn set_geo_statistics(mut self, value: geo_statistics::GeospatialStatistics) -> Self {
self.0.geo_statistics = Some(Box::new(value));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the point would be if a user already has a Box<GeospatialStatistics> they could pass it in which would be a pointer copy, rather than this which might have to move data from the stack to the heap

/// Optional bounding defining the spatial extent, where None represents a lack of information.
bbox: Option<BoundingBox>,
/// Optional list of geometry type identifiers, where None represents lack of information
geospatial_types: Option<Vec<i32>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not a public field, we can also potentially change it in the future too, though I do see it is part of the constructor

In my opinion it is fine to leave like this and we can adjust / update in a future release

/// Converts a Thrift-generated geospatial statistics object to the internal representation.
pub fn from_thrift(
geo_statistics: Option<TGeospatialStatistics>,
) -> Result<Option<GeospatialStatistics>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is infallable (always returns Ok) maybe we could make the from_thrift function just return Option

That would also let you simplify the body if you wanted

pub fn from_thrift(
    geo_statistics: Option<TGeospatialStatistics>,
) -> Option<GeospatialStatistics> {
    let geo_stats = geo_statistics?;
    let bbox = geo_stats.bbox.map(|bbox| bbox.into());
    // If vector is empty, then set it to None
    let geospatial_types: Option<Vec<i32>> =
        geo_stats.geospatial_types.filter(|v| !v.is_empty());
    Some(GeospatialStatistics::new(bbox, geospatial_types))
}

@alamb
Copy link
Contributor

alamb commented Sep 28, 2025

I think we have bikeshed this one enough -- than you @kaushiksrini and @kylebarron @BlakeOrth and @paleolimbot

@alamb alamb merged commit 4ba9a25 into apache:main Sep 28, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants