-
Notifications
You must be signed in to change notification settings - Fork 1k
[geo] add geospatial statistics and bbox types for parquet #8225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[geo] add geospatial statistics and bbox types for parquet #8225
Conversation
parquet/src/geospatial/statistics.rs
Outdated
| /// # Returns | ||
| /// | ||
| /// A new `BoundingBox` instance with the specified coordinates. | ||
| pub fn new(xmin: f64, ymin: f64, xmax: f64, ymax: f64, zmin: Option<f64>, zmax: Option<f64>, mmin: Option<f64>, mmax: Option<f64>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having this very verbose constructor with 8 arguments, I'd suggest having new() take only the required 4 values for the 2D box. Then have a with_zrange that takes in non-null zmin and zmax and have a with_mrange that takes in non-null mmin and mmax. This also ensures that it's impossible to define a null zmin with non-null zmax.
parquet/src/geospatial/statistics.rs
Outdated
| /// - 4: MultiPoint | ||
| /// - 5: MultiLineString | ||
| /// - 6: MultiPolygon | ||
| /// - 7: GeometryCollection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest also linking to https://github.com/apache/parquet-format/blob/ae39061f28d7c508a97af58a3c0a567352c8ea41/Geospatial.md#geospatial-types for the full allowed list of type ids.
parquet/src/geospatial/statistics.rs
Outdated
| // ---------------------------------------------------------------------- | ||
| // Bounding Box | ||
|
|
||
| /// Represents a 2D/3D bounding box with optional M-coordinate support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe best to just copy the full upstream docstring https://github.com/apache/parquet-format/blob/ae39061f28d7c508a97af58a3c0a567352c8ea41/Geospatial.md#bounding-box
parquet/src/geospatial/statistics.rs
Outdated
| /// # Returns | ||
| /// | ||
| /// A `GeospatialStatistics` instance with no bounding box or type information. | ||
| pub fn new_empty() -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe cleaner to remove this and just implement Default?
|
@alamb may have opinions on whether we should create a new feature flag for this? |
|
Thanks @kylebarron for the feedback - made some changes |
parquet/src/geospatial/statistics.rs
Outdated
| if bbox.zmin.is_some() && bbox.zmax.is_some() { | ||
| new_bbox = new_bbox.with_zrange(bbox.zmin.unwrap().into(), bbox.zmax.unwrap().into()); | ||
| } else if bbox.zmin.is_some() != bbox.zmax.is_some() { | ||
| return Err(ParquetError::General(format!("Z-coordinate values mismatch: {:?} and {:?}", bbox.zmin, bbox.zmax))); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you but maybe slightly easier to read if you match (bbox.zmin, bbox.zmax) instead of having these two if clauses.
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start!
I am still new to this reader, but I wonder if a reasonable endpoint for this PR would be a test that reads the example files in the parquet-testing repository and demonstrates that the Thrift statistics/logical type information can be accessed from the ParquetMetadata as exposed by the user-facing reader class.
I'm happy to iterate with you here on Parquet Spec/basic Rust issues (although Kyle and Andrew are the ones who actually know this particular reader!).
(Background: Kaushik and I met offline at Community over Code this year and he kindly reminded me about this PR!)
parquet/src/geospatial/statistics.rs
Outdated
| /// | ||
| /// Special cases: | ||
| /// - For X values only, xmin may exceed xmax. In this case, a point matches if x >= xmin OR x <= xmax | ||
| /// - This wraparound occurs when the bounding box crosses the antimeridian line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| /// - This wraparound occurs when the bounding box crosses the antimeridian line | |
| /// - This wraparound can occur when the bounding box crosses the antimeridian line |
parquet/src/geospatial/statistics.rs
Outdated
| mmax: Option<f64>, | ||
| } | ||
|
|
||
| impl BoundingBox { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this needs getters for the fields here (or they won't be accessible by code reading the metadata!)
parquet/src/geospatial/statistics.rs
Outdated
| /// * `bbox` - Optional bounding box defining the spatial extent | ||
| /// * `geospatial_types` - Optional list of geometry type identifiers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is important to explicitly specify what a bbox of None means (i.e., no information was provided), and what geospatial_types of None means (i.e., we don't have any information about the geometry types).
parquet/src/geospatial/statistics.rs
Outdated
| new_bbox = match (bbox.zmin, bbox.zmax) { | ||
| (Some(zmin), Some(zmax)) => new_bbox.with_zrange(zmin.into(), zmax.into()), | ||
| (None, None) => new_bbox, | ||
| _ => return Err(ParquetError::General("Z-coordinate values mismatch".to_string())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless there's some previous precedent (I don't know this reader very well), rather than error here, I think keeping the bbox.z/m/min/max values set to None is a better choice (i.e., don't pass on bad statistics but don't abort the entire read because some writer wrote questionable values).
parquet/src/geospatial/statistics.rs
Outdated
| } else { | ||
| None | ||
| }; | ||
| let geospatial_types = geo_stats.geospatial_types; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you have to be careful...if I remember correctly, if geo_stats.geospatial_types is empty, you have to set GeospatialStatistics.geospatial_types to None (i.e., the file did not provide information about the statistics).
| #[test] | ||
| fn test_geo_statistics_from_thrift() { | ||
| let stats = GeospatialStatistics::new(Some(BoundingBox::new(0.0, 0.0, 100.0, 100.0)), Some(vec![1, 2, 3])); | ||
| let thrift_stats = to_thrift(Some(&stats)); | ||
| assert_eq!(from_thrift(thrift_stats).unwrap(), Some(stats)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need tests to cover the branches for mismatched z/m values and for cases where the z/m values are present
parquet/src/geospatial/statistics.rs
Outdated
| assert_eq!(thrift_bbox.xmin, 0.0f64); | ||
| assert_eq!(thrift_bbox.ymin, 0.0f64); | ||
| assert_eq!(thrift_bbox.xmax, 100.0f64); | ||
| assert_eq!(thrift_bbox.ymax, 100.0f64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assert_eq!(thrift_bbox.xmin, 0.0f64); | |
| assert_eq!(thrift_bbox.ymin, 0.0f64); | |
| assert_eq!(thrift_bbox.xmax, 100.0f64); | |
| assert_eq!(thrift_bbox.ymax, 100.0f64); | |
| assert_eq!(thrift_bbox.xmin, 0.0); | |
| assert_eq!(thrift_bbox.ymin, 0.0); | |
| assert_eq!(thrift_bbox.xmax, 100.0); | |
| assert_eq!(thrift_bbox.ymax, 100.0); |
(I've never seen that needed in a test...was it required?)
parquet/src/geospatial/statistics.rs
Outdated
| /// Minimum X coordinate (longitude or easting) | ||
| xmin: f64, | ||
| /// Minimum Y coordinate (latitude or northing) | ||
| ymin: f64, | ||
| /// Maximum X coordinate (longitude or easting) | ||
| xmax: f64, | ||
| /// Maximum Y coordinate (latitude or northing) | ||
| ymax: f64, | ||
| /// Minimum Z coordinate (elevation/height), if present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that in Thrift these are all separate, although they may be more ergonomic to use as:
x_range: (f64, f64),
y_range: (f64, f64),
z_range: Option<(f64, f64)>,
m_range: Option<(f64, f64)>,...which would eliminate the need for a caller to do the same type of checking with respect to the case where z/mmin was specified but z/mmax was not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of using tuples here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with either here (they are not pub and the accessors all need to be there anyway). This really is how the object is defined for Thrift.
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for cleaning this up!
Would a reasonable next step be to add a test reading one of the example files in parquet-testing/data/geospatial? It looks like there is a function that can get you the path to that directory here:
arrow-rs/arrow/src/util/test_util.rs
Lines 85 to 100 in 7efb395
| /// Returns the parquest test data directory, which is by default | |
| /// stored in a git submodule rooted at | |
| /// `arrow/parquest-testing/data`. | |
| /// | |
| /// The default can be overridden by the optional environment variable | |
| /// `PARQUET_TEST_DATA` | |
| /// | |
| /// panics when the directory can not be found. | |
| /// | |
| /// Example: | |
| /// ``` | |
| /// let testdata = arrow::util::test_util::parquet_test_data(); | |
| /// let filename = format!("{}/binary.parquet", testdata); | |
| /// assert!(std::path::PathBuf::from(filename).exists()); | |
| /// ``` | |
| pub fn parquet_test_data() -> String { |
A good file to start with is probably https://github.com/apache/parquet-testing/blob/a3d96a65e11e2bbca7d22a894e8313ede90a33a3/data/geospatial/geospatial.parquet .
Here is an example of what we did for variant: https://github.com/apache/arrow-rs/blob/main/parquet/tests/variant_integration.rs |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kaushiksrini and @paleolimbot -- this is looking quite close to me
I think it is important to minimize the size of the ParquetMetaData structure before merging this (I left a suggestion on how to do so)
As this would be a public API change, I also think it is important to add at least one "read existing file with statistics tests" as @paleolimbot suggested, to show that we have the round trip data structures working.
cc @etseidl as this adds the geospatial statistics / thrift support which is relevant to the thrift-remodel working that is ongoing
parquet/src/file/metadata/mod.rs
Outdated
|
|
||
| #[cfg(not(feature = "encryption"))] | ||
| let base_expected_size = 2312; | ||
| let base_expected_size = 2792; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a fairly non trivial increase in size (as the parquet metadata size is already a challenge for many users)
Can we please reduce this size by Boxing the Geo statistics? Something like
instead of
geo_statistics: Option<geo_statistics::GeospatialStatistics>,Make it like
geo_statistics: Option<Box<geo_statistics::GeospatialStatistics>>,(that would add just 8-16 bytes for another pointer rather than the size of the inlined structure)
| } | ||
|
|
||
| #[test] | ||
| fn test_bounding_box_to_thrift() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another good set of tests for thrift encoding/decoding would be "roundtrip" tests that create a thrift structure -> rust --> thrift and ensure all fields survive the round trip
|
Thank you for the feedback @alamb and @paleolimbot! I’ve made the suggested changes — please let me know if there’s anything else I should address. |
paleolimbot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
This seems reasonable from my end (but I'm new here and Andrew may have some comments!)
parquet/src/file/metadata/mod.rs
Outdated
| /// Returns geospatial statistics that are set for this column chunk, | ||
| /// or `None` if no geospatial statistics are available. | ||
| pub fn geo_statistics(&self) -> Option<&geo_statistics::GeospatialStatistics> { | ||
| self.geo_statistics.as_ref().map(|boxed| boxed.as_ref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self.geo_statistics.as_ref().map(|boxed| boxed.as_ref()) | |
| self.geo_statistics.as_deref() |
| size_statistics, | ||
| geospatial_statistics: None, | ||
| geospatial_statistics: geo_statistics::to_thrift( | ||
| self.geo_statistics.as_ref().map(|boxed| boxed.as_ref()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self.geo_statistics.as_ref().map(|boxed| boxed.as_ref()), | |
| self.geo_statistics.as_deref(), |
parquet/src/file/metadata/mod.rs
Outdated
|
|
||
| /// Sets geospatial statistics for this column chunk. | ||
| pub fn set_geo_statistics(mut self, value: geo_statistics::GeospatialStatistics) -> Self { | ||
| self.0.geo_statistics = Some(Box::new(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want the Box only to be an implementation detail? If the user already has a GeospatialStatistics, then this has to be copied in Box::new?
I'm just wondering whether there's any value in having the API take in a Box<GeospatialStatistics> instead of boxing it inside this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left it as is since I felt the user doesn't necessarily need to know the memory-level implementation details, and there's no performance difference (Box::new moves rather than copies). However, open to suggestions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point would be if a user already has a Box<GeospatialStatistics> they could pass it in which would be a pointer copy, rather than this which might have to move data from the stack to the heap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I changed it to a take in a heap allocated object
| /// Converts our internal `BoundingBox` to the Thrift-generated format. | ||
| fn from(b: BoundingBox) -> parquet::BoundingBox { | ||
| parquet::BoundingBox { | ||
| xmin: b.get_xmin().into(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny nit: since we have access to the private attributes here, it might be slightly more readable to use those, but up to you:
| xmin: b.get_xmin().into(), | |
| xmin: b.xmin.into(), |
|
|
||
| new_bbox = match (bbox.zmin, bbox.zmax) { | ||
| (Some(zmin), Some(zmax)) => new_bbox.with_zrange(zmin.into(), zmax.into()), | ||
| // If both None or mismatch, set it to None and don't error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and below, slight wording suggestion
| // If both None or mismatch, set it to None and don't error | |
| // If either None or mismatch, leave it as None and don't error |
parquet/src/geospatial/mod.rs
Outdated
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| //! This module provides functionality for working with geospatial data in Parquet file as defied in the [spec][parquet-geo-spec]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| //! This module provides functionality for working with geospatial data in Parquet file as defied in the [spec][parquet-geo-spec]. | |
| //! This module provides functionality for working with geospatial data in Parquet file as defined in the [spec][parquet-geo-spec]. |
| /// Optional bounding defining the spatial extent, where None represents a lack of information. | ||
| bbox: Option<BoundingBox>, | ||
| /// Optional list of geometry type identifiers, where None represents lack of information | ||
| geospatial_types: Option<Vec<i32>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the future we may want a more strongly-typed enum than a bare i32, but this is fine for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More strongly-typed enums are possibly the work of the geoarrow-rs and Sedonas of the world (where interested clients can convert the identifiers into nice structures if they need to). I think here we're just trying to provide the bare minimum for other libraries to be able to access the statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is not a public field, we can also potentially change it in the future too, though I do see it is part of the constructor
In my opinion it is fine to leave like this and we can adjust / update in a future release
parquet/src/geospatial/statistics.rs
Outdated
| /// Minimum X coordinate (longitude or easting) | ||
| xmin: f64, | ||
| /// Minimum Y coordinate (latitude or northing) | ||
| ymin: f64, | ||
| /// Maximum X coordinate (longitude or easting) | ||
| xmax: f64, | ||
| /// Maximum Y coordinate (latitude or northing) | ||
| ymax: f64, | ||
| /// Minimum Z coordinate (elevation/height), if present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of using tuples here
|
Thanks @kylebarron for the review! Implemented the suggestions |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @kaushiksrini and @paleolimbot and @kylebarron
There are a few more comments to resolve on this PR I think but nothing we couldn't do as a follow on PR
Please let me know when it is ready to merge and I'll do so!
parquet/src/file/metadata/mod.rs
Outdated
|
|
||
| /// Sets geospatial statistics for this column chunk. | ||
| pub fn set_geo_statistics(mut self, value: geo_statistics::GeospatialStatistics) -> Self { | ||
| self.0.geo_statistics = Some(Box::new(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the point would be if a user already has a Box<GeospatialStatistics> they could pass it in which would be a pointer copy, rather than this which might have to move data from the stack to the heap
| /// Optional bounding defining the spatial extent, where None represents a lack of information. | ||
| bbox: Option<BoundingBox>, | ||
| /// Optional list of geometry type identifiers, where None represents lack of information | ||
| geospatial_types: Option<Vec<i32>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is not a public field, we can also potentially change it in the future too, though I do see it is part of the constructor
In my opinion it is fine to leave like this and we can adjust / update in a future release
parquet/src/geospatial/statistics.rs
Outdated
| /// Converts a Thrift-generated geospatial statistics object to the internal representation. | ||
| pub fn from_thrift( | ||
| geo_statistics: Option<TGeospatialStatistics>, | ||
| ) -> Result<Option<GeospatialStatistics>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is infallable (always returns Ok) maybe we could make the from_thrift function just return Option
That would also let you simplify the body if you wanted
pub fn from_thrift(
geo_statistics: Option<TGeospatialStatistics>,
) -> Option<GeospatialStatistics> {
let geo_stats = geo_statistics?;
let bbox = geo_stats.bbox.map(|bbox| bbox.into());
// If vector is empty, then set it to None
let geospatial_types: Option<Vec<i32>> =
geo_stats.geospatial_types.filter(|v| !v.is_empty());
Some(GeospatialStatistics::new(bbox, geospatial_types))
}|
I think we have bikeshed this one enough -- than you @kaushiksrini and @kylebarron @BlakeOrth and @paleolimbot |
Which issue does this PR close?
GeospatialStatistics.Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.