The h3-indexer is an open source package for indexing geospatial data using PySpark, Apache Sedona and Uber's open source H3 hierarchical spatial indexing system. The h3-indexer maps any number of vector-type geospatial data sets to H3 grids for efficient spatial analysis and querying.
The h3-indexer contains 3 stages, and users can provide command line arguments to run the stages one at a time, or altogether.
- Supports vector point, line & polygon data types
- Supports the following inputs:
- Parquet files & shapefiles in AWS S3
- AWS Glue catalog tables
- Outputs are written to AWS S3 in parquet format
- Configurable H3 resolution (3-10)
- PySpark for H3 indexing operations using Apache Sedona
- YAML & JSON-based configuration supported
The developer setup is currently compatible with ARM64 only (not compatible with x86_64).
This tool requires the following versions:
- Python: 3.10
- AWS Glue: 4.0
- Spark: 3.3.0
- Apache Sedona: 1.7.1
Run the following commands from inside the h3-indexer root directory to set up your run environment:
chmod +x scripts/env_setup.sh
source ./scripts/env_setup.sh
When this executable finishes running, it will print out the environment variable paths that you should set in your .env file - for example:
Your SPARK_HOME path is <path>
Your GLUE_JARS path is <path>
Your JAVA_HOME path is <path>
You need to include the following environment variables in a .env file. You can convert the .sample.env file to a .env file and update it accordingly. You can get the paths for SPARK_HOME
, GLUE_JARS
and JAVA_HOME
from the env_setup.sh
file. You will need to add the Python interpreter path to the 3 Python environment variables.
- PYTHONPATH
- PYSPARK_PYTHON
- PYSPARK_DRIVER_PYTHON
- SPARK_HOME
- GLUE_JARS
- JAVA_HOME
Required version is Python 3.10. You will need to install the Python libraries in the requirements.txt file and update the Python environment variables to the path of your interpreter.
Here are example commands for Python environment setup using conda:
conda create -n h3 python=3.10
conda activate h3
First, with the h3
python environment activated, install aws-glue-libs:
cd ~/h3-indexer-env/aws-glue-libs
python -m pip install -e .
Then, cd back into the h3-indexer repository and run:
conda install -c conda-forge --file requirements.txt
Ensure that all requirements installed successfully by running (with the h3
python environment activated):
python
>> import awsglue
>> import pyspark
You'll need to have proper credentials and permissions set up to access the files in the S3 bucket or the tables in the AWS Glue Catalog that are included in your config.
There is a sample_job.yaml in the configs directory. Here's a sample launch.json file to use for debugging with a test config:
{
"name": "main",
"type": "debugpy",
"request": "launch",
"program": "path-to/h3-indexer/src/h3-indexer/src/main.py",
"console": "integratedTerminal",
"args": ["--yaml-path", "path-to/h3-indexer/src/h3-indexer/configs/sample_job.yaml"],
"envFile": "path-to/h3-indexer/src/h3-indexer/.env"
}
Run the H3 Indexer with a YAML configuration file:
python main.py --yaml-path configs/sample_job.yaml --run-all
The following args are available:
Inputs:
--yaml-path
: File path to the yaml.
--json-input
: JSON input.
Run Type:
--validate-only
: Only validate the input config. (Validate)
--index-only
: Only validate the input config and run the indexer. (Validate -> Index)
--run-all
: Optional flag, this is the defauly behavior. Run all. (Validate -> Index -> Resolve)
The YAML configuration file defines the job parameters including input data sources, H3 resolution, and output location.
name
: Project name (string)version
: Version number in semantic format #.#.# (string)h3_resolution
: H3 resolution level, supported range 3-10 (integer)output_s3_path
: AWS S3 path for output storage (string)inputs
: Dictionary of input data sources. Each input in theinputs
dictionary can either be a Vector or Raster input type. See below for more details on required parameters for each input type.
For vector inputs (type: "vector"):
type
: Must be "vector"glue_catalog_database_name
: AWS Glue Catalog database name - must be provided withglue_catalog_table_name
.glue_catalog_table_name
: AWS Glue Catalog table name - must be provided withglue_catalog_database_name
.where_clause
: SQL where clause to filter the AWS Glue catalog table. Only applicable with AWS Glue catalog source.s3_path
: Path to input data in AWS S3unique_id
: Column name containing unique identifiergeometry_type
: Type of geometry ("POINT", "LINE", or "POLYGON")geometry_column_name
: Name of the geometry column.method
: Processing method ("PCT_LENGTH" for lines, "PCT_AREA" for polygons)input_columns
: List of columns to include in output
When providing an AWS Glue Catalog table, both glue_catalog_database_name
and glue_catalog_table_name
must be provided. These variables are mutually exclusive to s3_path
. So when providing an AWS S3 path, you cannot also include AWS Glue Catalog parameters. And when providing AWS Glue Catalog parameters, you cannot also include an AWS S3 path.
For POINT inputs only, the geometry_column_name
parameter can be replaced with these 2 parameters:
lat_column_name
: Name of the latitude column.lon_column_name
: Name of the longitude column.
Raster data types are not currently enabled in the service. Raster pixels can be converted to points/centroids and then provided as an input data set to the indexer.
The Validator will fail if the following requirements are not met by the inputs:
- The
unique_id
column must, in fact, be unique or the validator will fail. The H3 indexer doesn't yet support an input that contains more than 1 column as the primary key. - The geometry column (
geometry_column_name
) must be of the following data types: WKB, WKT, Geometry, GeoJSON. - If your input is a POINT geometry type, and you elected to use the
lat_column_name
+lon_column_name
parameters in place of thegeometry_column_name
parameter, those columns must be Float or Double type. - The columns in
input_columns
must be numeric (Integer, Float, Double, etc). The H3 Indexer does not yet support categorical or label-based (String) column data types. - Invalid or NULL geometries are dropped from the dataset.
As yaml:
name: "project_name"
version: "1.0.0"
h3_resolution: 6
output_s3_path: "s3://my-bucket/output/"
inputs:
mypoints:
type: "vector"
s3_path: "s3://my-bucket/my_points.parquet"
unique_id: "unique_id"
geometry_type: "POINT"
lat_column_name: "latitude"
lon_column_name: "longitude"
method: "WITHIN"
input_columns:
- "population"
mylines:
type: "vector"
s3_path: "s3://my-bucket/my_line_data.parquet"
unique_id: "unique_id"
geometry_type: "LINE"
geometry_column_name: "geometry"
method: "PCT_LENGTH"
input_columns:
- "emissions"
mypolygons:
type: "vector"
s3_path: "s3://my-bucket/my_polygon_data.parquet"
unique_id: "unique_id"
geometry_type: "POLYGON"
geometry_column_name: "geom_wkt"
method: "PCT_AREA"
input_columns:
- "population"
As JSON:
{
"name": "project_name",
"version": "1.0.0",
"h3_resolution": 6,
"output_s3_path": "s3://my-bucket/output/",
"inputs": {
"mypoints": {
"type": "vector",
"s3_path": "s3://my-bucket/my_points.parquet",
"unique_id": "unique_id",
"geometry_type": "POINT",
"lat_column_name": "latitude",
"lon_column_name": "longitude",
"method": "WITHIN",
"input_columns": [
"population"
]
},
"mylines": {
"type": "vector",
"s3_path": "s3://my-bucket/my_line_data.parquet",
"unique_id": "unique_id",
"geometry_type": "LINE",
"geometry_column_name": "geometry",
"method": "PCT_LENGTH",
"input_columns": [
"emissions"
]
},
"mypolygons": {
"type": "vector",
"s3_path": "s3://my-bucket/my_polygon_data.parquet",
"unique_id": "unique_id",
"geometry_type": "POLYGON",
"geometry_column_name": "geom_wkt",
"method": "PCT_AREA",
"input_columns": [
"population"
]
}
}
}
The following methods are enabled for point inputs:
WITHIN
: The attribution of the point is allocated to the H3 grid that the point lies within.
The indexer will produce an output with the following columns for point inputs:
h3_index
: The H3 index.h3_resolution
: The H3 resolution of the index, as provided in the job config.h3_r3_parent
: The Resolution=3 parent index of the H3 grid. This column is used for partitioning.h3_area_km2
: The area of the H3 hexagon in km2.<unique_id>
: The unique ID of the input, based on the unique_id for the input provided in the job config.ratio
: The percentage of the point associated with the H3 grid. For methodWITHIN
, this will always be 1.00.total_count
: The total count of the point (will always be 1).
The primary key of the output is the combination of h3_index
and <unique_id>
.
For example:
h3_index | h3_resolution | h3_r3_parent | h3_area_km2 | pixel_id | ratio | total_count |
---|---|---|---|---|---|---|
840e4d3ffffffff | 4 | 830e4dfffffffff | 2004.4344472440796 | 6834 | 1.0 | 1 |
840e4d7ffffffff | 4 | 830e4dfffffffff | 2011.5201608518523 | 13638 | 1.0 | 1 |
This example shows 2 H3 hexagons (2 rows) at resolution 4 that each intersected with 1 unique point (pixel_id). The area of both hexagons is ~2005km2. 100% of each point is associated with the hexagon that they are located within. The total count of each point is 1.
The following methods are enabled for line inputs:
PCT_LENGTH
: The attribution of the line is allocated to the H3 grids based on the percentage of the total length of the line that intersects with the H3 grids. For example, if 50% of the line falls within an H3 grid, then 50% of the line’s attributes get mapped to the H3 grid.
The indexer will produce an output with the following columns for line inputs:
h3_index
: The H3 index.h3_resolution
: The H3 resolution of the index, as provided in the job config.h3_r3_parent
: The Resolution=3 parent index of the H3 grid. This column is used for partitioning.h3_area_km2
: The area of the H3 hexagon in km2.<unique_id>
: The unique ID of the input, based on the unique_id for the input provided in the job config.ratio
: The percent length of the line associated with the H3 grid. This calculation is based on the method for the input provided in the job config.total_length_km
: The total length in km of the line.
The primary key of the output is the combination of h3_index
and <unique_id>
.
For example:
h3_index | h3_resolution | h3_r3_parent | h3_area_km2 | route_id | ratio | total_length_km |
---|---|---|---|---|---|---|
86446cae7ffffff | 6 | 86446cfffffffff | 40.55609958082783 | route123 | 0.7927260401223691 | 10.74319262992267 |
86446ca57ffffff | 6 | 86446cfffffffff | 40.58272492454886 | route123 | 0.2072739598776309 | 10.74319262992267 |
This example table depicts 2 H3 hexagons (2 rows) at resolution 6 that intersected with a single line (route123). The area of both hexagons is ~40km2. 79.3% of the line intersected with the first hexagon and 20.7% of the line intersected with the second hexagon. The total length of line route123 is 10.7 km.
The following methods are enabled for polygon inputs:
PCT_AREA
: The attribution of the polygon is allocated to the H3 grids based on the percentage of the total area of the polygon that overlaps with the H3 grids. For example, if 50% of the polygon overlaps with an H3 grid, then 50% of the polygon’s attributes get mapped to the H3 grid.
The indexer will product an output with the following columns for polygon inputs:
h3_index
: The H3 index.h3_resolution
: The H3 resolution of the index, as provided in the job config.h3_r3_parent
: The Resolution=3 parent index of the H3 grid. This column is used for partitioning.h3_area_km2
: The area of the H3 hexagon in km2.<unique_id>
: The unique ID of the input, based on the unique_id for the input provided in the job config.ratio
: The percent area of the polygon associated with the H3 grid. This calculation is based on the method for the input provided by the job config.total_area_km2
: The total area in km2 of the polygon.
The primary key of the output is the combination of h3_index
and <unique_id>
.
For example:
h3_index | h3_resolution | h3_r3_parent | h3_area_km2 | GEOID20 | ratio | total_area_km2 |
---|---|---|---|---|---|---|
8644697b7ffffff | 6 | 864469fffffffff | 40.12018482559633 | geoid123 | 0.6830630172222252 | 0.7395279788133455 |
86446945fffffff | 6 | 86446ffffffffff | 40.145249906240224 | geoid123 | 0.3169369827778003 | 0.7395279788133455 |
This example table depicts 2 H3 hexagons (2 rows) at resolution 6 that intersected with a single polygon (geoid123). The area of both hexagons is ~40km2. 68.3% of the polygon intersected with the first hexagon and 31.7% of the polygon intersected with the second hexagon. The total area of polygon geoid123 is ~0.74 km2.
The resolver takes all of the input_columns
of all of the inputs in the Job config and resolves them to H3 grids. It does this by taking the ratios associated with each H3 grid, then summing the attributes up in a GROUP BY h3_index
. This operation handles for known overlaps on H3 grids, for example: when several routes pass through the same H3 grid, or when several blocks intersect with the same H3 grid.
The resolver will produce an output with the following columns:
h3_index
: The H3 index.h3_resolution
: The H3 resolution of the index, as provided in the job config.h3_area_km2
: The area of the H3 hexagon in km2.sum_<input_column_name>
: There will be one of these columns from all of theinput_columns
for each input from the job config. The values are multiplied by their corresponding ratios, then summed and grouped by the H3 indices that they intersected with.
The primary key of the output is h3_index
.
For example:
h3_index | h3_resolution | h3_area_km2 | sum_census_value | sum_route_value |
---|---|---|---|---|
86446cae7ffffff | 6 | 40.55609958082783 | 100.43 | 7.8728 |
86446ca57ffffff | 6 | 40.58272492454886 | 9433.12 | 0.007726 |
This example table depicts 2 H3 hexagons (2 rows) at resolution 6. The area of both hexagons is ~40km2. This example resolver output would have had 2 inputs:
- A polygon input table with the column: census_value
- A line input table with the column: route_value
For each hexagon, the column named sum_census_value
is calculated by taking the sum of the census_value
x ratio
for all intersecting polygons. The column named sum_route_value
is calculated by taking the sum of the route_value
x ratio
for all intersecting lines.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.