-
Couldn't load subscription status.
- Fork 1.7k
Description
Describe the bug
When using a ListingTableFactory (i.e. using the datafusion-cli) tables that leverage Hive partitioning do not yield the expected columns derived from the hive key/value pairs (and other such oddities).
To Reproduce
A publicly accessible S3 bucket hosting Hive partitioned parquet data can be used to show the issues. The general structure of the data can be seen below:
$ aws s3 ls s3://overturemaps-us-west-2/release/2025-07-23.0/
PRE theme=addresses/
PRE theme=base/
PRE theme=buildings/
PRE theme=divisions/
PRE theme=places/
PRE theme=transportation/$ aws s3 ls s3://overturemaps-us-west-2/release/2025-07-23.0/ --recursive
2025-07-22 14:42:50 1073537247 release/2025-07-23.0/theme=addresses/type=address/part-00000-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet
2025-07-22 14:43:06 1013189887 release/2025-07-23.0/theme=addresses/type=address/part-00001-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet
. . .Here are a couple of examples off odd behavior when trying to access this data.
DataFusion CLI v49.0.0
> SET datafusion.execution.listing_table_ignore_subdirectory to false;
0 row(s) fetched.
Elapsed 0.001 seconds.
> CREATE EXTERNAL TABLE overture_maps
STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/';
0 row(s) fetched.
Elapsed 0.645 seconds.
> describe overture_maps;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
+-------------+-----------+-------------+
0 row(s) fetched.
Elapsed 0.001 seconds.In the above example no schema has been detected at all, either from Hive keys or the underlying parquet data itself, in spite of ensuring the listing table can traverse subdirectories.
> CREATE EXTERNAL TABLE overture_maps
STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/theme=addresses/';
0 row(s) fetched.
Elapsed 0.911 seconds.
> select column_name, is_nullable from information_schema.columns where table_name='overture_maps';
+----------------+-------------+
| column_name | is_nullable |
+----------------+-------------+
| id | YES |
| geometry | YES |
| bbox | YES |
| country | YES |
| postcode | YES |
| street | YES |
| number | YES |
| unit | YES |
| address_levels | YES |
| postal_city | YES |
| version | YES |
| sources | YES |
+----------------+-------------+
12 row(s) fetched.
Elapsed 0.005 seconds.
> select column_name, is_nullable from information_schema.columns where table_name='overture_maps' and column_name='type';
+-------------+-------------+
| column_name | is_nullable |
+-------------+-------------+
+-------------+-------------+
0 row(s) fetched.
Elapsed 0.006 seconds.Next, when a path to a subdirectory is given to create the table, the underlying parquet schema is successfully detected, however the columns corresponding to the Hive keys are still missing.
Expected behavior
When accessing a hive partitioned dataset with no explicit schema set, and having subdirectory traversal enabled, I would expect to get a table that can be queried/filtered on both the columns represented in the parquet schema as well as the hive partitions.
Additional context
No response