Skip to content

Auto detecting partitions with ListingTableFactory on Hive partitioned datasets #17049

@BlakeOrth

Description

@BlakeOrth

Describe the bug

When using a ListingTableFactory (i.e. using the datafusion-cli) tables that leverage Hive partitioning do not yield the expected columns derived from the hive key/value pairs (and other such oddities).

To Reproduce

A publicly accessible S3 bucket hosting Hive partitioned parquet data can be used to show the issues. The general structure of the data can be seen below:

$ aws s3 ls s3://overturemaps-us-west-2/release/2025-07-23.0/
                           PRE theme=addresses/
                           PRE theme=base/
                           PRE theme=buildings/
                           PRE theme=divisions/
                           PRE theme=places/
                           PRE theme=transportation/
$ aws s3 ls s3://overturemaps-us-west-2/release/2025-07-23.0/ --recursive
2025-07-22 14:42:50 1073537247 release/2025-07-23.0/theme=addresses/type=address/part-00000-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet
2025-07-22 14:43:06 1013189887 release/2025-07-23.0/theme=addresses/type=address/part-00001-57f746d8-98a1-4faa-94c2-4f084343ecac-c000.zstd.parquet
. . .

Here are a couple of examples off odd behavior when trying to access this data.

DataFusion CLI v49.0.0
> SET datafusion.execution.listing_table_ignore_subdirectory to false;
0 row(s) fetched.
Elapsed 0.001 seconds.

> CREATE EXTERNAL TABLE overture_maps
STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/';
0 row(s) fetched.
Elapsed 0.645 seconds.

> describe overture_maps;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
+-------------+-----------+-------------+
0 row(s) fetched.
Elapsed 0.001 seconds.

In the above example no schema has been detected at all, either from Hive keys or the underlying parquet data itself, in spite of ensuring the listing table can traverse subdirectories.

> CREATE EXTERNAL TABLE overture_maps
STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-07-23.0/theme=addresses/';
0 row(s) fetched.
Elapsed 0.911 seconds.

> select column_name, is_nullable from information_schema.columns where table_name='overture_maps';
+----------------+-------------+
| column_name    | is_nullable |
+----------------+-------------+
| id             | YES         |
| geometry       | YES         |
| bbox           | YES         |
| country        | YES         |
| postcode       | YES         |
| street         | YES         |
| number         | YES         |
| unit           | YES         |
| address_levels | YES         |
| postal_city    | YES         |
| version        | YES         |
| sources        | YES         |
+----------------+-------------+
12 row(s) fetched.
Elapsed 0.005 seconds.

> select column_name, is_nullable from information_schema.columns where table_name='overture_maps' and column_name='type';
+-------------+-------------+
| column_name | is_nullable |
+-------------+-------------+
+-------------+-------------+
0 row(s) fetched.
Elapsed 0.006 seconds.

Next, when a path to a subdirectory is given to create the table, the underlying parquet schema is successfully detected, however the columns corresponding to the Hive keys are still missing.

Expected behavior

When accessing a hive partitioned dataset with no explicit schema set, and having subdirectory traversal enabled, I would expect to get a table that can be queried/filtered on both the columns represented in the parquet schema as well as the hive partitions.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions