Skip to content

Error When Querying Partitioned JSON Table #7816

@devinjdangelo

Description

@devinjdangelo

Describe the bug

I am seeing the following error when attempting to query a table of hive style partitioned JSON files via datafusion-cli:

Arrow error: Json error: Encountered unmasked nulls in non-nullable StructArray child: Field { name: "a", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }

To Reproduce

I have a table with the following directory structure:

dev@dev:~/arrow-datafusion/test_table$ ls -lhR
.:
total 12K
drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=2'
drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=4'
drwxrwxr-x 2 dev dev 4.0K Oct 12 17:35 'a=6'

'./a=2':
total 4.0K
-rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json

'./a=4':
total 4.0K
-rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json

'./a=6':
total 4.0K
-rw-rw-r-- 1 dev dev 20 Oct 12 17:35 tn0Sfag4abaDm6i2.json

And the JSON files look like:

dev@dev:~/arrow-datafusion/test_table$ cat a\=2/tn0Sfag4abaDm6i2.json 
{"b":"1"}
{"b":"1"}

Attempting to query like the following fails:

dev@dev:~/arrow-datafusion$ datafusion-cli
DataFusion CLI v32.0.0
❯ CREATE EXTERNAL TABLE
json_test(a string, b string)
STORED AS json
LOCATION './test_table'
PARTITIONED BY (a);
0 rows in set. Query took 0.001 seconds.

❯ select * from json_test;
Arrow error: Json error: Encountered unmasked nulls in non-nullable StructArray child: Field { name: "a", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }
❯ 

Querying without the partitions defined works as expected:

dev@dev:~/arrow-datafusion$ datafusion-cli
DataFusion CLI v32.0.0
❯ CREATE EXTERNAL TABLE
json_test(b string)
STORED AS json
LOCATION './test_table';
0 rows in set. Query took 0.001 seconds.

❯ select * from json_test;
+---+
| b |
+---+
| 3 |
| 3 |
| 1 |
| 1 |
| 5 |
| 5 |
+---+
6 rows in set. Query took 0.002 seconds.

The exact same table structure DDL works for CSV and parquet files, but not JSON.

Expected behavior

The above json query should work.

Additional context

I discovered this while working on https://github.com/apache/arrow-datafusion/pull/7801/files#diff-0580d65ff5db0c78c1fa4cf693f2567e7d2394923412687560614836401c223f, and there are additional relevant tests in this PR.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions