Description
Describe the bug
Running a SQL query against an NDJSON table with partition columns fails with the following error when filtering on any of the partition columns. In this case my partition column is a timestamp, but the failure occurs with other types as well.
ArrowError(JsonError("Encountered unmasked nulls in non-nullable StructArray child: Field { name: "hourly_timestamp", data_type: Timestamp(Second, None), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }"))
The files are pruned correctly; however, the partition predicate is not populated correctly. This is in contrast to the ParquetExec, which adds an extra predicate to populate the partition column. Compare the two plans (JsonExec first, then ParquetExec):
JsonExec: file_groups={1 group: [[Users/taylorbeever/git/theelderbeever/df-test/data/ndjson/hourly_timestamp=2023-09-25T20:00:00/data.ndjson]]}, projection=[id, timestamp, value, hourly_timestamp]
ParquetExec: file_groups={1 group: [[Users/taylorbeever/git/theelderbeever/df-test/data/parquet/hourly_timestamp=2023-09-25T20:00:00/data.zstd.parquet]]}, projection=[id, timestamp, value, hourly_timestamp], predicate=hourly_timestamp@3 = 1695672000
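To make the relationship between the directory name and the `predicate=hourly_timestamp@3 = 1695672000` in the ParquetExec plan concrete, here is a minimal standalone sketch of how a Hive-style `key=value` path segment maps to that value. It is not DataFusion's implementation; the helper names (`partition_values`, `days_from_civil`, `parse_epoch_seconds`) are hypothetical, and it assumes the timestamp is interpreted as UTC, since the column type is `Timestamp(Second, None)`.

```rust
/// Extract Hive-style `key=value` partition segments from a file path.
/// (Hypothetical helper for illustration, not a DataFusion API.)
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|seg| seg.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// Days since 1970-01-01 for a Gregorian calendar date
/// (the well-known "days from civil" algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // month index with March = 0
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parse `YYYY-MM-DDTHH:MM:SS` into epoch seconds, assuming UTC.
fn parse_epoch_seconds(ts: &str) -> Option<i64> {
    let (date, time) = ts.split_once('T')?;
    let mut d = date.splitn(3, '-').map(|p| p.parse::<i64>());
    let (y, m, day) = (d.next()?.ok()?, d.next()?.ok()?, d.next()?.ok()?);
    let mut t = time.splitn(3, ':').map(|p| p.parse::<i64>());
    let (h, min, s) = (t.next()?.ok()?, t.next()?.ok()?, t.next()?.ok()?);
    Some(days_from_civil(y, m, day) * 86_400 + h * 3_600 + min * 60 + s)
}

fn main() {
    // Tail of the path shown in the plans above.
    let path = "data/ndjson/hourly_timestamp=2023-09-25T20:00:00/data.ndjson";
    for (key, value) in partition_values(path) {
        // The partition value lives only in the directory name,
        // so it has to be filled in for every row read from the file.
        println!("{key} = {value} -> {:?}", parse_epoch_seconds(&value));
    }
}
```

Under these assumptions, the directory `hourly_timestamp=2023-09-25T20:00:00` corresponds to the epoch value 1695672000 that appears in the Parquet plan's predicate.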
Attempted solutions - all fail:
- Add partition columns to each json file.
- Define the Schema for the table
- Other datatypes for partition
To Reproduce
I created an example repo here.
Example data is included in the repo. All code is contained in src/main.rs. The parquet files contain the same data as the ndjson files. As written, the files do not contain a column for the partition column.
To run:
First the parquet one which will succeed.
RUST_LOG=debug cargo run -- parquet
Then the ndjson which will fail.
RUST_LOG=debug cargo run -- ndjson
Expected behavior
Partitioned table reads shouldn't fail when filtering on a partition column.
Additionally, the default file_extension for NdJsonReadOptions is .json, which is a little misleading. It should be one of .ndjson or .jsonl.
Additional context
No response