NDJsonExec doesn't properly apply predicates on partitioned tables. #7686

@theelderbeever

Describe the bug

Running a SQL query against an NDJSON table with partition columns fails with the following error whenever the query filters on any of the partition columns. In this case my partition column is a timestamp, but the same failure occurs with other types as well.

ArrowError(JsonError("Encountered unmasked nulls in non-nullable StructArray child: Field { name: "hourly_timestamp", data_type: Timestamp(Second, None), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }"))

The files are pruned correctly; however, the partition predicate is not populated. This is in contrast to ParquetExec, which adds an extra predicate to populate the partition column.

JsonExec: file_groups={1 group: [[Users/taylorbeever/git/theelderbeever/df-test/data/ndjson/hourly_timestamp=2023-09-25T20:00:00/data.ndjson]]}, projection=[id, timestamp, value, hourly_timestamp]

ParquetExec: file_groups={1 group: [[Users/taylorbeever/git/theelderbeever/df-test/data/parquet/hourly_timestamp=2023-09-25T20:00:00/data.zstd.parquet]]}, projection=[id, timestamp, value, hourly_timestamp], predicate=hourly_timestamp@3 = 1695672000
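For reference, the `1695672000` in the ParquetExec predicate is the directory value `2023-09-25T20:00:00` expressed as seconds since the Unix epoch, assuming the value is read as UTC. The sketch below uses only the standard library to show where a partition value comes from (the `key=value` path segment, not the file contents) and how it maps to that number. It is an illustration, not DataFusion's implementation; the function names `partition_values`, `days_from_civil`, and `epoch_seconds` are hypothetical.

```rust
/// Collect (column, value) pairs from Hive-style `key=value` path segments.
/// Hypothetical helper for illustration; not a DataFusion API.
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// Days since 1970-01-01 for a Gregorian date (the well-known
/// "days from civil" algorithm, with the year starting in March).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parse `YYYY-MM-DDTHH:MM:SS` as seconds since the Unix epoch.
/// Assumption: the value carries no time zone and is treated as UTC.
fn epoch_seconds(ts: &str) -> Option<i64> {
    let (date, time) = ts.split_once('T')?;
    let d: Vec<i64> = date
        .split('-')
        .map(|p| p.parse::<i64>().ok())
        .collect::<Option<_>>()?;
    let t: Vec<i64> = time
        .split(':')
        .map(|p| p.parse::<i64>().ok())
        .collect::<Option<_>>()?;
    if d.len() != 3 || t.len() != 3 {
        return None;
    }
    Some(days_from_civil(d[0], d[1], d[2]) * 86_400 + t[0] * 3_600 + t[1] * 60 + t[2])
}

fn main() {
    // Trailing part of the path shown in the JsonExec plan above.
    let path = "data/ndjson/hourly_timestamp=2023-09-25T20:00:00/data.ndjson";
    for (column, value) in partition_values(path) {
        println!("{column} = {value} -> {:?}", epoch_seconds(&value));
    }
}
```

For the path above this prints `hourly_timestamp = 2023-09-25T20:00:00 -> Some(1695672000)`, matching the literal in the ParquetExec predicate.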

Attempted workarounds, all of which fail:

  • Adding the partition columns to each JSON file.
  • Defining the schema for the table explicitly.
  • Using other data types for the partition column.

To Reproduce

I created an example repo here.

Example data is included in the repo. All code is contained in src/main.rs. The Parquet files contain the same data as the NDJSON files. As written, the files do not contain a column for the partition column.

To run:

First the Parquet case, which succeeds:

RUST_LOG=debug cargo run -- parquet

Then the NDJSON case, which fails:

RUST_LOG=debug cargo run -- ndjson

Expected behavior

Partitioned table reads shouldn't fail when filtering on a partition column.

Additionally, the default file_extension for NDJsonReadOptions is .json, which is a little misleading. It should be one of .ndjson or .jsonl.

Additional context

No response

Labels: bug, help wanted
