Skip to content

array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different #8032

@Maxsparrow

Description

@Maxsparrow

Describe the bug

When using the array_agg function in a query on a DataFusion table generated by a pyarrow Dataset on parquet files, the .to_pandas() and .to_polars() functions error with the below stacktrace:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[29], line 1
----> 1 results.to_pandas()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/table.pxi:4057, in pyarrow.lib.Table.from_batches()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different: 
double_field: list<item: double>
string_field: list<item: string>
vs
double_field: list<item: double> not null
string_field: list<item: string> not null

To Reproduce

Using latest pyarrow and datafusion (other versions not tested):
pyarrow==14.0.0
datafusion==32.0.0

import pandas as pd
df = pd.DataFrame({'double_field': [1.5, 2.5, 3.5], 'string_field': ['a', 'b', 'c']})
df.to_parquet("test_array_agg.parquet")

from pyarrow import dataset
test_dataset = dataset.dataset("test_array_agg.parquet")

from datafusion import SessionContext
ctx = SessionContext()
ctx.register_dataset("test_table", test_dataset)
tbl = ctx.table("test_table")

query = """
SELECT
    array_agg("double_field" ORDER BY "string_field") as "double_field",
    array_agg("string_field" ORDER BY "string_field") as "string_field"
FROM test_table
"""
results = ctx.sql(query)

# Result schema is list<item: double>
print(results.schema())

# First pyarrow.RecordBatch schema is list<item: double> not null
print(results.collect()[0].schema)

# ERRORS with ArrowInvalid
results.to_pandas()

Expected behavior

The .to_pandas() method should return a DataFrame with arrays in each cell.

Additional context

The query works and returns the expected result if you wrap array_agg in additional calls to make_array and array_extract:

SELECT
    array_extract(make_array(array_agg("double_field") ORDER BY "string_field"), 1) as "double_field",
    array_extract(make_array(array_agg("string_field") ORDER BY "string_field"), 1) as "string_field"
FROM test_table

Returns from .to_pandas() with:

image

In this case both results.schema() and results.collect()[0].schema are list<item: double>/list<item: string>

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions