Closed
Labels
bug (Something isn't working)
Description
Describe the bug
When the array_agg function is used in a query on a DataFusion table backed by a pyarrow Dataset over parquet files, both .to_pandas() and .to_polars() fail with the stack trace below:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
Cell In[29], line 1
----> 1 results.to_pandas()
File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/table.pxi:4057, in pyarrow.lib.Table.from_batches()
File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: Schema at index 0 was different:
double_field: list<item: double>
string_field: list<item: string>
vs
double_field: list<item: double> not null
string_field: list<item: string> not null
To Reproduce
Using latest pyarrow and datafusion (other versions not tested):
pyarrow==14.0.0
datafusion==32.0.0
import pandas as pd
df = pd.DataFrame({'double_field': [1.5, 2.5, 3.5], 'string_field': ['a', 'b', 'c']})
df.to_parquet("test_array_agg.parquet")
from pyarrow import dataset
test_dataset = dataset.dataset("test_array_agg.parquet")
from datafusion import SessionContext
ctx = SessionContext()
ctx.register_dataset("test_table", test_dataset)
tbl = ctx.table("test_table")
query = """
SELECT
array_agg("double_field" ORDER BY "string_field") as "double_field",
array_agg("string_field" ORDER BY "string_field") as "string_field"
FROM test_table
"""
results = ctx.sql(query)
# Result schema is list<item: double>
print(results.schema())
# First pyarrow.RecordBatch schema is list<item: double> not null
print(results.collect()[0].schema)
# ERRORS with ArrowInvalid
results.to_pandas()
Expected behavior
The .to_pandas() method should return a DataFrame with arrays in each cell.
Additional context
The query works and returns the expected result if you wrap array_agg in additional calls to make_array and array_extract:
SELECT
array_extract(make_array(array_agg("double_field") ORDER BY "string_field"), 1) as "double_field",
array_extract(make_array(array_agg("string_field") ORDER BY "string_field"), 1) as "string_field"
FROM test_table
With this rewrite, .to_pandas() returns without error, and both results.schema() and results.collect()[0].schema report list<item: double>/list<item: string>.
