array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different

### Describe the bug

When using the `array_agg` function in a query on a DataFusion table generated by a pyarrow Dataset on parquet files, the .to_pandas() and .to_polars() functions error with the below stacktrace:

```
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[29], line 1
----> 1 results.to_pandas()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/table.pxi:4057, in pyarrow.lib.Table.from_batches()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/envs/opt-strategies-amer/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different: 
double_field: list<item: double>
string_field: list<item: string>
vs
double_field: list<item: double> not null
string_field: list<item: string> not null
```

### To Reproduce

Using latest pyarrow and datafusion (other versions not tested):
pyarrow==14.0.0
datafusion==32.0.0

```python
import pandas as pd
df = pd.DataFrame({'double_field': [1.5, 2.5, 3.5], 'string_field': ['a', 'b', 'c']})
df.to_parquet("test_array_agg.parquet")

from pyarrow import dataset
test_dataset = dataset.dataset("test_array_agg.parquet")

from datafusion import SessionContext
ctx = SessionContext()
ctx.register_dataset("test_table", test_dataset)
tbl = ctx.table("test_table")

query = """
SELECT
    array_agg("double_field" ORDER BY "string_field") as "double_field",
    array_agg("string_field" ORDER BY "string_field") as "string_field"
FROM test_table
"""
results = ctx.sql(query)

# Result schema is list<item: double>
print(results.schema())

# First pyarrow.RecordBatch schema is list<item: double> not null
print(results.collect()[0].schema)

# ERRORS with ArrowInvalid
results.to_pandas()
```

### Expected behavior

The `.to_pandas()` method should return a DataFrame with arrays in each cell.

### Additional context

The query works and returns the expected result if you wrap array_agg in additional calls to make_array and array_extract:

```python
SELECT
    array_extract(make_array(array_agg("double_field") ORDER BY "string_field"), 1) as "double_field",
    array_extract(make_array(array_agg("string_field") ORDER BY "string_field"), 1) as "string_field"
FROM test_table
```

Returns from .to_pandas() with:

![image](https://github.com/apache/arrow-datafusion/assets/7087373/8f5ef1c3-3dea-4d5f-a0cf-3014e0fa485d)

In this case both `results.schema()` and `results.collect()[0].schema` are `list<item: double>`/`list<item: string>`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different #8032

Describe the bug

To Reproduce

Expected behavior

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

array_agg with pyarrow errors with ArrowInvalid: Schema at index 0 was different #8032

Description

Describe the bug

To Reproduce

Expected behavior

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions