-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Describe the bug
Affected Version: 42.x, 43.x, 44.x (regression since 41.x)
The DataFrame::schema (=> LogicalPlan::schema) method returns a schema that includes all columns from the joined sources (using NATURAL JOIN), including columns not present in the final output. This behavior is incorrect and inconsistent with the documented behavior:
Returns the DFSchema describing the output of this DataFrame.
To Reproduce
Simple MRE here:
// Works for datafusion: 41.x and earlier
// Failed for datafusion: 42.x and later (including 44.x)
use datafusion::arrow::util::pretty;
use datafusion::prelude::*;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
let ctx = SessionContext::new();
// Create table1
ctx.sql(
r#"
CREATE TABLE table1 AS
SELECT * FROM (
VALUES
(1, 'a'),
(2, 'b'),
(3, 'c')
) AS t(id, value1)
"#,
)
.await?;
// Create table2
ctx.sql(
r#"
CREATE TABLE table2 AS
SELECT * FROM (
VALUES
(1, 'x'),
(3, 'y'),
(4, 'z')
) AS t(id, value2)
"#,
)
.await?;
// Execute NATURAL JOIN query
let df = ctx.sql("SELECT * FROM table1 NATURAL JOIN table2").await?;
// Incorrect schema includes all columns from both tables
let schema = df.schema().as_arrow().clone();
println!("Schema: {:?}", schema);
// Output does not include all columns
let result = df.collect().await?;
pretty::print_batches(&result)?;
let result_schema = result.first().unwrap().schema();
assert_eq!(&schema, &*result_schema, "Schema mismatch");
Ok(())
}Deps:
datafusion = "44.0.0"
tokio = { version = "1", features = ["full"] }Expected behavior
The schema returned by DataFrame::schema should match the structure of the output produced by collect/collect_partitioned and etc. Specifically:
- Excluded columns from the result of a NATURAL JOIN should not appear in the schema.
Or, if it was intended - the documentation should be aligned and be clear how to access the schema.
However, I find previous behavior correct and useful (e.g - get schema before methods like write_parquet/csv/json)
Additional context
This is a regression, as the method previously worked correctly in version 41.x.x and earlier.
Also, it probably points to the missing test coverage for particular code-paths. In a sense it's not enough to compare SQL execution results