Skip to content

Regression: DataFrame::schema returns incorrect schema for NATURAL JOIN #14058

@DDtKey

Description

@DDtKey

Describe the bug

Affected Version: 42.x, 43.x, 44.x (regression since 41.x)

The DataFrame::schema (=> LogicalPlan::schema) method returns a schema that includes all columns from the joined sources (using NATURAL JOIN), including columns not present in the final output. This behavior is incorrect and inconsistent with the documented behavior:

Returns the DFSchema describing the output of this DataFrame.

To Reproduce

Simple MRE here:

// Works for datafusion: 41.x and earlier
// Failed for datafusion: 42.x and later (including 44.x)

use datafusion::arrow::util::pretty;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Create table1
    ctx.sql(
        r#"
        CREATE TABLE table1 AS
        SELECT * FROM (
            VALUES
            (1, 'a'),
            (2, 'b'),
            (3, 'c')
        ) AS t(id, value1)
        "#,
    )
    .await?;

    // Create table2
    ctx.sql(
        r#"
        CREATE TABLE table2 AS
        SELECT * FROM (
            VALUES
            (1, 'x'),
            (3, 'y'),
            (4, 'z')
        ) AS t(id, value2)
        "#,
    )
    .await?;

    // Execute NATURAL JOIN query
    let df = ctx.sql("SELECT * FROM table1 NATURAL JOIN table2").await?;

    // Incorrect schema includes all columns from both tables
    let schema = df.schema().as_arrow().clone();
    println!("Schema: {:?}", schema);

    // Output does not include all columns
    let result = df.collect().await?;
    pretty::print_batches(&result)?;

    let result_schema = result.first().unwrap().schema();
    assert_eq!(&schema, &*result_schema, "Schema mismatch");

    Ok(())
}

Deps:

datafusion = "44.0.0"
tokio = { version = "1", features = ["full"] }

Expected behavior

The schema returned by DataFrame::schema should match the structure of the output produced by collect/collect_partitioned and etc. Specifically:

  • Excluded columns from the result of a NATURAL JOIN should not appear in the schema.

Or, if it was intended - the documentation should be aligned and be clear how to access the schema.
However, I find previous behavior correct and useful (e.g - get schema before methods like write_parquet/csv/json)

Additional context

This is a regression, as the method previously worked correctly in version 41.x.x and earlier.

Also, it probably points to the missing test coverage for particular code-paths. In a sense it's not enough to compare SQL execution results

Metadata

Metadata

Assignees

Labels

bugSomething isn't workinghelp wantedExtra attention is neededregressionSomething that used to work no longer does

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions