Restore custom SchemaAdapter functionality for Parquet #16791
Conversation
I'd like to add a unit test that confirms the custom schema adapter factory will be used if specified.

done
Thank you @adriangb
I am a bit unclear on how the physical expression rewriter and the schema adapter now interact when doing schema evolution. In particular, I am not sure about the intended behavior when both are present.

I wonder if we could make an example showing how to use the two APIs together; that would make it clearer what is supposed to happen.
> To resolve this you need to implement a custom `PhysicalExprAdapterFactory` and use that instead of a `SchemaAdapterFactory`.
> See the [default values example](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/default_column_values.rs) for how to do this.
> Opting into the new APIs will set you up for future changes, since we plan to expand use of `PhysicalExprAdapterFactory` to other areas of DataFusion.
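For reference, here is a minimal sketch of what a custom `PhysicalExprAdapterFactory` might look like, loosely modeled on the linked example. The module paths and trait signatures are my assumptions from around the DataFusion 49 timeframe and may not match the code in this PR exactly:

```rust
use std::sync::Arc;

use arrow::datatypes::{FieldRef, SchemaRef};
use datafusion::common::{Result, ScalarValue};
use datafusion::physical_expr::schema_rewriter::{
    DefaultPhysicalExprAdapterFactory, PhysicalExprAdapter, PhysicalExprAdapterFactory,
};
use datafusion::physical_expr::PhysicalExpr;

/// Hypothetical factory: delegates to the default rewriter, where a real
/// implementation would add its own handling (e.g. default column values).
#[derive(Debug)]
struct MyExprAdapterFactory;

impl PhysicalExprAdapterFactory for MyExprAdapterFactory {
    fn create(
        &self,
        logical_file_schema: SchemaRef,
        physical_file_schema: SchemaRef,
    ) -> Arc<dyn PhysicalExprAdapter> {
        let inner = DefaultPhysicalExprAdapterFactory
            .create(logical_file_schema, physical_file_schema);
        Arc::new(MyExprAdapter { inner })
    }
}

#[derive(Debug)]
struct MyExprAdapter {
    inner: Arc<dyn PhysicalExprAdapter>,
}

impl PhysicalExprAdapter for MyExprAdapter {
    fn rewrite(&self, expr: Arc<dyn PhysicalExpr>) -> Result<Arc<dyn PhysicalExpr>> {
        // Custom rewriting (missing columns, renames, defaults) would go
        // here; this sketch just falls through to the default behavior.
        self.inner.rewrite(expr)
    }

    fn with_partition_values(
        &self,
        partition_values: Vec<(FieldRef, ScalarValue)>,
    ) -> Arc<dyn PhysicalExprAdapter> {
        Arc::new(MyExprAdapter {
            inner: self.inner.with_partition_values(partition_values),
        })
    }
}
```

The factory would then be installed on the Parquet source via `with_physical_expr_adapter_factory`, as in the test snippet further down in this conversation.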
A link to some description of the future plan would be super helpful here
```rust
    rows_matched: metrics::Count,
    /// how long was spent evaluating this predicate
    time: metrics::Time,
    /// used to perform type coercion while filtering rows
```
I think it is a bit unclear how the schema mapper and expression rewriter work together -- I think the schema is mapped first and then the simplified physical expression is evaluated against the mapped schema rather than the file schema.

Maybe we can add a comment explaining how this works.
> I think it is a bit unclear how the schema mapper and expression rewriter work together

If you have an expression adapter, you map the expression and the expression is then evaluated against the physical file schema. So there is no longer a need for a SchemaAdapter: it will still be there, but it becomes a no-op because it's adapting between identical schemas.
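A rough sketch of that ordering (variable names are illustrative, not from the actual code):

```rust
// Illustrative only: names are made up for this sketch.
let adapter = expr_adapter_factory
    .create(logical_file_schema.clone(), physical_file_schema.clone());
// After this call the predicate refers to columns of the physical file schema.
let predicate = adapter.rewrite(predicate)?;
// Any SchemaAdapter mapping is then computed from the physical file schema
// to itself, so map_batch() is an identity transformation.
```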
```rust
    }

    fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
        let batch = self.schema_mapper.map_batch(batch)?;
```
Applying the schema mapper first means the predicate is applied on batches that have been mapped to the table schema, but wasn't the predicate rewritten to be in terms of the file schema?
This code basically becomes a no-op if a predicate rewriter is used. But instead of making it an `Option` or something like that, I think it's easier to leave it be, to minimize the version-on-version diff and the number of code paths.

It's a no-op because the schema mapping is computed from the physical file schema to the physical file schema.
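To restate that as comments on the snippet quoted above (the comments are mine; the body is abbreviated):

```rust
fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
    // When a PhysicalExprAdapter rewrote the predicate, schema_mapper was
    // built from the physical file schema to the physical file schema, so
    // this returns the batch unchanged.
    let batch = self.schema_mapper.map_batch(batch)?;
    // ... evaluate the already-rewritten predicate against `batch` ...
}
```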
```rust
}

#[tokio::test]
async fn test_custom_schema_adapter_no_rewriter() {
```
If possible, I think this test should be of the "end to end" variety (in core_integration) that shows how these APIs interact to rewrite predicates / schemas correctly. It is not super clear to me how these low-level APIs would be used by users, so I am not sure this test covers the cases correctly.
If you have an expression adapter you map the expression and the expression is then evaluated against the physical file schema. So there is no longer a need for a SchemaAdapter: it will still be there, but it becomes a no-op because it's adapting between identical schemas. I'm sorry I did not give more detail in the PR description. The idea is that users will fall into one of four scenarios (the combinations of custom vs. default `SchemaAdapter` and `PhysicalExprAdapter`).

This makes it completely backwards compatible: anyone using a custom SchemaAdapter has to make no code changes to continue using it. If they want to opt into the new mechanism, they can either stop setting a custom SchemaAdapter or also set a custom PhysicalExprAdapter. A SchemaAdapter, custom or default, is still used for projections.
Co-authored-by: Andrew Lamb <[email protected]>
I opened #16800 to track the big picture
Thank you @adriangb
```rust
        .unwrap()
        .with_schema(table_schema.clone())
        .with_schema_adapter_factory(Arc::new(DefaultSchemaAdapterFactory))
        .with_physical_expr_adapter_factory(Arc::new(
```
is the idea to show the default being used? Or did you mean to also provide a custom factory that changed the schema?
I'm going to follow up with a commit that uses several combinations of custom factories
```rust
        See https://github.com/apache/datafusion/issues/16800 for discussion and https://datafusion.apache.org/library-user-guide/upgrading.html#datafusion-49-0-0 for upgrade instructions.");
}

let (expr_adapter_factory, schema_adapter_factory) = match (
```
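For illustration, the selection could look roughly like this; the arm bodies are my reading of the four scenarios described earlier in this conversation, not necessarily the exact code in this PR:

```rust
// Sketch only: which adapters are used for each combination of custom
// factories supplied by the user.
let (expr_adapter_factory, schema_adapter_factory) = match (
    custom_expr_adapter_factory,
    custom_schema_adapter_factory,
) {
    // Neither set: use the new default expression adapter for predicates
    // (the default SchemaAdapter is still used for projections).
    (None, None) => (Some(default_expr_adapter_factory), None),
    // Only a custom SchemaAdapter: keep the legacy path unchanged so
    // existing code keeps working (no expression rewriting).
    (None, Some(schema)) => (None, Some(schema)),
    // A custom expression adapter (with or without a schema adapter):
    // predicates are rewritten to the file schema, so any schema mapping
    // used for filtering becomes an identity.
    (Some(expr), schema) => (Some(expr), schema),
};
```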
This is good.

Should we mark SchemaAdapter as deprecated as well? Maybe as a follow-on PR?
I'm a bit torn: we don't yet have an alternative for projections. We kind of want to mark it as deprecated for predicates but not for projections, which is weird. It may be in a bit of a limbo state for a release or two.
Co-authored-by: Andrew Lamb <[email protected]>
#16235 (comment)
This essentially partially reverts #16461 by keeping backward compatibility with the existing SchemaAdapter.