Skip to content

Conversation

timsaucer
Copy link
Member

@timsaucer timsaucer commented Aug 26, 2025

Which issue does this PR close?

Rationale for this change

For some users, adding in sqlparser greatly increases build times for their projects even though they are not using sql. This feature makes sql an optional dependency to improve build times and reduce the number of dependencies.

What changes are included in this PR?

Adds in a sql feature in the core crate and makes sqlparser an optional dependency. The rest of the changes are to account for the fact that some of the structs in sqlparser are used throughout the repository.

Are these changes tested?

Ideally this is effectively a no-op change. By enabling sql (or simply using the default features) there is no change and all tests should run as before. I have also run all tests with the --no-default-features flag.

Are there any user-facing changes?

There is one change that may be impactful to some users. By adding in the sql feature, if users are not using default features of the datafusion core crate, they will need to add sql to their Cargo.toml.

Additional context

When building on a mac m4, running a cargo clean and then checking time and build size these are my local results for just building the core datafusion library:

cargo build --release -p datafusion --lib --no-default-features  374.93s user 11.45s system 800% cpu 48.276 total
-rw-r--r--@ 1 tsaucer  staff    10M Aug 27 08:15 target/release/libdatafusion.rlib

cargo build --release -p datafusion --lib --no-default-features --features sql  453.58s user 12.40s system 761% cpu 1:01.21 total
-rw-r--r--@ 1 tsaucer  staff    17M Aug 27 08:17 target/release/libdatafusion.rlib

cargo build -p datafusion --lib --no-default-features  172.85s user 13.35s system 637% cpu 29.189 total
-rw-r--r--@ 1 tsaucer  staff    49M Aug 27 08:20 target/debug/libdatafusion.rlib

cargo build -p datafusion --lib --no-default-features --features sql  210.16s user 14.26s system 471% cpu 47.625 total
-rw-r--r--@ 1 tsaucer  staff    91M Aug 27 08:19 target/debug/libdatafusion.rlib

Net results:

  • Release mode without SQL: 48.276 seconds, 10M library
  • Release mode with SQL: 61.21 seconds, 17M library
  • Debug mode without SQL: 29.189 seconds, 49M library
  • Debug mode with SQL: 47.625 seconds, 91M library

The absolute time numbers aren't so bad when you build just this library by itself, but as part of larger projects it has a compounding effect.

@timsaucer timsaucer self-assigned this Aug 26, 2025
@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate physical-plan Changes to the physical-plan crate development-process Related to development process of DataFusion substrait Changes to the substrait crate labels Aug 26, 2025
@timsaucer timsaucer marked this pull request as ready for review August 27, 2025 11:18
@mbutrovich mbutrovich self-requested a review August 27, 2025 15:05
@crepererum
Copy link
Contributor

I've adjusted the description because I think that also closes #17258 🥳

rust-version: stable
- name: Run datafusion-common tests
run: cargo test --profile ci -p datafusion-common --features=pyarrow
run: cargo test --profile ci -p datafusion-common --features=pyarrow,sql
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we extend the CI to also verify that ALL crates at least cargo check with and without sql support? This YAML file has a few examples/listings of feature tests with cargo check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the existing coverage and it seems like the current pattern is to test features in CI here:

https://github.com/apache/datafusion/blob/49d49fd92dddf55bfb22787fea17dda1a698dc4d/.github/workflows/rust.yml#L188-L187

I took the liberty of adding the appropriate entry for the sql feature in this PR and documeting the feature

@mbutrovich
Copy link
Contributor

I was playing with this PR this morning to see if Comet could use it, and it looks like datafusion-sql is brought in with the nested_expressions feature (which Comet uses). I wonder if we can disentangle that dependency.

@findepi
Copy link
Member

findepi commented Sep 4, 2025

❤️ ❤️ ❤️ ❤️ ❤️

@alamb
Copy link
Contributor

alamb commented Sep 5, 2025

I was playing with this PR this morning to see if Comet could use it, and it looks like datafusion-sql is brought in with the nested_expressions feature (which Comet uses). I wonder if we can disentangle that dependency.

that would be great

@timsaucer
Copy link
Member Author

I was playing with this PR this morning to see if Comet could use it, and it looks like datafusion-sql is brought in with the nested_expressions feature (which Comet uses). I wonder if we can disentangle that dependency.

I did a quick check and with the latest push + comet I no longer have any sql dependency in the cargo lock file! I did have to make a change in comet, which was to set datafusion-functions-nested = { path="../../../datafusion/datafusion/functions-nested", default-features = false } in native/core/Cargo.toml.

@mbutrovich
Copy link
Contributor

I did a quick check and with the latest push + comet I no longer have any sql dependency in the cargo lock file! I did have to make a change in comet, which was to set datafusion-functions-nested = { path="../../../datafusion/datafusion/functions-nested", default-features = false } in native/core/Cargo.toml.

I'll give this a shot, thanks @timsaucer!

@timsaucer
Copy link
Member Author

I had to rebase, so you might want to pull again

@github-actions github-actions bot added documentation Improvements or additions to documentation and removed catalog Related to the catalog crate labels Sep 16, 2025
@alamb alamb changed the title feat: make sql an optional feature feat: add sql feature Sep 16, 2025
@alamb alamb changed the title feat: add sql feature feat: add sql feature to make sql planning optional Sep 16, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @timsaucer

Sorry for the delay in this PR. I merged up from main to resolve some conflicts and added a new test and docs. Hopefully it is good to go now.

Thanks again!

rust-version: stable
- name: Run datafusion-common tests
run: cargo test --profile ci -p datafusion-common --features=pyarrow
run: cargo test --profile ci -p datafusion-common --features=pyarrow,sql
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the existing coverage and it seems like the current pattern is to test features in CI here:

https://github.com/apache/datafusion/blob/49d49fd92dddf55bfb22787fea17dda1a698dc4d/.github/workflows/rust.yml#L188-L187

I took the liberty of adding the appropriate entry for the sql feature in this PR and documeting the feature

Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a big win from the final Comet binary size, but I think any encapsulation improvements are welcome. Thanks @timsaucer!

@timsaucer
Copy link
Member Author

I didn't see a big win from the final Comet binary size, but I think any encapsulation improvements are welcome. Thanks @timsaucer!

Right - I don't think you will see a binary difference, especially in release mode. But hopefully you will get some improvements in build time and number of dependencies.

@mbutrovich mbutrovich merged commit 0181c79 into apache:main Sep 16, 2025
30 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 16, 2025

🎉

}

#[cfg(not(feature = "sql"))]
pub(crate) fn parse_identifiers(s: &str) -> Result<Vec<String>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have two different, divergent implementations of parse_identifiers and parse_identifiers_normalized. This leads to unpredictable runtime behavior when integrating different components using DataFusion (one using sql feature and one not).

If the new implementation is equivalent to sqlparser-based, it should be the only implementation.

If it is not, it should not exist. All the APIs using it should be guarded by sql feature flag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1448 to +1470
#[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)]
#[cfg(not(feature = "sql"))]
pub struct ExceptSelectItem {
pub first_element: Ident,
pub additional_elements: Vec<Ident>,
}
#[cfg(not(feature = "sql"))]
impl Display for ExceptSelectItem {
fn fmt(&self, f: &mut Formatter) -> fmt::Result {
write!(f, "EXCEPT ")?;
if self.additional_elements.is_empty() {
write!(f, "({})", self.first_element)?;
} else {
write!(
f,
"({}, {})",
self.first_element,
display_comma_separated(&self.additional_elements)
)?;
}
Ok(())
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expr crate should become un-dependent from sqlparser::ast types, so these copies should be the only types we use.

Is there a follow-up issue planned to cleanup these duplicate types?

Copy link
Contributor

@alamb alamb Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate development-process Related to development process of DataFusion documentation Improvements or additions to documentation execution Related to the execution crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate proto Related to proto crate sql SQL Planner substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make sqlparser dependency optional RFC: Add features to reduce dependencies on core crate

5 participants