Skip to content

Conversation

timsaucer
Copy link
Member

@timsaucer timsaucer commented Dec 6, 2024

Which issue does this PR close?

Closes #513

This is built on top of #1267
I will rebase once that PR merges.

Rationale for this change

Users would like to use DataFrames as a parameter inside an SQL query. With this change, you can do the following:

from datafusion import SessionContext
ctx = SessionContext()
df_customer = ctx.read_parquet("examples/tpch/data/customer.parquet")
ctx.sql("select c_custkey, c_name from $df", df=df_customer)

This change allows for string replacement of any placeholder in the SQL statement. For most python objects this is calling str() on them. For DataFrame objects we register a temporary view and replace the parameter with the generated name of the view.

What changes are included in this PR?

  • Add param_values to allow prepare statement style replacement of scalar values
  • Add token parsing of placeholders to perform string replacement.
  • Parses string converted arguments into tokens to perform SQL validation of final object
  • Verifies generated strings contain exactly one statement to ensure malicious code is not injected
  • Add user documentation
  • Added unit tests

Are there any user-facing changes?

Existing code is not impacted, but new parameters are added.

Example

From the updated user documentation:

Screenshot 2025-10-13 at 8 26 03 AM

@MrPowers
Copy link
Contributor

MrPowers commented Dec 6, 2024

This user interface looks nice 😎

Copy link

@matko matko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that whenever a file is queried now, it'll also be registered as a table. This does have some impact, as it means these tables are now also returned whenever a context is queried for all registered tables. Meaning any sort of visualization or automation based on that would behave differently depending on whether certain queries were run.

I personally find this very surprising. I would not expect read_parquet to secretly register_parquet as that is not what this library did before, nor is it what the rust library does.

Are you sure this is fine and won't affect people? At the very least, shouldn't there be a way to filter these out easily, to cleanly differentiate between auto-registered and explicitly registered tables?

I very much see the value of the parameterized sql feature, but this seems like a very crude way of doing it.

skip_metadata: bool = True,
schema: pyarrow.Schema | None = None,
file_sort_order: list[list[Expr]] | None = None,
table_name: str | None = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, I've seen this parameterized as read_parquet().to_view() (e.g., Ibis, DuckDB, SedonaDB)

@timsaucer
Copy link
Member Author

timsaucer commented Oct 5, 2025

I'm going to update this PR to use $param instead of {param} but I don't like my current approach because it will not work with dataframes that contain custom table providers. We need a more robust solution, which I am investigating.

@timsaucer timsaucer marked this pull request as draft October 11, 2025 14:33
@timsaucer timsaucer force-pushed the feat/parameterized-sql-queries branch from 0f2dccf to b6bf566 Compare October 12, 2025 13:45
@timsaucer timsaucer marked this pull request as ready for review October 13, 2025 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Is it possible to pass query parameters? (:param or ?)

4 participants