- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.7k
Description
Is your feature request related to a problem or challenge?
It is sometimes helpful to have a custom table functions to extend DataFusion's functionality. As we continue to get more feature requests, such as #7859 it is important to support such usecases without having to add everything to the DataFusion core.
For example in the following query my_custom_fun is a table function
SELECT foo, bar FROM my_custom_fun(city='NYC', year=2020)A specific example might be a function that fetches the contents of remote csv file and parses it into a table.
SELECT date, value FROM parse_remote_csv('https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD')You can do something similar to this with a TableProvider, but the main differences are:
- A TableProviderhas no way to pass parameters
- A TableProvider's schema is fixed (it can't be a function of the parameters)
Prior Art
Other examples include the read_parquet etc functions in DuckDB
SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
...
SELECT * FROM parquet_schema('test.parquet');Describe the solution you'd like
I would like to be able to have a table function that supported everything a TableProvider does, including filter and projection pushdown. One way to do so would be to actually return a TableProvider:
Option 1: Add to FunctionRegistry:
We could add Table Functions to the datafusion::execution::FunctionRegistry along with the UDFs, UDAs, etc which arguably would make them easier to discover
Something like
trait FunctionRegistry  { 
...
/// Return a  `TableProvider` for executing the `name` table function
fn udtf(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
}We would probably also need a
trait TableUDF {
  /// Return a  `TableProvider` for executing this  table function, given the specified
  /// arguments
  fn invoke(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
}Describe alternatives you've considered
This API is very powerful and would allow Table Functions to do anything a table provider does. We could also offer a stripped down version of the API potentially
We can probably add something like datafusion::logical_expr::create_udf to make it easier to construct basic table functions (e.g that produce a single SendableRecordBatchStream)
Add to SchemaProvider:
We could also add Table Functions to datafusion::catalog::schema::SchemaProvider
This might make sense given how similar TableFunctions are to TableProviders
trait SchemaProvider  { 
...
/// Return a  `TableProvider` for executing the `name` table function, given the specified
/// arguments
fn table_function(name: &str) -> Result<Arc<dyn TableProvider>>;
/// Register the `TableFunction` with the specified name, returning the previously registered function, if any
fn register_table_function(name: &str) -> Result<Option<Arc<dyn TableProvider>>>;
...
}Additional context
I thought there was an existing ticket for this, but I can not find one
This came up several times, including: