-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Currently, a call to STRPOS
with a Utf8View datatypes induces a cast. After the change that fixes this issue, it should not.
query TT
EXPLAIN SELECT
STRPOS(column1_utf8view, 'f') as c,
STRPOS(column1_utf8view, column2_utf8view) as c2
FROM test;
----
logical_plan
01)Projection: strpos(__common_expr_1, Utf8("f")) AS c, strpos(__common_expr_1, CAST(test.column2_utf8view AS Utf8)) AS c2
02)--Projection: CAST(test.column1_utf8view AS Utf8) AS __common_expr_1, test.column2_utf8view
03)----TableScan: test projection=[column1_utf8view, column2_utf8view]
strpos
is defined here: https://github.com/apache/datafusion/blob/main/datafusion/functions/src/unicode/strpos.rs
casting tests are in: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string_view.slt
Is your feature request related to a problem or challenge?
We are working to add complete StringView support in DataFusion, which permits potentially much faster processing of string data. See #10918 for more background.
Today, most DataFusion string functions support DataType::Utf8 and DataType::LargeUtf8 and when called with a StringView argument DataFusion will cast the argument back to DataType::Utf8 which is expensive.
To realize the full speed of StringView, we need to ensure that all string functions support the DataType::Utf8View directly.
Describe the solution you'd like
Update the function to support DataType::Utf8View directly
Describe alternatives you've considered
The typical steps are:
- Write some tests showing the function doesn't support Utf8View (see the tests in
string_view.slt
to ensure the arguments are not being cast - Change the
Signature
of the function to acceptUtf8View
in addition toUtf8
/LargeUtf8
- Update the implementation of the function to operate on
Utf8View
Example PRs
- Update to use an arrow kernel that already supports StringView: feat: support
Utf8View
type instarts_with
function #11787 - Change the implementation to support StringView directly: Implement native support StringView for character length #11676
- Change implementation (option 2): Initial support for regex_replace on
StringViewArray
#11556
Additional context
The documentation of string functions can be found here: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#string-functions
To test a function with StringView with datafusion-cli
you can use an example like this (replacing starts_with
with the relevant function)
> create table foo as values (arrow_cast('foo', 'Utf8View'), arrow_cast('bar', 'Utf8View'));
0 row(s) fetched.
Elapsed 0.043 seconds.
> select starts_with(column1, column2) from foo;
+--------------------------------------+
| starts_with(foo.column1,foo.column2) |
+--------------------------------------+
| false |
+--------------------------------------+
1 row(s) fetched.
Elapsed 0.015 seconds.
To see if it is using utf8 view, use EXPLAIN
to see the plan and verify there is no CAST
. In this example the CAST(column1@0 AS Utf8)
indicates that the function is not using Utf8View
natively
> explain select starts_with(column1, column2) from foo;
+---------------+------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: starts_with(CAST(foo.column1 AS Utf8), CAST(foo.column2 AS Utf8)) |
| | TableScan: foo projection=[column1, column2] |
| physical_plan | ProjectionExec: expr=[starts_with(CAST(column1@0 AS Utf8), CAST(column2@1 AS Utf8)) as starts_with(foo.column1,foo.column2)] |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.006 seconds.
It is also often good to test with a constant as well (likewise there should be no cast):
> explain select starts_with(column1, 'foo') from foo;
+---------------+----------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: starts_with(CAST(foo.column1 AS Utf8), Utf8("foo")) |
| | TableScan: foo projection=[column1] |
| physical_plan | ProjectionExec: expr=[starts_with(CAST(column1@0 AS Utf8), foo) as starts_with(foo.column1,Utf8("foo"))] |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+----------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.
### Additional context
_No response_