Improve performance of SUBSTR for StringViewArray #12031

@alamb

Is your feature request related to a problem or challenge?

In https://github.com/apache/datafusion/pull/12019/files @dmitrybugakov added support for StringViewArray in the substr function ❤️

However, the initial implementation returns an output StringArray when the input is a StringViewArray, which means all the strings are copied.

In some functions, such as substr, this extra copy is unnecessary: only the views (aka the u128s that hold the length, prefix, and buffer location of each string) need to be adjusted. See GenericByteViewArray for more details.

Describe the solution you'd like

I think we can avoid the copy when the input uses StringViewArray and thus make substr faster

Describe alternatives you've considered

The idea would be to

  1. Create a benchmark for the substring function for StringArray, LargeStringArray and StringViewArray
  2. Optimize the implementation of substr

The optimization would likely look like:

  1. Change the signature of substr so it produces a StringViewArray when its first argument is a StringViewArray (at the moment it produces a StringArray in that case)
  2. Make a function that takes a StringViewArray as input and produces another StringViewArray as output, reusing the input's data buffers
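
To illustrate why no string copy is needed, here is a minimal, std-only sketch of the view layout and a substring operation that only rewrites the view. It is a simplified model, not the arrow-rs API: the names `make_view`, `substr_view`, and `view_bytes` are hypothetical, the data buffers are modelled as plain `Vec<u8>`, and `start`/`len` are 0-based byte offsets. A real substr is 1-based, counts characters, and would need to respect UTF-8 boundaries.

```rust
/// Strings of at most 12 bytes are stored inline in the 16-byte view.
const MAX_INLINE: usize = 12;

/// Read a little-endian u32 from the raw view bytes at position `at`.
fn read_u32(raw: &[u8; 16], at: usize) -> u32 {
    u32::from_le_bytes([raw[at], raw[at + 1], raw[at + 2], raw[at + 3]])
}

/// Build a view for `data[offset..offset + len]` (hypothetical helper).
fn make_view(data: &[u8], buffer_index: u32, offset: u32, len: u32) -> u128 {
    let bytes = &data[offset as usize..offset as usize + len as usize];
    let mut raw = [0u8; 16];
    raw[..4].copy_from_slice(&len.to_le_bytes());
    if bytes.len() <= MAX_INLINE {
        // Short string: length followed by the bytes themselves.
        raw[4..4 + bytes.len()].copy_from_slice(bytes);
    } else {
        // Long string: length, 4-byte prefix, buffer index, offset.
        raw[4..8].copy_from_slice(&bytes[..4]);
        raw[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        raw[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(raw)
}

/// Produce a view of a byte range of an existing view.
/// The data buffers are only read, never copied or extended.
fn substr_view(view: u128, buffers: &[Vec<u8>], start: u32, len: u32) -> u128 {
    let raw = view.to_le_bytes();
    let old_len = read_u32(&raw, 0);
    // Clamp the requested range to the original string.
    let start = start.min(old_len);
    let len = len.min(old_len - start);
    if old_len as usize <= MAX_INLINE {
        // Inline input: slice the bytes stored in the view itself.
        let begin = 4 + start as usize;
        make_view(&raw[begin..begin + len as usize], 0, 0, len)
    } else {
        // Long input: keep the same buffer, shift the offset.
        let buffer_index = read_u32(&raw, 8);
        let offset = read_u32(&raw, 12);
        make_view(&buffers[buffer_index as usize], buffer_index, offset + start, len)
    }
}

/// Resolve a view back to its bytes (used here only for checking results).
fn view_bytes(view: u128, buffers: &[Vec<u8>]) -> Vec<u8> {
    let raw = view.to_le_bytes();
    let len = read_u32(&raw, 0) as usize;
    if len <= MAX_INLINE {
        raw[4..4 + len].to_vec()
    } else {
        let buffer_index = read_u32(&raw, 8) as usize;
        let offset = read_u32(&raw, 12) as usize;
        buffers[buffer_index][offset..offset + len].to_vec()
    }
}
```

In this model, calling `substr_view` on every view of the input and reusing the input's buffers for the output would be the whole operation: the cost is one 16-byte write per row, independent of string length.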

Additional context

Here is an example benchmark: #12015

Here is the code to use to create StringViews: StringViewBuilder https://docs.rs/arrow/latest/arrow/array/type.StringViewBuilder.html
