Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 18 additions & 3 deletions datafusion/functions/src/utils.rs
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ use datafusion_expr::{ColumnarValue, ScalarFunctionImplementation};
///
/// If the input type is `Utf8` or `Binary` the return type is `$utf8Type`,
///
/// If the input type is `Utf8View` the return type is `Utf8View`,
/// If the input type is `Utf8View` the return type is $utf8Type,
macro_rules! get_optimal_return_type {
($FUNC:ident, $largeUtf8Type:expr, $utf8Type:expr) => {
pub(crate) fn $FUNC(arg_type: &DataType, name: &str) -> Result<DataType> {
Expand All @@ -41,8 +41,8 @@ macro_rules! get_optimal_return_type {
DataType::LargeUtf8 | DataType::LargeBinary => $largeUtf8Type,
// Binary inputs are automatically coerced to Utf8
DataType::Utf8 | DataType::Binary => $utf8Type,
// Utf8View inputs will yield Utf8View outputs
DataType::Utf8View => DataType::Utf8View,
// Utf8View max offset size is u32::MAX, the same as UTF8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did you find this? Can we add a test for it (e.g. an explain plan test showing a change in type)?

I can help write one if you explain the symptoms you were seeing to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used in string related ScalarUDFs, for example: https://github.com/XiangpengHao/datafusion/blob/string-view2-local/datafusion/functions/src/unicode/character_length.rs#L70

I added a small unit test to the function.

Note that there are a bunch of string related functions that only accepts Utf8 and LargeUtf8, we currently rely on coerce rules to cast them, which won't panic but may be slower than it should be. I think we should add native support to Utf8View, I'm working on it.

DataType::Utf8View | DataType::BinaryView => $utf8Type,
DataType::Null => DataType::Null,
DataType::Dictionary(_, value_type) => match **value_type {
DataType::LargeUtf8 | DataType::LargeBinary => $largeUtf8Type,
Expand Down Expand Up @@ -183,6 +183,21 @@ pub mod test {
};
}

use arrow::datatypes::DataType;
#[allow(unused_imports)]
pub(crate) use test_function;

use super::*;

#[test]
fn string_to_int_type() {
let v = utf8_to_int_type(&DataType::Utf8, "test").unwrap();
assert_eq!(v, DataType::Int32);

let v = utf8_to_int_type(&DataType::Utf8View, "test").unwrap();
assert_eq!(v, DataType::Int32);

let v = utf8_to_int_type(&DataType::LargeUtf8, "test").unwrap();
assert_eq!(v, DataType::Int64);
}
}
6 changes: 3 additions & 3 deletions datafusion/physical-plan/src/coalesce_batches.rs
Original file line number Diff line number Diff line change
Expand Up @@ -270,14 +270,14 @@ impl RecordBatchStream for CoalesceBatchesStream {
}
}

/// Heuristically compact [`StringViewArray`]s to reduce memory usage, if needed
/// Heuristically compact `StringViewArray`s to reduce memory usage, if needed
///
/// This function decides when to consolidate the StringView into a new buffer
/// to reduce memory usage and improve string locality for better performance.
///
/// This differs from [`StringViewArray::gc`] because:
/// This differs from `StringViewArray::gc` because:
/// 1. It may not compact the array depending on a heuristic.
/// 2. It uses a larger default block size (2MB) to reduce the number of buffers to track.
/// 2. It uses a precise block size to reduce the number of buffers to track.
///
/// # Heuristic
///
Expand Down