Improve type coercion and casting #8302

@jayzhan211

Is your feature request related to a problem or challenge?

I think there is room for improvement in type coercion and casting.

Background

comparison_coercion is widely used in DataFusion and performs a lossless conversion:
https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/binary.rs

can_coerce_from is used mainly for function signatures and also performs a lossless conversion:
https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/functions.rs

can_cast_types comes from arrow-cast and allows lossy conversions. It is also used in some of the building blocks of comparison_coercion. https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76-L273

I am not sure whether there are other coercion paths that I have missed.

Proposal

comparison_coercion and can_coerce_from seem to do similar things, so maybe we can keep just one lossless conversion. If a lossless conversion is useful for arrow-rs, we can introduce a lossless version of can_cast_types there and then rely on it in DataFusion.

Lossy conversion vs Lossless

I think a conversion is lossy if the original value cannot always be recovered by casting back; otherwise it is lossless.

Lossy

  • Int32 to Int16 / Int8

Lossless

  • Int32 to Int64
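The round-trip definition above can be checked directly on Rust's primitive integers. This is only a minimal sketch: the helper names are hypothetical, and Rust's `as` wraps on overflow, which may differ from how the arrow cast kernel treats out-of-range values.

```rust
/// Hypothetical helper: does `v` survive Int32 -> Int16 -> Int32?
/// Narrowing with `as` truncates, so values outside the i16 range are not recovered.
fn round_trips_via_i16(v: i32) -> bool {
    (v as i16) as i32 == v
}

/// Hypothetical helper: does `v` survive Int32 -> Int64 -> Int32?
/// Widening keeps every bit, so this holds for every i32.
fn round_trips_via_i64(v: i32) -> bool {
    (v as i64) as i32 == v
}
```

Under this definition, Int32 to Int16 is lossy because at least one input (for example 70_000) fails the round trip, while Int32 to Int64 is lossless because no input fails it.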

Describe the solution you'd like

  1. Replace can_coerce_from with the building blocks of comparison_coercion: numeric coercion, list coercion, string coercion, null coercion, etc.
  2. Split list_coercion out of string_coercion so that each coercion building block has one clear task: list_coercion handles List / FixedSizeList / LargeList coercion, and string_coercion handles Utf8 / LargeUtf8 coercion.
  3. Introduce these lossless coercions to arrow-rs?
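One way to picture the proposed split is a set of small, single-purpose functions composed into one entry point. The sketch below is not DataFusion's code: `Ty` is a hypothetical, heavily reduced stand-in for arrow's `DataType`, and the function bodies are assumptions made only to illustrate the shape of the idea.

```rust
/// Hypothetical, simplified stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Null,
    Int32,
    Int64,
    Utf8,
    LargeUtf8,
    List(Box<Ty>),
    LargeList(Box<Ty>),
}

/// Null coerces to whatever the other side is.
fn null_coercion(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    match (lhs, rhs) {
        (Ty::Null, other) | (other, Ty::Null) => Some(other.clone()),
        _ => None,
    }
}

/// Integers widen to the larger of the two types.
fn numeric_coercion(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    match (lhs, rhs) {
        (Ty::Int32, Ty::Int32) => Some(Ty::Int32),
        (Ty::Int32 | Ty::Int64, Ty::Int32 | Ty::Int64) => Some(Ty::Int64),
        _ => None,
    }
}

/// Only Utf8 / LargeUtf8; no list handling in here.
fn string_coercion(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    match (lhs, rhs) {
        (Ty::Utf8, Ty::Utf8) => Some(Ty::Utf8),
        (Ty::Utf8 | Ty::LargeUtf8, Ty::Utf8 | Ty::LargeUtf8) => Some(Ty::LargeUtf8),
        _ => None,
    }
}

/// Only List / LargeList; element types are coerced recursively.
fn list_coercion(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    match (lhs, rhs) {
        (Ty::List(l), Ty::List(r)) => Some(Ty::List(Box::new(coerce(l, r)?))),
        (Ty::List(l) | Ty::LargeList(l), Ty::List(r) | Ty::LargeList(r)) => {
            Some(Ty::LargeList(Box::new(coerce(l, r)?)))
        }
        _ => None,
    }
}

/// Single lossless entry point built from the blocks above.
fn coerce(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    null_coercion(lhs, rhs)
        .or_else(|| numeric_coercion(lhs, rhs))
        .or_else(|| string_coercion(lhs, rhs))
        .or_else(|| list_coercion(lhs, rhs))
}
```

With this shape, `coerce(&Ty::List(Box::new(Ty::Int32)), &Ty::LargeList(Box::new(Ty::Int64)))` yields `Some(Ty::LargeList(Box::new(Ty::Int64)))`, and an unrelated pair such as Int32 and Utf8 yields `None`. A signature check could then call the same entry point instead of keeping a separate rule table.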

Known issue or question I have

  • List coercion currently lives inside string_concat_coercion and should be extracted into a separate list_coercion.
  • can_coerce_from has no list coercion.
  • can_coerce_from allows Decimal128 to be coerced to Float64. Why?
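The Decimal128 question matters because that conversion is not lossless under the round-trip definition. Decimal128 stores its unscaled value as an i128, while Float64 has a 53-bit significand. The sketch below uses a hypothetical helper and ignores the decimal scale, which would add further rounding in a real conversion.

```rust
/// Hypothetical helper: does an unscaled Decimal128 value (an i128)
/// survive i128 -> f64 -> i128 unchanged?
fn survives_f64_round_trip(unscaled: i128) -> bool {
    (unscaled as f64) as i128 == unscaled
}
```

For instance, 9_007_199_254_740_992 (2^53) survives the round trip, but 9_007_199_254_740_993 (2^53 + 1) does not, because it has no exact f64 representation.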

Describe alternatives you've considered

If many custom conversions are needed, this change might not help at all. In that case we would need another approach to make type casting and coercion easy to use.

Additional context

No response
