-
Notifications
You must be signed in to change notification settings - Fork 1k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently row conversion is stateful, relying on a separate RowConverter to maintain this global state.
This has a number of drawbacks:
- Prevents parallelizing row conversion across multiple threads
- Complicates APIs which have to manage mutable access to a RowConverter - https://github.com/apache/arrow-datafusion/blob/cf229b82cb53f1d9f430a2fa4ff5191ba52e5cef/datafusion/core/src/physical_plan/sorts/stream.rs#L38
- State accumulation can be problematic, necessitating workarounds like Implement
CardinalityAwareRowConverterwhile doing streaming merge datafusion#7401 - The dictionary interning logic is very expensive - Make dictionary preservation optional in row encoding #3831
Describe the solution you'd like
I would like to propose removing the dictionary preservation logic, instead always hydrating the dictionaries values when encoding. This in turn would allow simplifying the API to no longer have a notion of a stateful RowConverter.
This may represent a performance regression for dictionaries with small number of values. We should definitely quantify this, but it is my expectation that this will only occur for dictionaries with a very low number of values. It is currently the case that even arrays with low numbers of distinct values may contain non-trivial number of values as a result of the way dictionaries are handled by the various kernels and readers, and so I'm inclined to not weigh this very highly.
Describe alternatives you've considered
Additional context