Skip to content

Stable SortPreservingMergeStream #1686

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

                                                                                 
┌─────────────────────────┐  Merged Stream                                       
│ ┌───┬───┬───┬───┐       │                 ┌───────────────────────────────┐    
│ │ A │ B │ C │ D │ ...   │────────┐        │ ┌───┬───╦═══╦───┬───╦═══╗     │    
│ └───┴───┴───┴───┘       │        ├───────▶│ │ A │ B ║ B ║ C │ D ║ E ║ ... │    
└─────────────────────────┘        │        │ └───┴─▲─╩═══╩───┴───╩═══╝     │    
  Stream 1                         │        └───────────────────────────────┘    
                                   │                └ ─ ─ ─ ─ ─ ─ ┐              
                                   │                                             
┌─────────────────────────┐        │                              │              
│ ╔═══╦═══╗               │        │                                             
│ ║ B ║ E ║     ...       │────────┘                   Stable Sort: the merged   
│ ╚═══╩═══╝               │                         stream places equal rows from
└─────────────────────────┘                          stream 1 *before* stream 2  
  Stream 2                                                                       
                                                                                 

Thus, if stream 1 and stream 2 had the same record values, it was guaranteed that the row from stream1 came out before stream 2. IOx uses this property correctly deduplicate updates (as streams with a larger index were inserted later).

IOx uses this property correctly deduplicate updates (as streams with a larger index were inserted later).

SortPreservingMergeStream appears to no longer be stable after #1596 was merged Specifically, the previous implementation of SortPreservingMergeSteam would always pick inputs with a lower 'stream index' when the sort keys were tied.

Describe the solution you'd like
I would like SortPreservingMergeStream to be stable

Describe alternatives you've considered
We could add a synthetic "column" to our input and then add it to the sort key, but this seems like a larger overhead than necessary

Additional context
I have a simple fix for this. Will get it up shortly.

cc @yjshen

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions