- 
                Notifications
    
You must be signed in to change notification settings  - Fork 1.7k
 
STRING_AGG missing functionality #14412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
59ac6aa    to
    fb56b81      
    Compare
  
    fb56b81    to
    808e417      
    Compare
  
    | 
           Close/reopen to rerun CI checks  | 
    
808e417    to
    4abf8a0      
    Compare
  
    | NULL | ||
| 
               | 
          ||
| query T | ||
| SELECT STRING_AGG(DISTINCT x,',') FROM strings WHERE g > 100 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing space between the x and ','.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷♂️ It's the format that all the previous STRING_AGG tests were following, I kept it for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consistency always wins over correctness!
b36a9c5    to
    c022671      
    Compare
  
    e204059    to
    f6be8e4      
    Compare
  
    | pub struct StringAgg { | ||
| signature: Signature, | ||
| array_agg: ArrayAgg, | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I've seen this pattern much in DataFusion codebase about reusing another function as a building block. Any opinions about this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems good to me 👍
f6be8e4    to
    babc94b      
    Compare
  
    | #[cfg(test)] | ||
| mod tests { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm finding that having unit tests near the function definitions itself is significantly more ergonomic than writing sql logic tests for a couple of reasons:
- We can tests the accumulators in isolation, allowing for finer grained control about batch updating, state generation, merging different states from different accumulators, etc...
 - The time it takes since a developer makes a code change, until the appropriate test is run is reduced significantly:
 
cargo test --lib string_agg::tests --manifest-path   1.40s user 1.53s system 132% cpu 2.204 totalVS
cargo test --test sqllogictests  33.61s user 7.17s system 365% cpu 11.164 totalMeasured on a Mac M3
Not saying that we should not have sql logic tests, I think those are a must, but maybe having some testing tooling for folks to be able to contribute unit tests also here could improve DX.
I see that this is not an stablished pattern though, and I'm wondering what are people's take on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think that we should strive that sqllogictests cover all the "end user" visible behavior -- since they function as integration style test and make sure the functionality is all hooked up together correctly (not just working in isolation)
Unit tests / in module tests are good to cover cases that are hard to cover in sqllogictests
I think there is some more back story here: https://datafusion.apache.org/contributor-guide/testing.html#sqllogictests-tests
In terms of cycle times, sqllogictests do have the benefit you can update them without any code changes (so writing / updating them is sometimes faster than code), though you are right that testing code changes requires a recompilation
@logan-keede has been working on improving the build performance recently. Hopefully this will get better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 it looks there's still some cards that can be played for making compilation times faster then. 👍 thanks for all that info!
| 
           @geoffreyclaude (who is helping review) when you think this PR is ready, can you please ping me and I'll give it a review?  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty complete to me, you've covered all my initial suggestions!
| 
           @alamb all good for me! You can give the final review.  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @gabotechs and @geoffreyclaude -- I think this implementation is very nice, clean and well tested. 🏆
| pub struct StringAgg { | ||
| signature: Signature, | ||
| array_agg: ArrayAgg, | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems good to me 👍
Which issue does this PR close?
distinctandorder byclause forstring_aggaggregate function #8260.Rationale for this change
See #14413 first.
Complete the missing functionality of the STRING_AGG function.
What changes are included in this PR?
Adds support for DISTINCT and ORDER_BY clauses by reusing the existing ARRAY_AGG functionality and building the whole STRING_AGG aggregation function on top of it. This way, the full STRING_AGG functionality is automatically implemented [almost] for free.
The rationale for reusing the ARRAY_AGG functionality is because both functions are very similar, with just two minor diferences:
In order to have the full STRING_AGG functionality, some small addition is also needed for the ARRAY_AGG function, as the current implementation is missing support for DISTINCT + ORDER BY. See #14413.
Are these changes tested?
Yes, both in unit tests and sqllogictests.
Are there any user-facing changes?
Users will be able to issue STRING_AGG calls with DISTINCT and ORDER BY clauses.