feat(ai_grouping): Send token length metrics on stacktraces sent to Seer #101477
Conversation
…Seer In preparation for making the switch to token length being considered instead of the frame count of errors, we collect metrics on the token length of stacktraces being sent, so we can map out the statistics and the impact the change would make. Instrumented get_token_count to monitor how long it takes. We use the existing tiktoken library, which was already in use in Sentry.
We will be turning it off only if something goes wrong
Meant introducing a new `transformers` package to Sentry
…cal file Saved the model locally under `data/models` and added a README for downloading it again
It is still used in getsentry and causes the build to fail if removed
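A minimal sketch of what the metric collection described above might look like. Everything here is a hypothetical stand-in: `encode` replaces tiktoken's `Encoding.encode` with a whitespace split so the example runs without the dependency, and `record_metric` plus the metric names are illustrative, not Sentry's actual instrumentation.

```python
import time

METRICS = []  # stand-in metrics sink; the real code reports via Sentry's metrics backend

def record_metric(name, value):
    # Hypothetical helper: collect (name, value) pairs instead of emitting real metrics.
    METRICS.append((name, value))

def encode(text):
    # Stand-in for tiktoken's Encoding.encode; the real code would obtain token IDs
    # via tiktoken.get_encoding(...).encode(text). A whitespace split keeps this runnable.
    return text.split()

def get_token_count(stacktrace_text):
    """Return the token length of a stacktrace string, timing how long the count takes."""
    start = time.monotonic()
    count = len(encode(stacktrace_text))
    record_metric("grouping.get_token_count.duration", time.monotonic() - start)
    record_metric("grouping.stacktrace_token_length", count)
    return count
```

The duration metric is what lets the team compare tokenization cost against the simpler frame-count approach before switching over.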
return 0
stacktrace_text = get_stacktrace_string(get_grouping_info_from_variants(variants))

if stacktrace_text:
 Potential bug: get_token_count calls get_grouping_info_from_variants, which returns data with keys incompatible with the downstream get_stacktrace_string function, causing incorrect calculations.
Description: The `get_token_count` function calls `get_grouping_info_from_variants` to generate grouping information when a cached stacktrace string is not available. This function creates a dictionary with keys like `app_stacktrace`. However, the downstream `get_stacktrace_string` function, which consumes this data, expects keys like `app` and `system`. This key mismatch causes `get_stacktrace_string` to find no relevant data, produce an empty string, and consequently makes `get_token_count` always return 0. This silently defeats the purpose of the new token count metrics collection.

Suggested fix: In `get_token_count`, replace the call to `get_grouping_info_from_variants(variants)` with `get_grouping_info_from_variants_legacy(variants)`. This will produce a data structure with the keys that `get_stacktrace_string` expects, allowing for correct token count calculation.
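The mismatch can be illustrated with a toy version of the two dict shapes. The key names come from the comment above; the function body is a hypothetical sketch, not Sentry's actual `get_stacktrace_string`:

```python
def get_stacktrace_string(grouping_info):
    # The consumer looks only for "app" / "system" keys; anything else is ignored.
    parts = [grouping_info[key] for key in ("app", "system") if grouping_info.get(key)]
    return "\n".join(parts)

# The non-legacy variants function produces keys like "app_stacktrace", which the
# consumer silently ignores, so the string comes back empty and the token count is 0.
new_style = {"app_stacktrace": "frame_a\nframe_b"}
legacy_style = {"app": "frame_a\nframe_b"}
```

Because the failure mode is an empty string rather than an exception, the bug is silent, which is why the metrics would have always reported 0.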
severity: 0.65, confidence: 0.95 
This was actually correct: @lobsterkatie recently made a change to `get_grouping_info_from_variants`, switching existing usages to `get_grouping_info_from_variants_legacy`, which needed to happen here as well
Also realized my tests didn't cover this case. While I realize we previously said `variants` shouldn't be empty, I guess it doesn't hurt to protect against it, since it's a dict and we can't be sure theoretically.
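A guard for the empty-`variants` case might look like this. This is a sketch under assumed names: the real `get_token_count` signature, its caching, and its tokenizer are Sentry internals, and the whitespace split here stands in for tiktoken-based counting.

```python
def get_token_count(variants, cached_stacktrace=None):
    # variants is a dict; it shouldn't be empty in practice, but since we can't
    # guarantee that, return 0 early rather than tokenize an empty string.
    if not variants and not cached_stacktrace:
        return 0
    text = cached_stacktrace or "\n".join(str(v) for v in variants.values())
    return len(text.split())  # whitespace split stands in for tiktoken counting
```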
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #101477      +/-   ##
===========================================
+ Coverage   78.81%    81.29%   +2.48%     
===========================================
  Files        8699      8595     -104     
  Lines      385940    383404    -2536     
  Branches    24413     23858     -555     
===========================================
+ Hits       304162    311698    +7536     
+ Misses      81427     71363   -10064     
+ Partials      351       343       -8
…cy` and add test for it
In preparation for making the switch to token length being considered instead of the frame count of errors, we collect metrics on the token length of stacktraces being sent, so we can map out the statistics and the impact the change would make. Instrumented get_token_count to monitor how long it takes.
Introduces usage of tokenizers library for token count. Added the local tokenization model to Sentry to be used for tokenization without external dependencies.
Redo of #99873, which removed the `tiktoken` dep by mistake. It is still used in `getsentry` and causes build errors if removed.