feat(ai_grouping): Send token length metrics on stacktraces sent to Seer #101477
Conversation
…Seer In preparation for making the switch to token length being considered instead of the frame count of errors, we collect metrics on the token length of stacktraces being sent, so we can map out the statistics and the impact the change would make. Instrumented get_token_count to monitor how long it takes. We use the existing tiktoken library, which was already in use in Sentry.
We will be turning it off only if something goes wrong
Meant introducing a new `transformers` package to Sentry
…cal file Saved the model locally under `data/models` and added a README for downloading it again
It is still used in getsentry and causes the build to fail if removed
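A minimal sketch of what the metric collection described above might look like. Everything here is a hypothetical stand-in: `encode` replaces tiktoken's `Encoding.encode` with a whitespace split so the example runs without the dependency, and `record_metric` plus the metric names are illustrative, not Sentry's actual instrumentation.

```python
import time

METRICS = []  # stand-in metrics sink; the real code reports via Sentry's metrics backend

def record_metric(name, value):
    # Hypothetical helper: collect (name, value) pairs instead of emitting real metrics.
    METRICS.append((name, value))

def encode(text):
    # Stand-in for tiktoken's Encoding.encode; the real code would obtain token IDs
    # via tiktoken.get_encoding(...).encode(text). A whitespace split keeps this runnable.
    return text.split()

def get_token_count(stacktrace_text):
    """Return the token length of a stacktrace string, timing how long the count takes."""
    start = time.monotonic()
    count = len(encode(stacktrace_text))
    record_metric("grouping.get_token_count.duration", time.monotonic() - start)
    record_metric("grouping.stacktrace_token_length", count)
    return count
```

The duration metric is what lets the team compare tokenization cost against the simpler frame-count approach before switching over.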
return 0
stacktrace_text = get_stacktrace_string(get_grouping_info_from_variants(variants))

if stacktrace_text:
 Potential bug: get_token_count calls get_grouping_info_from_variants, which returns data with keys incompatible with the downstream get_stacktrace_string function, causing incorrect calculations.
Description: The `get_token_count` function calls `get_grouping_info_from_variants` to generate grouping information when a cached stacktrace string is not available. This function creates a dictionary with keys like `app_stacktrace`. However, the downstream `get_stacktrace_string` function, which consumes this data, expects keys like `app` and `system`. This key mismatch causes `get_stacktrace_string` to find no relevant data, produce an empty string, and consequently makes `get_token_count` always return 0. This silently defeats the purpose of the new token count metrics collection.

Suggested fix: In `get_token_count`, replace the call to `get_grouping_info_from_variants(variants)` with `get_grouping_info_from_variants_legacy(variants)`. This will produce a data structure with the keys that `get_stacktrace_string` expects, allowing for correct token count calculation.
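The mismatch can be illustrated with a toy version of the two dict shapes. The key names come from the comment above; the function body is a hypothetical sketch, not Sentry's actual `get_stacktrace_string`:

```python
def get_stacktrace_string(grouping_info):
    # The consumer looks only for "app" / "system" keys; anything else is ignored.
    parts = [grouping_info[key] for key in ("app", "system") if grouping_info.get(key)]
    return "\n".join(parts)

# The non-legacy variants function produces keys like "app_stacktrace", which the
# consumer silently ignores, so the string comes back empty and the token count is 0.
new_style = {"app_stacktrace": "frame_a\nframe_b"}
legacy_style = {"app": "frame_a\nframe_b"}
```

Because the failure mode is an empty string rather than an exception, the bug is silent, which is why the metrics would have always reported 0.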
severity: 0.65, confidence: 0.95 
This was actually correct: @lobsterkatie recently made a change to `get_grouping_info_from_variants`, switching existing usages to `get_grouping_info_from_variants_legacy`, which needed to happen here as well
Also realized my tests didn't cover this case. While I realize we previously said `variants` shouldn't be empty, I guess it doesn't hurt to protect against it, since it's a dict and we can't be sure theoretically.
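A guard for the empty-`variants` case might look like this. This is a sketch under assumed names: the real `get_token_count` signature, its caching, and its tokenizer are Sentry internals, and the whitespace split here stands in for tiktoken-based counting.

```python
def get_token_count(variants, cached_stacktrace=None):
    # variants is a dict; it shouldn't be empty in practice, but since we can't
    # guarantee that, return 0 early rather than tokenize an empty string.
    if not variants and not cached_stacktrace:
        return 0
    text = cached_stacktrace or "\n".join(str(v) for v in variants.values())
    return len(text.split())  # whitespace split stands in for tiktoken counting
```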
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #101477      +/-   ##
===========================================
+ Coverage   78.81%    81.29%   +2.48%     
===========================================
  Files        8699      8595     -104     
  Lines      385940    383404    -2536     
  Branches    24413     23858     -555     
===========================================
+ Hits       304162    311698    +7536     
+ Misses      81427     71363   -10064     
+ Partials      351       343       -8
…cy` and add test for it
In preparation for making the switch to token length being considered instead of the frame count of errors, we collect metrics on the token length of stacktraces being sent, so we can map out the statistics and the impact the change would make. Instrumented get_token_count to monitor how long it takes.
Introduces usage of tokenizers library for token count. Added the local tokenization model to Sentry to be used for tokenization without external dependencies.
Redo of #99873, which removed the `tiktoken` dep by mistake. It is still used in `getsentry` and causes build errors if removed.