
Conversation

@yuvmen (Member) commented Sep 18, 2025

In preparation for switching to considering token length instead of error frame count, we take metrics of the token length of stacktraces being sent, to be able to map out the statistics and the impact that change would make. Instrumented get_token_count to monitor how long it takes.

Introduces usage of the `tokenizers` library for token counting. Adds the local tokenization model to Sentry so tokenization can run without external dependencies.
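
A rough sketch of the shape this takes (not the exact implementation; get_stacktrace_string, get_grouping_info_from_variants, and metrics are existing Sentry helpers referenced in the diff, the metric name is illustrative, and the tokenizer mirrors the jinaai/jina-embeddings-v2-base-en model added here):

import time

import sentry_sdk
from tokenizers import Tokenizer

# get_stacktrace_string, get_grouping_info_from_variants, and metrics come from
# Sentry's existing grouping/metrics modules.
_TOKENIZER = Tokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")


def get_token_count(event, variants) -> int:
    """Return the token length of the event's stacktrace string (0 on failure)."""
    start = time.monotonic()
    try:
        # Prefer the stacktrace string cached on the event during ingest.
        stacktrace_text = event.data.get("stacktrace_string")
        if stacktrace_text is None:
            stacktrace_text = get_stacktrace_string(get_grouping_info_from_variants(variants))
        if not stacktrace_text:
            return 0
        return len(_TOKENIZER.encode(stacktrace_text).ids)
    except Exception:
        # Token counting is best-effort; never fail ingestion over it.
        sentry_sdk.capture_exception()
        return 0
    finally:
        metrics.timing("grouping.similarity.get_token_count", time.monotonic() - start)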


Note

Add token-count metrics for Seer stacktraces using a transformers tokenizer, gate via option, adjust ingest checks, and update tests.

  • Seer Similarity (backend):
    • Add token counting for stacktraces: get_token_count and report_token_count_metric using transformers.AutoTokenizer with jinaai/jina-embeddings-v2-base-en.
    • Emit grouping.similarity.token_count distribution and timing via metrics.timer; capture exceptions with sentry_sdk.
    • Reorder ingest checks in should_call_seer_for_grouping to run _has_too_many_contributing_frames after _has_empty_stacktrace_string.
  • Options:
    • Register seer.similarity.token_count_metrics_enabled (Bool, default True).
  • Dependencies:
    • Add transformers>=4.21.0.
  • Tests:
    • Add token count tests and adjust assertions to accommodate additional metric calls.

Written by Cursor Bugbot for commit d079587.

…Seer

In preparation for switching to considering token length instead of error frame count,
we take metrics of the token length of stacktraces being sent, to be able to map out the statistics
and the impact that change would make. Instrumented get_token_count to monitor how long it takes.
We use the existing tiktoken library, which was already in use in Sentry.
@yuvmen yuvmen requested a review from a team as a code owner September 18, 2025 22:22
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Sep 18, 2025
@yuvmen yuvmen requested a review from lobsterkatie September 18, 2025 22:22
@yuvmen yuvmen changed the title feat(ai_grouping) - Send token length metrics on stacktraces sent to Seer feat(ai_grouping): Send token length metrics on stacktraces sent to Seer Sep 18, 2025
cursor[bot]

This comment was marked as outdated.

Comment on lines 338 to 368
            sample_rate=options.get("seer.similarity.metrics_sample_rate"),
            tags={**shared_tags, "outcome": "block"},
        )
        metrics.distribution(
            "grouping.similarity.token_count",
            get_token_count(event, variants),
            sample_rate=options.get("seer.similarity.metrics_sample_rate"),
            tags={
                "platform": platform,
                "frame_check_outcome": "block",
            },
        )
        return True

    metrics.incr(
        "grouping.similarity.frame_count_filter",
        sample_rate=options.get("seer.similarity.metrics_sample_rate"),
        tags={**shared_tags, "outcome": "pass"},
    )
    metrics.distribution(
        "grouping.similarity.token_count",
        get_token_count(event, variants),
        sample_rate=options.get("seer.similarity.metrics_sample_rate"),
        tags={
            "platform": platform,
            "frame_check_outcome": "pass",
        },
    )
    return False


Contributor

Potential bug: The has_too_many_contributing_frames function calls get_token_count twice in its main execution path, instead of reusing the result from the first call.
  • Description: In the has_too_many_contributing_frames function, the token count is calculated multiple times unnecessarily. After an initial call to get_token_count whose result is stored in the token_count variable, the function is called again for metrics reporting in both the "block" and "pass" branches, instead of reusing the value from the token_count variable. This leads to a second, redundant execution of an expensive operation in a critical event ingestion path, impacting performance for every event processed by Seer.

  • Suggested fix: In the "block" and "pass" branches of has_too_many_contributing_frames, reuse the token_count variable for the metrics.timing call instead of calling get_token_count a second time. The "bypass" branch should also be refactored to avoid a separate call, perhaps by calculating the token count once at the top of the function.
    severity: 0.6, confidence: 0.95
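
A minimal sketch of that refactor, assuming the function shape shown in the excerpt above (options, metrics, and get_token_count are the same names as in the excerpt; _exceeds_frame_limit is a hypothetical stand-in for the existing frame-count check):

def _has_too_many_contributing_frames(event, variants, platform, shared_tags) -> bool:
    # Compute once, reuse in whichever branch we take.
    token_count = get_token_count(event, variants)
    sample_rate = options.get("seer.similarity.metrics_sample_rate")

    def _report(outcome: str) -> None:
        metrics.incr(
            "grouping.similarity.frame_count_filter",
            sample_rate=sample_rate,
            tags={**shared_tags, "outcome": outcome},
        )
        metrics.distribution(
            "grouping.similarity.token_count",
            token_count,
            sample_rate=sample_rate,
            tags={"platform": platform, "frame_check_outcome": outcome},
        )

    if _exceeds_frame_limit(event, variants):  # hypothetical stand-in for the existing check
        _report("block")
        return True

    _report("pass")
    return False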


@codecov

codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 96.42857% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/sentry/seer/similarity/utils.py 96.36% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #99873      +/-   ##
==========================================
+ Coverage   78.81%   81.30%   +2.49%     
==========================================
  Files        8699     8593     -106     
  Lines      385940   383015    -2925     
  Branches    24413    23858     -555     
==========================================
+ Hits       304162   311429    +7267     
+ Misses      81427    71243   -10184     
+ Partials      351      343       -8     

    try:
        timer_tags["has_content"] = False

        # Get the encoding for cl100k_base (used by GPT-3.5-turbo/GPT-4)
Member

we should double check that this is the same/similar enough to the tokenizer our embeddings model uses

Member Author

hmm, that's actually a good point. I used what we use in Sentry; in Seer it's actually something else. I looked and it seems like the actual models we use in Seer are environment dependent: locally it uses a dummy and in production they come from Google storage. I'll ask the AI team, let's see.

Member

Oh, I thought it was the same one! Yeah, I def think we should try to align if possible.

        timer_tags["has_content"] = False

        # Get the encoding for cl100k_base (used by GPT-3.5-turbo/GPT-4)
        encoding = tiktoken.get_encoding("cl100k_base")
Member

does this line have to be done every function call, or could it be static?
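
It could, for example, be loaded once and memoized; a sketch, assuming tiktoken stays the tokenizer (the helper name is hypothetical):

from functools import lru_cache

import tiktoken


@lru_cache(maxsize=1)
def _get_encoding() -> tiktoken.Encoding:
    # Loading the BPE ranks is the expensive part; do it once per process.
    return tiktoken.get_encoding("cl100k_base")


# Inside get_token_count:
# token_count = len(_get_encoding().encode(stacktrace_text))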

            timer_tags["source"] = "no_stacktrace_string"
            return 0

    except Exception:
Member

should we report the exception to sentry?
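
For example, the except branch could capture it (a sketch; per the Note above, the final version does capture exceptions with sentry_sdk):

    except Exception:
        # Report to Sentry but keep token counting best-effort: a tokenizer
        # failure should never break event ingestion.
        sentry_sdk.capture_exception()
        return 0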

)


def get_token_count(
Member

i'm wondering if we should put this behind an option for the initial deploy just to be safe?

Member Author

a part of me feels like since I am wrapping it with a try it should be fine, but I don't mind being extra safe here. You mean like internally inside this get_token_count, to not even perform the try and just return a zero? Or around even sending the metrics?

Member

just around being able to debug any exceptions / know why they're happening

Member

oh whoops, I replied to the wrong comment. I mean just around running the function to report the metric, in case for whatever reason there's a problem when we deploy and we need to roll back (thinking perf/memory/whatever problem).
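
A sketch of that gate, using the seer.similarity.token_count_metrics_enabled option this PR registers (the function body is illustrative; options, metrics, and get_token_count are the existing names from the diff):

def report_token_count_metric(event, variants, platform) -> None:
    # Kill switch for the whole code path, so a bad deploy can be rolled back
    # by flipping the option instead of reverting the release.
    if not options.get("seer.similarity.token_count_metrics_enabled"):
        return

    metrics.distribution(
        "grouping.similarity.token_count",
        get_token_count(event, variants),
        sample_rate=options.get("seer.similarity.metrics_sample_rate"),
        tags={"platform": platform},
    )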

Member

@JoshFerge JoshFerge left a comment

makes sense to me.



def get_token_count(
    event: Event | GroupEvent, variants: dict[str, BaseVariant] | None = None
Member

@lobsterkatie lobsterkatie Sep 19, 2025

Is there a reason variants is typed as being nullable? AFAIK that shouldn't be possible.

            data={"title": "Broken event"},
        )

        # Should not raise an exception, even with bad data
Member

@lobsterkatie lobsterkatie Sep 19, 2025

Does this in fact raise an exception inside the function? Can we assert on that?

Member Author

I will be adding a report to Sentry there, and I will assert it gets called

@yuvmen (Member Author) commented Sep 19, 2025

refactored a bunch, could use another look - still up in the air whether tiktoken is going to be a good gauge for our Seer token count, so I will need to see about that

cursor[bot]

This comment was marked as outdated.


def report_token_count_metric(
    event: Event | GroupEvent,
    variants: dict[str, BaseVariant] | None,
Member

Same as before - this shouldn't be nullable.

Member Author

bah cursor keeps adding it for some reason 🤦

    try:
        timer_tags["has_content"] = False
        timer_tags["cached"] = event.data.get("stacktrace_string") is not None
        timer_tags["source"] = "stacktrace_string"
Member

Suggested change
timer_tags["source"] = "stacktrace_string"
timer_tags["source"] = "cached_string"

(To better differentiate from the get_stacktrace_string source.)

Member Author

but there's a "cached" tag, though I guess I can just lose that and unify it

Member Author

actually I prefer it like this, I don't want to have to write "non_cached_string" on the other, and I don't want it to just say "stacktrace_string" and have someone need to know "cached_string" also exists.

        stacktrace_text = event.data.get("stacktrace_string")

        if stacktrace_text is None:
            stacktrace_text = get_stacktrace_string(get_grouping_info_from_variants(variants))
Member

Suggested change
stacktrace_text = get_stacktrace_string(get_grouping_info_from_variants(variants))
stacktrace_text = get_stacktrace_string(get_grouping_info_from_variants(variants))
timer_tags["source"] = "get_stacktrace_string"

(To address cursor's complaint, which I think is valid.)

Member Author

not sure I understand this one, adding the source tag here? I set it above always, and I don't see any cursor comment about this

Member

This is the comment I'm talking about:

(screenshot of the Cursor comment referenced above)

So yeah, if you have the cached boolean separately (which I wasn't thinking about), then how you had it originally is fine. If you go with unifying them, though, having cached_stacktrace and get_stacktrace_string as two different values tells you whether or not it was there, just like cached would.


    def test_handles_exception_gracefully(self) -> None:
        """Test that get_token_count handles exceptions gracefully and returns 0."""
        # Create an event with cached stacktrace that will cause tiktoken to fail
Member

I feel like this is a misleading comment - the only reason tiktoken fails is because we later force it to, not because of the cached stacktrace.

        mock_encode.side_effect = ValueError("Tiktoken encoding failed")

        with patch("sentry.seer.similarity.utils.logger.error") as mock_logger_error:
            token_count = get_token_count(broken_event, variants=None, platform="python")
Member

Suggested change
token_count = get_token_count(broken_event, variants=None, platform="python")
token_count = get_token_count(broken_event, variants={}, platform="python")

(Since variants should be non-nullable.)
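
A sketch of how the test could assert on the report, under the assumptions from this thread: the except branch ends up calling sentry_sdk.capture_exception, get_tokenizer is the patchable entry point in sentry.seer.similarity.utils, and the event fixture is a simple stand-in whose data carries a cached stacktrace string so the tokenizer is actually reached:

from unittest.mock import Mock, patch

from sentry.seer.similarity.utils import get_token_count


    def test_handles_exception_gracefully(self) -> None:
        # Stand-in event with a cached stacktrace string, so get_token_count
        # reaches the tokenizer instead of returning early.
        broken_event = Mock(data={"stacktrace_string": "some cached stacktrace"})

        with (
            patch("sentry.seer.similarity.utils.get_tokenizer") as mock_get_tokenizer,
            patch("sentry.seer.similarity.utils.sentry_sdk.capture_exception") as mock_capture,
        ):
            mock_get_tokenizer.return_value.encode.side_effect = ValueError("encoding failed")
            token_count = get_token_count(broken_event, variants={}, platform="python")

        assert token_count == 0
        mock_capture.assert_called_once()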

cursor[bot]

This comment was marked as outdated.

]

IGNORED_FILENAMES = ["<compiler-generated>"]
TOKENIZER = Tokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")
Contributor

Bug: Global Tokenizer Initialization Causes Startup Delays

The TOKENIZER is initialized globally at module import time via Tokenizer.from_pretrained(). This can significantly slow down application startup, make unnecessary network requests, and potentially crash the module import if the model fails to load, even when the token counting feature is disabled.
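
One common way to avoid that is to defer loading until first use; a sketch, where TOKENIZER_MODEL_PATH points at the local model file a later commit adds under data/models (the exact path here is illustrative), and the pinned revision comes from the review suggestion further down:

import os
from functools import lru_cache

from tokenizers import Tokenizer

TOKENIZER_MODEL_PATH = "data/models/jina-embeddings-v2-base-en/tokenizer.json"  # illustrative path


@lru_cache(maxsize=1)
def get_tokenizer() -> Tokenizer:
    # Nothing is loaded at import time; the first caller pays the cost once.
    if os.path.exists(TOKENIZER_MODEL_PATH):
        return Tokenizer.from_file(TOKENIZER_MODEL_PATH)
    # Remote fallback, pinned to a known revision (see the review discussion below).
    return Tokenizer.from_pretrained(
        "jinaai/jina-embeddings-v2-base-en",
        revision="322d4d7e2f35e84137961a65af894fda0385eb7a",
    )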


…cal file

Saved the model locally under data/models and added a README for downloading it again.

        if stacktrace_text:
            timer_tags["has_content"] = True
            return len(get_tokenizer().encode(stacktrace_text))
Contributor

Bug: Token Count Calculation and Stacktrace Metrics Issues

The get_token_count function incorrectly calculates token counts by using len() on the tokenizer's Encoding object instead of its .ids attribute. It also reports misleading timer_tags["source"] metrics, showing "stacktrace_string" even when the stacktrace is generated rather than cached. Furthermore, calling get_stacktrace_string with empty variants may lead to a crash.
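
If the length semantics of the returned Encoding object are in doubt, counting the .ids list is unambiguous; a two-line sketch using the same names as the excerpt above:

encoding = get_tokenizer().encode(stacktrace_text)
token_count = len(encoding.ids)  # number of token ids, independent of how Encoding defines its length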


            token_count = get_token_count(broken_event, variants=variants, platform="python")
            mock_logger_exception.assert_called()

        assert token_count == 0
Contributor

Bug: Test Fails to Validate Tokenizer Exception Handling

The test_handles_exception_gracefully test doesn't properly verify tokenizer exception handling. It attempts to mock a non-existent TOKENIZER global, and get_token_count returns early when variants is empty, preventing the mocked exception from ever being reached.


        # Double-check pattern to avoid race conditions
        if self._tokenizer is None:
            # Try to load from local model first, fallback to remote
            if os.path.exists(TOKENIZER_MODEL_PATH):
Contributor

this should always be true, no? the remote fallback might not be completely secure or successful (e.g., jinaai makes the huggingface repo private)

if for some reason we do want the remote fallback, then it should supply a pinned revision:

self._tokenizer = Tokenizer.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en",
    revision="322d4d7e2f35e84137961a65af894fda0385eb7a",
)

Member Author

@yuvmen yuvmen Oct 2, 2025

hmm yeah, I was conflicted on the remote fallback. Looking at it again I am not sure it makes sense, I think I will just remove it. If this stays in our source code it should not get broken, and tests will fail if it does. I will consider it if I end up having to host the model file.

@yuvmen yuvmen merged commit 7d8584b into master Oct 14, 2025
68 checks passed
@yuvmen yuvmen deleted the yuvmen/token-count-stacktraces-poc branch October 14, 2025 20:01
@yuvmen yuvmen added the Trigger: Revert Add to a merged PR to revert it (skips CI) label Oct 14, 2025
@getsentry-bot
Contributor

PR reverted: 982c529

getsentry-bot added a commit that referenced this pull request Oct 14, 2025
yuvmen added a commit that referenced this pull request Oct 15, 2025
…eer (#101477)

In preparation for switching to considering token length instead of
frame count of errors, we take metrics of the token length of
stacktraces being sent, to be able to map out the statistics and the
impact that change would make. Instrumented get_token_count to monitor
how long it takes.

Introduces usage of tokenizers library for token count. Added the local
tokenization model to Sentry to be used for tokenization without
external dependencies.

Redo of #99873 which removed `tiktoken` dep by mistake. It is still used
in `getsentry` and causes build errors if removed.
chromy pushed a commit that referenced this pull request Oct 17, 2025
…eer (#99873)

In preparation for switching to considering token length instead of
frame count of errors, we take metrics of the token length of
stacktraces being sent, to be able to map out the statistics and the
impact that change would make. Instrumented get_token_count to monitor
how long it takes.

Introduces usage of `tokenizers` library for token count. Added the
local tokenization model to Sentry to be used for tokenization without
external dependencies.
chromy pushed a commit that referenced this pull request Oct 17, 2025
@github-actions github-actions bot locked and limited conversation to collaborators Oct 30, 2025