Skip to content

xds: client outlier detection metrics [A91] #8437

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

PardhuKonakanchi
Copy link

@PardhuKonakanchi PardhuKonakanchi commented Jul 8, 2025

Implements A91 in grpc-go

Two counters have been added according to the gRFC within the outlier detection algorithm on either ejections or unenforced ejection.

Currently, grpc.lb.backend_service is an optional label with an empty string as per the design until A89 and A75 are both implemented in grpc-go, after which the string can be updated for these metrics.

RELEASE NOTES:

  • outlierdetection: add metrics for enforced (grpc.lb.outlier_detection.ejections_enforced) and unenforced (grpc.lb.outlier_detection.ejections_unenforced) outlier ejections.

Tests

Updating existing outlier detection balancer tests to check if ejection stats are measured

Copy link

codecov bot commented Jul 8, 2025

Codecov Report

Attention: Patch coverage is 60.00000% with 14 lines in your changes missing coverage. Please review.

Project coverage is 82.22%. Comparing base (a21e374) to head (19021a8).
Report is 8 commits behind head on master.

Files with missing lines Patch % Lines
xds/internal/balancer/outlierdetection/balancer.go 60.00% 10 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8437      +/-   ##
==========================================
- Coverage   82.26%   82.22%   -0.05%     
==========================================
  Files         414      414              
  Lines       40410    40454      +44     
==========================================
+ Hits        33243    33262      +19     
- Misses       5796     5814      +18     
- Partials     1371     1378       +7     
Files with missing lines Coverage Δ
xds/internal/balancer/outlierdetection/balancer.go 86.33% <60.00%> (-2.45%) ⬇️

... and 16 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@eshitachandwani eshitachandwani self-requested a review July 11, 2025 18:09
@eshitachandwani eshitachandwani self-assigned this Jul 11, 2025
@eshitachandwani eshitachandwani added the Area: xDS Includes everything xDS related, including LB policies used with xDS. label Jul 11, 2025
endpoints: resolver.NewEndpointMap[*endpointInfo](),
metricsRecorder: cc.MetricsRecorder(),
target: bOpts.Target.String(),
backendService: "", // Will be set in UpdateClientConnState
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed since when nothing is set, fields take default zero value which is an empty string. The comment can be put when writing the definition of the struct here

@eshitachandwani eshitachandwani added the Type: Feature New features or improvements in behavior label Jul 14, 2025
@eshitachandwani eshitachandwani added this to the 1.75 Release milestone Jul 14, 2025
@eshitachandwani
Copy link
Member

Also please add release notes.

@eshitachandwani
Copy link
Member

Adding @dfawley for a second review.

@PardhuKonakanchi
Copy link
Author

go test -v ./internal/transport is failing in CI checks, but running locally with go 1.23.0 and 1.24.5 passed for me. I'm presuming the failure is unrelated to this PR's changes

@arjan-bal
Copy link
Contributor

go test -v ./internal/transport is failing in CI checks, but running locally with go 1.23.0 and 1.24.5 passed for me. I'm presuming the failure is unrelated to this PR's changes

Hi @PardhuKonakanchi, the test failure is due to a recent bug introduced in the master branch. The fix has been merged in #8451. I'll re-trigger CI.

@PardhuKonakanchi
Copy link
Author

@dfawley Gentle ping for a review on this one, thanks!

@eshitachandwani eshitachandwani self-requested a review July 25, 2025 05:28
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you very much. I have just a couple minor comments and test requests inline.

pickerUpdateCh: buffer.NewUnbounded(),
channelzParent: bOpts.ChannelzParent,
endpoints: resolver.NewEndpointMap[*endpointInfo](),
metricsRecorder: cc.MetricsRecorder(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: since we have cc captured, maybe we don't also want to save the MetricsRecorder in another field? Less state seems better.

Comment on lines +160 to +168
// extractBackendService extracts the backend service from resolver state attributes.
// This is a placeholder implementation - the actual extraction logic should be
// implemented based on the specific resolver attributes available.
func extractBackendService(resolver.State) string {
// TODO: Implement backend service extraction from resolver attributes per A89 and A75
// For now, return empty string as this is optional
return ""
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please remove this? If it's not clear this needs to be done as part of A75/A89 implementation work, then you could file a bug so that we don't forget about it. Thanks!

channelzParent channelz.Identifier
metricsRecorder estats.MetricsRecorder
target string
backendService string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be removed along with the above.

if float64(b.numEndpointsEjected)/float64(b.endpoints.Len())*100 >= float64(b.cfg.MaxEjectionPercent) {
// Record unenforced ejection due to max ejection percentage
ejectionsUnenforcedMetric.Record(b.metricsRecorder, 1, b.target, "success_rate", "max_ejection_overflow", b.backendService)
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add tests that include the cases that are currently missed? I.e. ejection did not occur for various reasons.

@dfawley dfawley removed their assignment Jul 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants