Update metrics_utils for future global metrics aggregation in controller. #55568
Conversation
Code Review
This pull request refactors the metrics aggregation logic for autoscaling by introducing new generic aggregation methods in InMemoryMetricsStore and a consolidate_metrics_stores function. The changes are a good step towards improving the separation of concerns. However, I've identified a few critical issues where None return values from aggregation functions are not handled, which will lead to TypeError exceptions. Additionally, there's a bug in consolidate_metrics_stores that could cause data loss for a specific metric, and a minor API design concern. Please review the detailed comments for suggestions on how to address these points.
Force-pushed from 1424436 to 0988170
```python
    return self._aggregate_reduce(keys, statistics.mean)

# ...

def consolidate_metrics_stores(*stores: InMemoryMetricsStore) -> InMemoryMetricsStore:
```
can you explain which serve component would call this function?
@abrarsheikh This would be called in the 'autoscaler_state_manager' to consolidate metrics from multiple handles. AFAIK there is no guarantee that replica-handle metrics are one-to-one (based on current logic), so I wanted to handle the case of the same replica reporting through different handles with a different cadence.
can we add a docstring explaining inputs and return value? Since there are 2 axes, handles and replicas, it would be helpful if the docstring could explain with that context in mind.
Added here, as well as to the aggregate functions, since the tuple return is new.
You mentioned in the other PR to use a single aggregate function which takes an enum. Let me know if you want that here. I'd need to import the enum into test_metrics_utils, replica.py, and router.py, which I'm not sure is preferable.
```python
if key == QUEUED_REQUESTS_KEY:
    # Sum queued requests across handle metrics.
    merged.data[QUEUED_REQUESTS_KEY][-1].value += timeseries[-1].value
elif key not in merged.data or (
    timeseries
    and (
        not merged.data[key]
        or timeseries[-1].timestamp > merged.data[key][-1].timestamp
    )
):
    # Replace if not present or if newer datapoints are available
    merged.data[key] = timeseries.copy()
```
finding it hard to follow this logic. can you deconstruct the expression for better readability?
each timeseries here is the number of requests (over e.g. the past 30 seconds) that handle X sends to replica Y, right? are we choosing only the data from one out of all handles here, instead of aggregating across handles?
We're going to have a lot of handle_metrics or replica_metrics ending up in the controller. We will store all of these as raw InMemoryMetricsStore objects. Each key stores one timeseries. Assuming that:
- Different metrics_stores might contain the same key.
- All keys map to one unique replica.

This function allows us to feed all the metrics_stores in, and get only the latest metrics for each key. So if a key is found, we overwrite its data if the new timeseries is newer. So this doesn't aggregate, it just consolidates the data and returns a single object representing the global state of all metrics_stores.
I'll break out that large conditional for readability, and add more documentation to make this clear!
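As a sketch of what that deconstruction might look like — a hypothetical helper, written to be equivalent to the inlined conditional in the snippet above, not the actual refactor:

```python
from collections import namedtuple

TimeStampedValue = namedtuple("TimeStampedValue", ["timestamp", "value"])

def should_replace(data: dict, key: str, incoming: list) -> bool:
    """Mirror of the inlined conditional: replace when the key is absent,
    or when the incoming timeseries is non-empty and either the existing
    series is empty or the incoming one ends with a newer datapoint."""
    if key not in data:
        return True
    if not incoming:
        return False
    existing = data[key]
    if not existing:
        return True
    return incoming[-1].timestamp > existing[-1].timestamp
```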
@Stack-Attack I'm not convinced that's what we want. If we have 2 handles, A and B, sending requests to a single replica X, then handle A can have time series (3, 4, 5), and handle B can have time series (7, 8, 9). This means that handle A most recently recorded 5 requests that were sent to replica X, and handle B recorded 9 requests that were sent to replica X. This means that 14 requests in total are executing on replica X.
However this code seems to only be taking either 5 or 9, instead of 14.
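The concern can be shown numerically with the toy numbers from this comment (`latest_running` is a hypothetical helper, not project code):

```python
# Latest running-request counts each handle reported for replica X
# (toy numbers from the comment above, not project data structures).
handle_a = [3, 4, 5]  # handle A's recent reports
handle_b = [7, 8, 9]  # handle B's recent reports

def latest_running(*series):
    # Aggregate across handles: sum each handle's most recent value.
    return sum(s[-1] for s in series)

total = latest_running(handle_a, handle_b)  # 5 + 9 = 14
# Consolidation that keeps only the newest series would report 5 or 9
# instead, undercounting replica X's load.
```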
Alternatively, we could use the replica-level metric collection for running requests, and only collect the queued requests from handles? If custom metrics will be recorded at the replica anyway, this might be better? @abrarsheikh
If I recall correctly, one reason metrics collection was moved to handles was to strictly guarantee that replicas do not exceed their max_running limit. Handles currently report the number of requests they send to each replica, and this is the metric source for running_requests.
To support custom metrics in the future, however, we are going to need to send metrics from the replica-level right? So it might make sense to unify the metrics collection back at the replica, and send only queue length from the handles.
I think this is the only major blocker. Since either solution above has pros and cons, I think I should leave it to maintainers to decide on the design.
I think @Stack-Attack, your proposal to send queued requests from the handle and the rest of the metrics from replicas makes sense to me.
> we'll have to apply a windowed sum across all replica metrics?

that is correct
Changed the consolidate function to a merge function using a time window. In practice this window will be metrics_interval_s, which will be sent from the controller.
Error in rebasing on master. Fixing now.
Signed-off-by: Kyle Robinson <[email protected]>
zcin left a comment
CI is failing because the example you added in the docstring for _aggregate_reduce is tested and failing: https://buildkite.com/ray-project/microcheck/builds/23995#0198e5ec-e7b9-49a2-b299-2226bf66710e/189-869
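For context on why a docstring example can fail CI: examples in docstrings are executed by the doctest machinery, and the expected output must match exactly. A generic illustration (the function `mean_of` is invented here, not the actual `_aggregate_reduce`):

```python
import doctest

def mean_of(values):
    """Return the arithmetic mean of values.

    >>> mean_of([1, 2, 3])
    2.0
    """
    return sum(values) / len(values)

# doctest executes the example above and compares the printed result
# against the expected output literally; writing `2` instead of `2.0`
# would be enough to fail the check.
results = doctest.testmod()
```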
```python
else:
    merged.data[key] = timeseries.copy()

return merged
```
OK, let's table this for when you raise the next PR!
```python
assert merged["m1"] == [
    TimeStampedValue(0, 101),
    TimeStampedValue(2, 1),
    TimeStampedValue(4, 2),
]
```
why is there data at timestamp=4? I don't see data added for m1 at timestamp=4
The windows can shift based on ordering. The only way to avoid this would be to compute the global start_timestamp before any merges.

```
merge(s1, s2, window=2):
  t=1 => window [0, 2) => sum(1) => TimeStampedValue(1, 1)
  t=3 => window [2, 4) => sum(2) => TimeStampedValue(3, 2)
  result = (1, 1), (3, 2)

merge(result, s4, window=2):
  t=0 => window [-1, 1) => sum(100) => TimeStampedValue(0, 100)
  t=2 => window [1, 3)  => sum(1)   => TimeStampedValue(2, 1)
  t=4 => window [3, 5)  => sum(2)   => TimeStampedValue(4, 2)
  result = (0, 100), (2, 1), (4, 2)
```
Note, this was still passing incorrectly due to:
```python
@dataclass(order=True)
class TimeStampedValue:
    timestamp: float
    value: float = field(compare=False)
```
I added a comparison to the tests to force check the value as well. I think we should remove the compare=False, but it's not required.
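The effect described (assertions passing despite wrong values) can be reproduced directly; with `compare=False`, the generated `__eq__` ignores `value`:

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class TimeStampedValue:
    timestamp: float
    value: float = field(compare=False)

# Equality (and ordering) only consider `timestamp`, so two points with
# the same timestamp but different values still compare equal -- which
# is why list-equality asserts in tests can pass incorrectly.
a = TimeStampedValue(0.0, 100.0)
b = TimeStampedValue(0.0, 101.0)
same = (a == b)  # True despite the differing values
```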
@Stack-Attack Hmm it seems pretty strange that the windows can shift. can this be fixed if we instead key windows by the start of the window instead of using `ts_center = start + w * window_s + (window_s / 2.0)`? aka directly use `start + w * window_s`?
@Stack-Attack actually I'm not sure if that's the exact issue. We just need the windows to be consistent - aka if on the previous merge the windows were (0,2), (2,4) etc, it shouldn't shift to (1,3),(3,5) etc at the next merge. How about we make sure that the windows are always multiples of window_s, aka it's always (0,window_s), (window_s,2*window_s)?
@zcin Good call! Modified to snap start to window_s grid. I maintained the window_s / 2.0 as it should still be helpful when window_s matches the polling rate of the metrics.
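A minimal sketch of the snap-to-grid bucketing agreed on here. Assumptions: the `window_s` semantics and the center-of-window timestamp convention follow this thread, and `merge_timeseries` is a hypothetical stand-in for the real merge function, not the PR's implementation:

```python
import math
from dataclasses import dataclass, field

@dataclass(order=True)
class TimeStampedValue:
    timestamp: float
    value: float = field(compare=False)

def merge_timeseries(series_list, window_s):
    """Sum datapoints that fall into the same fixed window. Window
    boundaries are snapped to multiples of window_s, so repeated merges
    bucket into the same windows regardless of input order."""
    if window_s <= 0:
        raise ValueError(f"window_s must be positive, got {window_s}")
    buckets = {}
    for series in series_list:
        for point in series:
            # Snap to the window_s grid: index of the window containing ts.
            w = math.floor(point.timestamp / window_s)
            buckets[w] = buckets.get(w, 0.0) + point.value
    # Report each bucket at its window center, per the convention above.
    return [
        TimeStampedValue(w * window_s + window_s / 2.0, v)
        for w, v in sorted(buckets.items())
    ]
```

Because windows are anchored at multiples of `window_s`, merging an already-merged result with a later series lands the same timestamps in the same buckets, avoiding the shifting-window artifact shown earlier in the thread.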
Signed-off-by: Kyle Robinson <[email protected]>
@Stack-Attack could you please address @zcin's comment? I think we are very close to closing this.
Signed-off-by: Kyle Robinson <[email protected]>
Force-pushed from 3963cdb to 881edc2
Signed-off-by: Kyle Robinson <[email protected]>
Signed-off-by: abrar <[email protected]>
abrarsheikh left a comment
I made 3 changes:
- added validation for window_s
- applied the bucketing logic even when a single timeseries is passed
- added more unit tests
zcin left a comment
LGTM with one more comment! I think we're good to merge after this.
Signed-off-by: abrar <[email protected]>
Why are these changes needed?
These changes modify the autoscaler metrics collection and aggregation functions in preparation for global aggregation in the controller.
Related issue number
Partial for #46497
Required for #41135 #51905
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/

Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(