Add new autoscaling parameter aggregation function #51905
              Conversation
| Some notes for @zcin. I originally worked from your prototype, however, there were a few concerns which led to this simpler implementation.
However, given all this, I'm actually more confident in my previous implementation now than before. I'll add an approximate sketch below. It represents one theoretical scaling decision over some time. Currently we only have the blue option. This PR gives us the red option. My previous PR gives us the target. I think the target is achievable with this PR and very careful smoothing parameters, but it doesn't make the average scaling case any better. The average aggregation result will still discard all previous data over the calculation window, with no way to use that information for better scaling. Either way, I hope we can squeeze this into an upcoming release :). |
Force-pushed 6d5d722 to 4ff155f (compare)
We can abstract these methods into one with a function parameter, but I don't expect there will be more added.
For some reason (legacy?) the replica count was based on replica-level metrics in the latter half of this block. Since this PR splits the aggregation function for replica and handle metrics, we need to modify this so that we always use handle metrics for autoscaling decisions.
The legacy autoscaling implementation was based on request metrics collected from replicas, which are stored in _replica_requests. I think we want to keep that because if RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE is turned off, there won't be any data in _handle_requests.
Aha. Missed that case. Based on this, the feature would only be possible when RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=True. I'll re-add the fallback logic, add a warning message, and update the docs?
Force-pushed 4ff155f to b198139 (compare)
    | @zcin Any timeline on moving forward here? Either through this or the previous method? | 
| @Stack-Attack I will take a look at this today! | 
Hi @Stack-Attack, I understand your concern, but I think the current implementation may not be calculating exactly what you want it to because the aggregation is happening locally. Let me know what you think.
        
          
python/ray/serve/_private/router.py (outdated)
    raise ValueError(
        f"Unsupported aggregation function: "
        f"{self.autoscaling_config.aggregation_function}"
    )
If there are 10 handles, since we are aggregating "locally", then the autoscaling decision would be summing up:
- max number of requests for handle 1 at T1
- max number of requests for handle 2 at T2
- ...
- max number of requests for handle 10 at T10
Is my understanding correct? So this is not calculating the maximum total number of requests over the look-back period; instead it's summing up a bunch of local maxes that occurred at different times?
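A small numeric illustration of the concern, using hypothetical numbers (not taken from the PR):

    # Hypothetical running-request samples reported by two handles over the same
    # look-back window; timestamps are not aligned between the handles.
    handle_1 = [(1.0, 5), (2.0, 9), (3.0, 2)]  # (timestamp, running_requests)
    handle_2 = [(1.5, 8), (2.5, 3), (3.5, 7)]

    # Summing each handle's local max (what local "max" aggregation produces):
    sum_of_local_maxes = max(v for _, v in handle_1) + max(v for _, v in handle_2)
    print(sum_of_local_maxes)  # 17

    # The true peak of the cluster-wide total is generally lower than 17, because
    # the two local maxes occurred at different times (t=2.0 vs t=1.5); computing
    # it exactly would require aligning the timestamps first.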
@zcin That's correct.
1. AFAIK there is only one Serve controller. In large clusters with large deployments, aggregating all raw metrics in the controller seems inefficient.
2. In any case, the timestamps sent from the handles are not in sync with each other, so we would need to do some more complex windowing on the raw replica metrics if we wanted to align them. Otherwise, the controller sum over the look_back_period will be identical to the sum of each individual max/min/avg used in this PR.
3. All this said, if this is the chosen method, I think this is a close enough approximation to the real total_max_requests, given power-of-two routing should smooth things out.

"instead it's summing up a bunch of local maxes that occurred at different times?"

The different times is the key issue. I don't see an easy way to align timestamps in the controller (2), and even if we did, the processing would not scale out horizontally (1), so the approximation using the latest calculation from each handle (this PR) seemed like the best balance (3).
If I missed something important let me know. I dug through the code, but you've got much deeper insight, I'm sure.
@zcin Gentle bump. If we choose the approximation, then this should be good to go with little change. Want to confirm before finishing up.
| This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
| @zcin Bumping to remove stale label | 
| This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
| hey @Stack-Attack , can we connect over ray slack (Abrar Sheikh) and come to a conclusion here? I am motivated to get this feature in, but there are a few considerations which are better discussed in person for faster iteration. Let me know | 
| @abrarsheikh Sure! Sorry for the delay; I was away for a few weeks. I'll get set up on the Ray slack and reach out this week. Looking forward to finally fixing this issue. | 
| Spoke with @abrarsheikh today. He suggested we aggregate the raw metrics in controller, then align/window/reduce there. Since this does not scale out horizontally, we will need to add tests ensuring the controller can handle arbitrarily large clusters. I plan to complete this next week! | 
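For concreteness, a rough sketch of the kind of controller-side align/window/reduce step being described (illustrative only: bucket_and_reduce and the bucketing scheme are hypothetical names, not the code that was written):

    from collections import defaultdict
    from typing import Callable, Dict, Iterable, List, Tuple

    def bucket_and_reduce(
        samples_per_handle: Dict[str, List[Tuple[float, float]]],
        window_s: float,
        reduce_fn: Callable[[Iterable[float]], float],
    ) -> float:
        """Align raw (timestamp, value) samples from all handles into fixed time
        buckets, sum across handles within each bucket, then reduce the
        per-bucket totals with reduce_fn (e.g. max, min, or a mean)."""
        totals_per_bucket: Dict[int, float] = defaultdict(float)
        for samples in samples_per_handle.values():
            # Keep only the latest sample each handle reported per bucket
            # (assumes samples are sorted by timestamp).
            latest_in_bucket: Dict[int, float] = {}
            for ts, value in samples:
                latest_in_bucket[int(ts // window_s)] = value
            for bucket, value in latest_in_bucket.items():
                totals_per_bucket[bucket] += value
        if not totals_per_bucket:
            return 0.0
        return reduce_fn(totals_per_bucket.values())

    # Peak of the cluster-wide total over the look-back period:
    peak = bucket_and_reduce(
        {"h1": [(1.0, 5), (2.0, 9)], "h2": [(1.5, 8), (2.5, 3)]},
        window_s=1.0,
        reduce_fn=max,
    )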
Force-pushed b496dc8 to 968a8c5 (compare)
    | Let me know if I need any additional tests/benchmarks. | 
Thanks @Stack-Attack for the contribution!
One request that I think will help us land your contribution faster: would you be able to split off your changes to metrics_utils and test_metrics_utils into a separate PR? That PR would be (1), and this PR would be (2), depending on PR (1). Since this change is quite big and touches a lot of components, this will help your reviewers focus on the main change per PR and speed up the review.
    - self, replica_id: str, window_avg: Optional[float], send_timestamp: float
    + self,
    + replica_id: str,
    + metrics_store: InMemoryMetricsStore,
It's a bit strange that we're passing an InMemoryMetricsStore object between separate processes. Can we send the underlying dict of data instead of an InMemoryMetricsStore object?
This is the biggest change I will need clarity on. Of course, we can pass the data dict, but then we would have to move all aggregation and processing code into their own utility functions. I much prefer the idea of managing autoscaling metrics within their own class where we can ensure certain constraints, rather than passing raw dictionaries. Especially as a loss of the constraints on these metrics (i.e., pruning/sorting) could introduce nasty bugs in the future.
Please advise.
You can reconstruct the InMemoryMetricsStore object on the controller side. It seems like we don't allow that right now; you can probably add a class method that initializes InMemoryMetricsStore from a dictionary of data, something like def from_data(cls, data).
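A minimal sketch of such a from_data helper, assuming InMemoryMetricsStore keeps its timeseries in a plain self.data dict (as the _aggregate_reduce snippet quoted later in this thread suggests); the exact attribute layout may differ:

    from collections import defaultdict
    from typing import DefaultDict, Dict, List

    class InMemoryMetricsStore:
        def __init__(self):
            # Mapping from metric key (e.g. replica id) to a list of
            # timestamped data points.
            self.data: DefaultDict[str, List] = defaultdict(list)

        @classmethod
        def from_data(cls, data: Dict[str, List]) -> "InMemoryMetricsStore":
            """Rebuild a store (e.g. on the controller) from the raw dict of
            timeseries that was shipped over the wire by a handle."""
            store = cls()
            store.data.update(data)
            return store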
I'll do the refactor for that, but can I ask the reasoning? Is pickling the class significantly slower than sending the dict and re-initializing?
I am actually not sure if there's overhead for pickling the class. I just think it's strange to send an "in memory store" object between distinct processes. In this case the InMemoryMetricsStore object is just a wrapper that helps keep track of a bunch of utility functions anyway; it's not actually storing anything, so I think it's much cleaner to just send the data and reconstruct the wrapper.
Makes sense to me. I'll modify as suggested.
Modified as suggested. We could also make the metrics_utils aggregation functions @staticmethods and use them on the raw data, as the object initialization now seems a little redundant.
    - if handle_metric.total_requests > 0:
    -     logger.debug(
    -         f"Dropping metrics for handle '{handle_id}' because the Serve "
    -         f"actor it was on ({handle_metric.actor_id}) is no longer "
    -         f"alive. It had {handle_metric.total_requests} ongoing requests"
    -     )
    + logger.debug(
    +     f"Dropping metrics for handle '{handle_id}' because the Serve "
    +     f"actor it was on ({handle_metric.actor_id}) is no longer "
    +     f"alive."
    + )
why is this not gated by the if statement anymore? this can print a lot of logs that can confuse the user
Previously, each handle_metric stored its total_requests in memory, since it was aggregated at the handle. The new method lazily computes the total_requests when needed, so this if statement would trigger a lot of compute for a simple debug log.
I would still like to keep it gated because the UX here is not ideal -- previously, when this was not gated, it would make users think something was wrong, because it can happen pretty often for handles that have already zeroed out their metrics. Is there a way we can keep it gated without adding too much overhead?
I'll think of an efficient way and re-implement with the other suggestions!
Modified to approximate the totals using only the most recent values. Since this is only used for debug logging, it should be more than accurate enough.
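Roughly, the approximation described here amounts to something like the following (illustrative sketch only; the helper name is hypothetical and the attribute layout assumes the per-key timeseries quoted elsewhere in this review):

    def approx_total_requests(metrics_store) -> float:
        """Cheap approximation of a handle's total ongoing requests using only
        the most recent data point per key; accurate enough to gate a debug log
        without re-aggregating the full window."""
        return sum(
            series[-1].value for series in metrics_store.data.values() if series
        )

    # e.g. gate the log:
    # if approx_total_requests(handle_metric.metrics_store) > 0:
    #     logger.debug(...)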
    logger.info(
        f"Dropping stale metrics for handle '{handle_id}' {actor_info}"
        f"because no update was received for {timeout_s:.1f}s. "
    )
same here
    - for id in self._running_replicas:
    -     if id in self._replica_requests:
    -         total_requests += self._replica_requests[id].running_requests
    + def get_current_requests_per_replica(self) -> float:
What is this method used for?
The plan was to modify autoscaling to operate on a per_replica_aggregate value, however the current policy and testing structure require a total_running_replicas value. I decided to maintain this endpoint for visibility. As we move towards allowing custom autoscaling metrics and policies, I think it is worth maintaining.
Let's add it when we need it, it's easy for dead code to be introduced this way.
Removed
    if aggregate_function == "mean":
        total_requests, report_count = merged_metrics.aggregate_avg(
            self._running_replicas
        ) or (0.0, 0)
    elif aggregate_function == "max":
        total_requests, report_count = merged_metrics.aggregate_max(
            self._running_replicas
        ) or (0.0, 0)
    elif aggregate_function == "min":
        total_requests, report_count = merged_metrics.aggregate_min(
            self._running_replicas
        ) or (0.0, 0)
Nit: should we just have a single aggregate function that takes in aggregate_function? That way we push this logic down to be the metrics store's responsibility.
Thought about this. Either way is fine with me. I personally just prefer not passing functions or arbitrary function names as strings.
But we are already passing aggregate_function as a string to this function? (And to clarify: I didn't mean to imply we should pass a function; I meant to pass the aggregate_function variable, which is a string, down to the metrics store.)
If you prefer not to work with aggregate_function == "mean" etc., how about changing it to an enum?
The aggregation function is pulled from the autoscaling_config directly since it's in scope here. So it's a string, but it's strictly defined by the config class. If you prefer an enum, or moving this into the metrics store, I can adjust either way if you feel one is optimal.
Gotcha. How about introducing an enum and having the config take Union[str, enum], similar to here, with a similar validator. Then when you're passing it down to metrics store, you can pass the enum.
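A rough sketch of that suggestion (field and enum names are illustrative; the actual config class and validator in Ray Serve may differ):

    from enum import Enum
    from typing import Union

    from pydantic import BaseModel, validator

    class AggregationFunction(str, Enum):
        MEAN = "mean"
        MAX = "max"
        MIN = "min"

    class AutoscalingConfig(BaseModel):
        aggregation_function: Union[str, AggregationFunction] = AggregationFunction.MEAN

        @validator("aggregation_function")
        def _validate_aggregation_function(cls, v):
            # Accept plain strings like "mean" and normalize them to the enum,
            # raising a clear error for unsupported values.
            return AggregationFunction(v)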
Modified as suggested. Aggregate now lives in metrics_utils, as a simple wrapper with an enum.
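That is, the helper in metrics_utils ends up looking roughly like this (a sketch of the shape, not the exact code that landed; it reuses the aggregate_avg/max/min helpers quoted above and the enum sketched earlier):

    def aggregate(merged_metrics, keys, aggregation_function):
        """Reduce the merged per-replica metrics with the configured function."""
        if aggregation_function == AggregationFunction.MEAN:
            return merged_metrics.aggregate_avg(keys) or (0.0, 0)
        elif aggregation_function == AggregationFunction.MAX:
            return merged_metrics.aggregate_max(keys) or (0.0, 0)
        elif aggregation_function == AggregationFunction.MIN:
            return merged_metrics.aggregate_min(keys) or (0.0, 0)
        raise ValueError(f"Unsupported aggregation function: {aggregation_function}")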
| print("TOTAL: ", deployment_state._replicas.count()) | ||
| print("STATES", replicas) | 
remove?
        to ensure enclusivity of the metrics.
        """
        aggregated_requests, report_count = self._get_request_count()
        return aggregated_requests * max(1, report_count)
What exactly is report_count? Is this the number of replicas for the deployment, the number of handles that hold request metrics for this deployment, the number of data points per handle, the number of data points per replica?
Without knowing this, it's unclear whether the calculation aggregated_requests * max(1, report_count) makes sense.
report_count is the number of replicas which have reported valid data to a valid handle_metric. This is not the same as running_replicas as there are cases where running_replicas have not yet reported metrics to the handle. So report_count ensures that the total value is calculated based only on the number of reports collected.
I'll make this more clear.
Is this true? In _aggregate_reduce, you define:
        values = (
            timeseries.value for key in keys for timeseries in self.data.get(key, ())
        )
And then return the report_count as len(values). Doesn't that mean report_count is something along the lines of (#replicas) * (#data pts per replica)? Let me know if I'm understanding correctly.
@zcin Correct, it was implemented incorrectly here, but the tests didn't catch it. I fixed this earlier today in the metrics PR, so it will be resolved by that. As soon as that's approved, I'll rebase this PR and resolve the metrics-related issues here.
…ler. (#55568) These changes modify the autoscaler metrics collection and aggregation functions in preparation for global aggregation in the controller. Related issues: partial for #46497; required for #41135, #51905. Signed-off-by: Kyle Robinson <[email protected]>; Co-authored-by: Abrar Sheikh <[email protected]>, Cindy Zhang <[email protected]>, abrar <[email protected]>
Signed-off-by: Kyle <[email protected]>
Co-authored-by: Arthur Leung <[email protected]> Signed-off-by: Kyle Robinson <[email protected]>
…ity. Signed-off-by: Kyle Robinson <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]> Signed-off-by: Kyle Robinson <[email protected]>
Co-authored-by: Cindy Zhang <[email protected]> Signed-off-by: Kyle Robinson <[email protected]>
Signed-off-by: Kyle Robinson <[email protected]>
Force-pushed b8faf5f to 8da09f0 (compare)
    | Reminder of the core changes: 
 What does not change here: 
 @zcin @abrarsheikh I've rebased on master, and adapted the code from previous suggestions. Unit tests I've run have passed, and I'll watch the results of the full test suite. Tomorrow I need to add more tests for the new functionality, as well as any changes you suggest. | 
            replica_report.metrics_dict
            for replica_report in self._replica_requests.values()
        ],
        window_s=1,
This needs to be set to a variable. I would think it should be set to the same rate at which metrics are collected, not simply pushed. I didn't have time today to find it.
Signed-off-by: Kyle Robinson <[email protected]>
| Looking into failing tests and running them locally. | 
Signed-off-by: Kyle Robinson <[email protected]>
| If tests fail again, I'll split this PR one more time: have one which just moves the aggregation to the controller, and then add the new aggregation functions on top. |
| @Stack-Attack after looking at the details more closely, I think we need a gradual approach for this rollout. For that reason, I’ll take ownership of this task instead of moving forward with the current PR. I appreciate your effort here and hope you understand my reasoning. | 

Why are these changes needed?
Currently, the Serve autoscaler makes scaling decisions based only on the most recent Serve Controller computation, even if the Serve Controller has made many scaling calculations over the scaling delay period. This results in poor autoscaling when clusters use long upscale/downscale delays. This PR allows more sensitive scaling by implementing min and max aggregation functions for metric collection, in addition to the current average.
An alternate fix to this issue was proposed here.
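As a usage illustration, the new option would be set on the deployment's autoscaling config, roughly like this (the aggregation_function field comes from this PR; the other fields are existing Ray Serve autoscaling options, and the exact accepted values follow the review discussion above):

    from ray import serve
    from ray.serve.config import AutoscalingConfig

    @serve.deployment(
        autoscaling_config=AutoscalingConfig(
            min_replicas=1,
            max_replicas=10,
            upscale_delay_s=30,
            downscale_delay_s=600,
            # New in this PR: scale on the peak metric observed over the
            # look-back window instead of the average.
            aggregation_function="max",
        )
    )
    class MyModel:
        async def __call__(self, request):
            return "ok"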
Related issue number
#46497