[Serve] Calculate autoscaling decisions over whole scale_delay_s period #46497

@kyle-v6x

Description

Currently, as per replica_queue_length_autoscaling_policy, scaling decisions are only made if they are consistent over the upscale_delay_s or downscale_delay_s period. However, the final scaling decision is based only on the desired_num_replicas computed in that single call.

It would be nice to calculate the desired_num_replicas over the entire period. Even better would be selecting between min/max/average desired_num_replicas calculated over the delay period.

We can store each desired_num_replicas value in the policy_state and calculate the min/max/avg once the scaling decision is made.
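A minimal sketch of that idea, not the actual Ray Serve policy code. The function names, the "history" and "direction" keys in policy_state, and the checks_required and mode parameters are all hypothetical; checks_required stands in for the number of autoscaler checks that fit in the delay period.

```python
import math
from typing import Any, Dict, List


def aggregate_desired_replicas(history: List[int], mode: str = "max") -> int:
    """Collapse the desired_num_replicas values seen over the delay period."""
    if not history:
        raise ValueError("history must be non-empty")
    if mode == "min":
        return min(history)
    if mode == "max":
        return max(history)
    if mode == "avg":
        # Round up so the average never under-provisions.
        return math.ceil(sum(history) / len(history))
    raise ValueError(f"unknown mode: {mode!r}")


def windowed_policy_step(
    policy_state: Dict[str, Any],
    current_num_replicas: int,
    desired_num_replicas: int,
    checks_required: int,  # hypothetical: checks that fit in the delay period
    mode: str = "max",
) -> int:
    """Run one autoscaler check and return the replica count to use."""
    if desired_num_replicas == current_num_replicas:
        # No scaling pressure: forget any partially collected window.
        policy_state["direction"] = None
        policy_state["history"] = []
        return current_num_replicas

    direction = "up" if desired_num_replicas > current_num_replicas else "down"
    if policy_state.get("direction") != direction:
        # The direction flipped, so the consistency window starts over.
        policy_state["direction"] = direction
        policy_state["history"] = []

    policy_state["history"].append(desired_num_replicas)
    if len(policy_state["history"]) < checks_required:
        return current_num_replicas

    # Consistent for the whole delay period: decide over the full window.
    decision = aggregate_desired_replicas(policy_state["history"], mode)
    policy_state["direction"] = None
    policy_state["history"] = []
    return decision
```

With 20 running replicas, six required checks, and mode="max", feeding in desired values 10, 10, 15, 10, 10, 5 keeps the count at 20 for the first five checks and then scales to 15 rather than 5.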

Use case

Currently, deployments with high variability in request counts have no clear mechanism to scale appropriately. For example, if downscale_delay_s=60 and the autoscaler runs 6 checks in that period with desired_num_replicas => [10,10,15,10,10,5], then the cluster will scale to 5, because only the last value is used. Alternatively, if desired_num_replicas => [2,2,2,15,2,2], there is a case to be made that the cluster should use 15 in order to accommodate the maximum traffic.
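To make the two examples concrete, the following sketch compares the last value of each window (what the single-call behavior described above would use) with the min, max, and average over the window. The summarize helper and the window names are illustrative only, and rounding the average up is an assumption rather than something the proposal specifies.

```python
import math
from typing import Dict, List

# The two desired_num_replicas windows from the examples above.
windows: Dict[str, List[int]] = {
    "mostly_high": [10, 10, 15, 10, 10, 5],
    "single_spike": [2, 2, 2, 15, 2, 2],
}


def summarize(window: List[int]) -> Dict[str, int]:
    """Return the candidate scaling targets for one delay-period window."""
    return {
        "last": window[-1],  # value from the final check only
        "min": min(window),
        "max": max(window),
        "avg": math.ceil(sum(window) / len(window)),  # rounded up (assumption)
    }


for name, window in windows.items():
    print(name, summarize(window))
```

For the first window the last value is 5 while the average is 10 and the maximum is 15; for the second the last value is 2 while the maximum is 15, which is the gap the proposal aims to close.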

Note:
Increasing look_back_period_s slows down all scaling decisions and, through its smoothing effect, changes the entire balance of ongoing_requests.
upscaling_factor and downscaling_factor can help, but in the example cases above they still completely miss the correct autoscaling decision.

Metadata

Labels

P2: Important issue, but not time-critical
enhancement: Request for new feature and/or capability
serve: Ray Serve Related Issue
