
Add delay logic between Rep's retries to download droplets #628

@vlast3k

Description


Summary

When Rep is downloading droplets from a blobstore, the hyperscaler may in some cases apply throttling. For example, Azure has a bandwidth limit of roughly 100-150 Gbps, and as soon as this threshold is reached some HTTP requests are terminated with "503 ServerBusy" so that the maximum bandwidth is not exceeded. For some reason Azure does not simply reduce the download speed of all connections, but instead terminates some of them. It also does not respond with 429 (the standard throttling status) but with 503.

We tried to work around this by decreasing diego.executor.max_concurrent_downloads to 2, but there was no improvement (we are updating ~45 cells in parallel). For now we will decrease the max_in_flight property, but this is only a temporary solution and will increase the update time.

This is why we think it would be good to change the code that handles retries in case of failure. It seems to be here:
https://github.com/cloudfoundry/cacheddownloader/blob/master/downloader.go#L213-L215
and add some delay in the case of 429 (possibly also honoring the "Retry-After" header) and 503 ServerBusy (specifically for Azure).

We should discuss to what extent this should be configurable:

  • a plain on/off switch to enable/disable the functionality
  • or a configurable delay, possibly with some randomness
  • or just a preset delay of e.g. 5 seconds when those errors appear

Diego repo

https://github.com/cloudfoundry/cacheddownloader

Describe alternatives you've considered (optional)

  • decrease diego.executor.max_concurrent_downloads from 5 to 2 - for some reason this did not help. Our assumption is that Azure sums the downloaded data over a time window, so the limit is reached regardless of the number of threads
  • decrease max_in_flight - this will be our current workaround, though it will increase the update time
  • use bigger VMs for the Diego cells, so that we update fewer of them in parallel - this is something we are currently working on, but it is also a temporary solution

