
[data] Ray Autoscaling - Suboptimal Performance with Actors #54540

@praateekmahajan

Description


Pipeline Description

Similar pipeline to #54433

  1. We have an I/O stage that downloads from the internet (~100 seconds)
  2. We have an iterate stage (it essentially unzips the file and parses it into a pd.DataFrame; ~30 seconds)
  3. We have an extract stage that performs some very expensive compute (~700 seconds)
  4. We have a write stage

Between each of the stages we pass a Python object.
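For intuition on where capacity should go, here is a back-of-envelope calculation (a sketch only: it assumes one CPU per worker, uses the rough per-item timings above, and ignores the write stage, whose timing isn't given):

```python
# Ideal steady-state split of 16 CPUs across stages, proportional to
# per-item processing time (rough numbers from the description above).
STAGE_SECONDS = {"download": 100, "iterate": 30, "extract": 700}
TOTAL_CPUS = 16

total = sum(STAGE_SECONDS.values())  # 830 s of work per item
ideal = {name: TOTAL_CPUS * t / total for name, t in STAGE_SECONDS.items()}
for name, workers in ideal.items():
    print(f"{name}: ~{workers:.1f} workers")
# extract accounts for 700/830 ≈ 84% of per-item time, so a balanced
# pipeline would dedicate roughly 13 of the 16 CPUs to it.
```

By this rough measure, a well-balanced run should show the extract stage holding most of the cluster, which frames the observations below.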

Setup

We can run this "pipeline" either as tasks or as actors.

  1. When using tasks, we don't specify concurrency. The runtime is about 2x faster than with actors.
  2. When using actors, we specify concurrency=(1, 16).

Problem

We have 16 CPUs and 4 stages.
I only see ~6 CPUs being used instead of all 16. Every stage has concurrency=(1, 16), yet none of them runs at peak capacity.
Given the skew in time taken per stage, I'd expect the slowest stage to be running at peak capacity, but I'm interested in learning why it is or isn't.

Screenshots

If you notice in the following screenshot, Iterate has already produced 25 tasks, so it has naturally been downscaled to 1 actor. However, DocumentExtract (our slowest stage) is only at 4 actors. Since Extraction has a lot of pending work in its queue at this point, I'd expect it to be scaled up to max capacity, but it wasn't.

[Screenshot: Iterate downscaled to 1 actor after producing 25 tasks; DocumentExtract at 4 actors]

After some time, as the following screenshot shows, Iterate ends up producing all 48 tasks (even with its single actor). However, DocumentExtract never exceeded 4 actors throughout the job. (At the time of this screenshot only 4 tasks are left, so 4 actors make sense, but at no point during the job did the stage exceed 4 actors.)

[Screenshot: Iterate finished with all 48 tasks produced; DocumentExtract still at 4 actors]
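A rough makespan bound makes the cost of the 4-actor cap concrete (a sketch, treating extract as 48 equal 700-second jobs on a fixed worker pool, ignoring overlap with other stages):

```python
import math

def extract_makespan(items: int, actors: int, per_item_s: int) -> int:
    """Wall-clock time for `items` equal jobs greedily scheduled on `actors` workers."""
    return math.ceil(items / actors) * per_item_s

print(extract_makespan(48, 4, 700))   # 8400 s with the observed 4-actor cap
print(extract_makespan(48, 13, 700))  # 2800 s if extract got ~13 of the 16 CPUs
```

Under these assumptions, staying at 4 actors roughly triples the extract stage's wall-clock time versus scaling it toward the available CPUs.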

Question

  1. At any point, since Iterate has produced so much "output" that is sitting in memory, wouldn't we want to scale up the stage after Iterate so it starts popping from the queue?
  2. There is a 2x perf difference when comparing tasks vs. actors. I understand it's not a fair comparison since in the task world Iterate/Extract/Write are all fused, but even then the workload distribution seems off.
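On the 2x question, an idealized comparison (same caveats as above: equal job sizes, write stage ignored, no overlap between stages) suggests the capped extract pool alone could account for a gap of that order:

```python
import math

ITEMS, CPUS = 48, 16
PER_ITEM_S = 100 + 30 + 700  # fused download+iterate+extract, write ignored

# Tasks: the stages are fused, so each CPU runs the whole chain per item.
fused_s = math.ceil(ITEMS / CPUS) * PER_ITEM_S   # 3 waves x 830 s = 2490 s

# Actors with extract capped at 4 workers: extract alone lower-bounds the job.
capped_s = math.ceil(ITEMS / 4) * 700            # 12 waves x 700 s = 8400 s

print(fused_s, capped_s, round(capped_s / fused_s, 1))
```

The idealized ratio (~3.4x) is larger than the observed ~2x, since in practice stages overlap and the actor pipeline still makes some progress elsewhere, but it is consistent with the 4-actor cap on extract, rather than object passing between stages, being the dominant cost.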
