Description
Pipeline Description
Similar pipeline to #54433
- We have an I/O stage which downloads from the internet (~100 seconds)
- We have an iterate stage (essentially it unzips the file and parses it into a pd.DataFrame; ~30 seconds)
- We have an extract stage that performs some very expensive compute (~700 seconds)
- We have a write stage
Between each of the stages we pass a Python object.
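For context, here is a minimal sketch of the pipeline shape (assuming Ray Data's map_batches API; the URLs, stage function names, and placeholder bodies are illustrative, not our actual code):

```python
# Sketch of the pipeline as Ray Data map_batches stages.
# All names/URLs here are hypothetical placeholders.
import pandas as pd
import ray

urls = ["https://example.com/archive-0.zip", "https://example.com/archive-1.zip"]


def download(batch: pd.DataFrame) -> pd.DataFrame:
    # I/O stage (~100 s): fetch each archive from the internet.
    ...
    return batch


def iterate(batch: pd.DataFrame) -> pd.DataFrame:
    # Iterate stage (~30 s): unzip the archive and parse it into a pd.DataFrame.
    ...
    return batch


def extract(batch: pd.DataFrame) -> pd.DataFrame:
    # Extract stage (~700 s): the expensive compute.
    ...
    return batch


ds = (
    ray.data.from_items([{"url": u} for u in urls])
    .map_batches(download, batch_format="pandas")
    .map_batches(iterate, batch_format="pandas")
    .map_batches(extract, batch_format="pandas")
)
ds.write_parquet("/tmp/output")  # write stage
```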
Setup
We can run this "pipeline" either as tasks or as actors.
- When using tasks, we don't specify concurrency. The runtime is 2x faster than with actors.
- When using actors, we specify concurrency=(1, 16) (a rough sketch follows this list).
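The task variant corresponds to the sketch above (plain functions, no concurrency argument, so the map stages can be fused). For the actor variant, here is a rough sketch of what we mean by (1, 16), again with hypothetical callable-class names standing in for our real stages:

```python
# Actor variant sketch: each stage is a callable class, so Ray Data runs it in
# an autoscaling actor pool of 1 to 16 actors (concurrency=(1, 16)).
# Class names, URLs, and bodies are illustrative placeholders.
import pandas as pd
import ray


class Download:
    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # I/O stage (~100 s).
        ...
        return batch


class Iterate:
    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # Unzip + parse stage (~30 s).
        ...
        return batch


class DocumentExtract:
    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # Expensive compute stage (~700 s).
        ...
        return batch


urls = ["https://example.com/archive-0.zip", "https://example.com/archive-1.zip"]
ds = (
    ray.data.from_items([{"url": u} for u in urls])
    .map_batches(Download, concurrency=(1, 16), batch_format="pandas")
    .map_batches(Iterate, concurrency=(1, 16), batch_format="pandas")
    .map_batches(DocumentExtract, concurrency=(1, 16), batch_format="pandas")
)
ds.write_parquet("/tmp/output_actors")
```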
Problem
We have 16 CPUs and ~4 stages.
I only see ~6 CPUs being used instead of all 16. Every stage has concurrency=(1, 16), yet none of the stages runs at peak capacity.
Given the skew in time taken per stage, I'd expect the slowest stage to be running at peak capacity, but I'm interested in learning why this is or isn't the case.
Screenshots
In the following screenshot, notice that Iterate has already produced 25 tasks and so has naturally been downscaled to 1 actor. However, DocumentExtract (our slowest stage) is only at 4 actors. I'd expect that at this point, since Extraction has a lot of pending work in its queue, it would be scaled up to max capacity, but it wasn't.
After some time, as the next screenshot shows, Iterate ends up producing all 48 tasks (even with its single actor). However, DocumentExtract never exceeded 4 actors. (At the time of this screenshot only 4 tasks are left, so 4 actors make sense, but the stage never went above 4 actors at any point during the job.)
Question
- Since Iterate has produced so much "output" that is sitting in memory, wouldn't we want to scale up the (Iterate+1)'th stage so it can start popping from the queue?
- There is a 2x perf difference when comparing tasks vs actors. I understand it's not a fair comparison since in the task world Iterate/Extract/Write are all fused, but even then the workload distribution seems off.