Description
Pipeline Description
Similar pipeline to #54433
- We have an I/O stage which downloads from the internet (~100 seconds)
- We have an iterate stage (essentially it unzips the file and parses it into a pd.DataFrame; ~30 seconds)
- We have an extract stage that performs some very expensive compute (~700 seconds)
- We have a write stage
Between each of the stages we pass a Python object.
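For context, here is a minimal sketch of the pipeline shape (assuming Ray Data's map_batches API; the URLs, stage function names, and placeholder bodies are illustrative, not our actual code):

```python
# Sketch of the pipeline as Ray Data map_batches stages.
# All names/URLs here are hypothetical placeholders.
import pandas as pd
import ray

urls = ["https://example.com/archive-0.zip", "https://example.com/archive-1.zip"]


def download(batch: pd.DataFrame) -> pd.DataFrame:
    # I/O stage (~100 s): fetch each archive from the internet.
    ...
    return batch


def iterate(batch: pd.DataFrame) -> pd.DataFrame:
    # Iterate stage (~30 s): unzip the archive and parse it into a pd.DataFrame.
    ...
    return batch


def extract(batch: pd.DataFrame) -> pd.DataFrame:
    # Extract stage (~700 s): the expensive compute.
    ...
    return batch


ds = (
    ray.data.from_items([{"url": u} for u in urls])
    .map_batches(download, batch_format="pandas")
    .map_batches(iterate, batch_format="pandas")
    .map_batches(extract, batch_format="pandas")
)
ds.write_parquet("/tmp/output")  # write stage
```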
Setup
We can run this "pipeline" either as tasks or as actors.
- When using tasks, we don't specify concurrency. The runtime is 2x faster than with actors.
- When using actors, we specify concurrency=(1, 16) (a rough sketch follows this list).
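The task variant corresponds to the sketch above (plain functions, no concurrency argument, so the map stages can be fused). For the actor variant, here is a rough sketch of what we mean by (1, 16), again with hypothetical callable-class names standing in for our real stages:

```python
# Actor variant sketch: each stage is a callable class, so Ray Data runs it in
# an autoscaling actor pool of 1 to 16 actors (concurrency=(1, 16)).
# Class names, URLs, and bodies are illustrative placeholders.
import pandas as pd
import ray


class Download:
    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # I/O stage (~100 s).
        ...
        return batch


class Iterate:
    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # Unzip + parse stage (~30 s).
        ...
        return batch


class DocumentExtract:
    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        # Expensive compute stage (~700 s).
        ...
        return batch


urls = ["https://example.com/archive-0.zip", "https://example.com/archive-1.zip"]
ds = (
    ray.data.from_items([{"url": u} for u in urls])
    .map_batches(Download, concurrency=(1, 16), batch_format="pandas")
    .map_batches(Iterate, concurrency=(1, 16), batch_format="pandas")
    .map_batches(DocumentExtract, concurrency=(1, 16), batch_format="pandas")
)
ds.write_parquet("/tmp/output_actors")
```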
Problem
We have 16 CPUs and ~4 stages.
I only see ~6 CPUs being used instead of all 16. Every stage has concurrency=(1, 16), yet none of the stages runs at peak capacity.
Given the skew in time taken per stage, I'd expect the slowest stage to be running at peak capacity, but I'm interested in learning why this is or isn't the case.
Screenshots
In the following screenshot, notice that Iterate has already produced 25 tasks and so has naturally been downscaled to 1 actor. However, DocumentExtract (our slowest stage) is only at 4 actors. I'd expect that at this point, since Extraction has a lot of pending work in its queue, it would be scaled up to max capacity, but it wasn't.
After some time, as the next screenshot shows, Iterate ends up producing all 48 tasks (even with its single actor). However, DocumentExtract never exceeded 4 actors. (At the time of this screenshot only 4 tasks are left, so 4 actors make sense, but the stage never went above 4 actors at any point during the job.)
Question
- Since Iterate has produced so much "output" that is sitting in memory, wouldn't we want to scale up the (Iterate+1)'th stage so it can start popping from the queue?
- There is a 2x perf difference when comparing tasks vs actors. I understand it's not a fair comparison since in the task world Iterate/Extract/Write are all fused, but even then the workload distribution seems off.