Conversation

OlivierDehaene (Contributor)

No description provided.

njhill (Contributor) commented Apr 26, 2023

@OlivierDehaene I didn't get a chance yet to respond to #226, will do soon.

But I'm not sure about the change in this PR. Specifically, I don't think it makes sense for the batch inclusion decision to only take into account the number of tokens generated up until the first request completes.

Yes, the total token count will shrink at that point, but afterwards it can still grow arbitrarily large without any new requests being added, depending on how many tokens the other requests in the batch go on to generate.
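As a rough illustration of why that bound doesn't hold (made-up numbers, nothing to do with the actual scheduler):

```python
# Hypothetical batch: A finishes quickly, B and C keep generating.
requests = [
    {"name": "A", "input_len": 10, "max_new_tokens": 5},
    {"name": "B", "input_len": 10, "max_new_tokens": 500},
    {"name": "C", "input_len": 10, "max_new_tokens": 500},
]

active = [{**r, "generated": 0} for r in requests]
for step in range(1, 501):
    for r in active:
        r["generated"] += 1
    # Finished requests leave the batch: the total shrinks here...
    active = [r for r in active if r["generated"] < r["max_new_tokens"]]
    total_tokens = sum(r["input_len"] + r["generated"] for r in active)
    if step in (4, 5, 250, 499):
        print(step, total_tokens)  # 42, 30, 520, 1018

# ...so the total dips when A completes (step 5) but then keeps growing far
# past that value, even though no new request was added.
```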

Sorry if I've misunderstood something, which is very possible!

Narsil (Contributor) commented Apr 26, 2023

You're correct. We discussed internally and decided to postpone for now.

The core issue is that we're running with low free RAM and trying to squeeze things a bit too much. We're going to find a different route where we have more free space.
What you describe is indeed a very real potential issue (everything should be recalculated after the batch is filtered, but that still opens up potential issues).

njhill (Contributor) commented Apr 26, 2023

@Narsil that's part of the reason for the complexity of my original impl, see the equivalent analysis here. You can use a cruder upper bound for the analysis which does make the logic simpler (as is done in the current main branch impl), but pay for it in reduced utilization.

My PR only covers the flash attention (non-padded) case; give me a bit of time today and I will push the changes to make it generic (I have each impl in different branches right now, but the delta is relatively minor).

It would be good to talk it through with you guys if you are interested / have time!

Narsil (Contributor) commented Apr 26, 2023

but pay for it in reduced utilization.

If I'm not mistaken, your original impl does allow request reordering, correct? IMO this can be OK in some circumstances, but is probably not ideal in others.

  • If there's always a smaller request skipping ahead of you in the queue, you could end up never being served.
  • Also, in terms of fairness, I'd say keeping request order is important to give the "capped" latency promise: your request will be processed in bounded time or be rejected out of the box.

In general, if a system is overloaded, it's not going to behave nicely. Here we were attempting to work in a low free RAM environment, but the truth is that we are low on RAM, so we're capped in the amount of parallelism we can use. Fixing this core issue in our hardware is the only viable solution IMO (since it's doable).

njhill (Contributor) commented Apr 26, 2023

@Narsil actually the reordering is an additional optimization, I was referring more to the batch admission logic.

The logic I linked to does a "projection" for each possible next request to add to the batch, to determine whether its inclusion would result in the configured max batch "weight" (e.g. total token count) being exceeded immediately or at any point in the future. I.e. it segments the future time-steps based on the worst-case number of tokens that each request will generate, and ensures that the batch shape at the end of each segment fits.
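Roughly, the projection looks like this (a minimal sketch with made-up names; it ignores tokens already generated by in-flight batch entries and the padded case, both of which the real logic handles):

```python
from dataclasses import dataclass

@dataclass
class Entry:
    input_length: int
    max_new_tokens: int  # worst case: the request generates all of these

def fits(batch: list[Entry], candidate: Entry, max_batch_weight: int) -> bool:
    """Would adding `candidate` ever push the batch weight over the limit?

    Future time-steps are segmented by the worst-case step at which each
    request completes; since the weight only grows within a segment, it is
    enough to check it at the end of each segment.
    """
    entries = batch + [candidate]
    boundaries = sorted({e.max_new_tokens for e in entries})
    for step in boundaries:
        # Weight at `step`: requests that haven't completed before `step`
        # are still in the batch and have generated `step` tokens so far.
        weight = sum(
            e.input_length + step
            for e in entries
            if e.max_new_tokens >= step
        )
        if weight > max_batch_weight:
            return False
    return True
```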

The reordering is in addition to this. If a particular next request doesn't fit according to the above logic, it will walk back up the queue to find other requests that do fit, but will only consider requests that arrived within a configurable time threshold of the request at the front of the queue (e.g. 1 second). This hopefully avoids the concerns that you described w.r.t. fairness and tail latencies.
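And the queue walk with the arrival-time threshold, continuing the sketch above (assuming each entry also carries an `arrival_time` field, and simplifying the "walk back up" to a single forward pass; `time_threshold_s` is a made-up name for the configurable threshold):

```python
def next_batch_entries(queue, current_batch, max_batch_weight, time_threshold_s=1.0):
    """Admit queued entries front-to-back while they fit (uses `fits` above).

    Entries that arrived more than `time_threshold_s` after the entry at the
    front of the queue are never considered, so a request can only be skipped
    by requests that arrived at roughly the same time as it did.
    """
    admitted = []
    if not queue:
        return admitted
    front_arrival = queue[0].arrival_time
    for entry in list(queue):
        if entry.arrival_time - front_arrival > time_threshold_s:
            break  # everything past this point arrived too long after the front
        if fits(current_batch + admitted, entry, max_batch_weight):
            admitted.append(entry)
            queue.remove(entry)
    return admitted
```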

njhill (Contributor) commented Apr 26, 2023

The goal of this is to maximize the throughput for a given GPU. Beyond that you can always scale up/out via a bigger GPU or multiple replicas.

Fixing this core issue in our hardware is the only viable solution IMO (since it's doable)

Are you referring to using larger GPUs / more of them?

njhill (Contributor) commented Apr 26, 2023

I should also mention that the "weight" is calculated differently for the non-flash case since the memory requirement is quadratic in sequence length... so I use batch_size * seq_len^2.
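For reference, the two weight formulas look roughly like this (sketch only; the function and field names are made up, and it assumes seq_len means the padded, i.e. maximum, sequence length in the batch):

```python
def batch_weight(entries, flash_attention: bool) -> int:
    """Rough 'weight' of a batch in its current state.

    Flash attention (non-padded): memory scales with the total token count.
    Non-flash (padded): attention memory is quadratic in the padded sequence
    length, so use batch_size * max_seq_len ** 2 instead.
    """
    if flash_attention:
        return sum(e.input_length + e.generated_tokens for e in entries)
    max_seq_len = max(e.input_length + e.generated_tokens for e in entries)
    return len(entries) * max_seq_len ** 2
```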

OlivierDehaene (Contributor, Author)

[...] will only consider requests that arrived within a configurable time threshold of the request at the front of the queue (e.g. 1 second).

Yes, but in chat apps you always run with very large max_new_tokens. Then if you jump in front of another request, it could mean a latency increase of tens of seconds for the request being jumped.

Narsil (Contributor) commented Apr 26, 2023

Are you referring to using larger GPUs / more of them?

Pretty much. Just a generally better balance of RAM, FLOPS, $ :) (Here we're short on RAM for the requests since the model is already taking up too much, leaving not enough room to feed the compute efficiently, basically.)

njhill (Contributor) commented Apr 26, 2023

Yes, but in chat apps you always run with very large max_new_tokens. Then if you jump in front of another request, it could mean a latency increase of tens of seconds for the request being jumped.

But whatever resulting delay there is would have happened anyhow if the larger request had arrived 1 sec later. The queue-skipping can easily be disabled / doesn't need to be included. The motivation was for cases where there's a large request that needs some existing large requests in the batch to complete to make room, but in the meantime shorter requests could be processed without actually affecting how long the larger one has to wait.

I did actually think about making that part more sophisticated, to allow queue-jumping only in cases where the timestep at which the skipped request would be introduced isn't affected (or is perhaps additionally delayed by at most some maximum time like 1 sec). But the arrival time threshold was simpler for a first impl.
