Terry-Uv
Hi there. This is my first PR on GitHub, so feedback is very welcome. Thanks in advance for your review!

This PR fixes a crash that occurs when a sequence is preempted during decode and later re-enters the prefill path. In that scenario worker ranks hit:

AttributeError: 'Sequence' object has no attribute 'token_ids'

The root cause is that our cross-process Sequence serialization intentionally avoided sending token_ids after decode began, but prefill-after-preemption does need them.
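
To make the intent concrete, here is a minimal, self-contained sketch of the idea (this is not the actual diff in this PR, and the class/field names below are my own illustrative ones, not nanovllm's): keep token_ids in the serialized state whenever the sequence may be prefilled again, instead of dropping them unconditionally once decode has started.

# Illustrative sketch only; names are hypothetical, not the real nanovllm Sequence.
from dataclasses import dataclass, field
from typing import List
import pickle

@dataclass
class SequenceSketch:
    token_ids: List[int] = field(default_factory=list)  # prompt + generated ids
    last_token: int = -1                                 # enough for a plain decode step
    num_completion_tokens: int = 0                       # > 0 once decode has started
    preempted: bool = False                              # True -> will re-enter prefill

    def __getstate__(self):
        state = self.__dict__.copy()
        # Old behaviour: drop token_ids as soon as decode starts, to keep IPC small.
        # Sketched fix: keep them for preempted sequences, because the
        # re-prefill on the worker side reads token_ids again.
        if self.num_completion_tokens > 0 and not self.preempted:
            state.pop("token_ids")
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)

seq = SequenceSketch(token_ids=[1, 2, 3], num_completion_tokens=5, preempted=True)
restored = pickle.loads(pickle.dumps(seq))
assert restored.token_ids == [1, 2, 3]  # token_ids survive serialization after preemption

With the old condition (num_completion_tokens > 0 alone), the unpickled object simply has no token_ids attribute, which is exactly the AttributeError the worker ranks hit.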

Problem Reproduction

I use the following script to make this failure mode easy to reproduce and to sanity-check multi-GPU runs. It drives a large batch of long prompts to increase the chance of preemption followed by re-prefill.

import time
import argparse
from random import randint, seed
from nanovllm import LLM, SamplingParams

MODEL_PATH = "model_path"

def build_prompts(num_seqs, max_input_len, min_input_len=100, vocab=10000):
    # Random-length prompts of random token ids drawn from a small toy vocab.
    lens = [randint(min_input_len, max_input_len) for _ in range(num_seqs)]
    return [[randint(0, vocab - 1) for _ in range(L)] for L in lens]

def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--model", type=str, default=MODEL_PATH, help="HF model dir (local)")
    p.add_argument("--tp", type=int, default=1, help="tensor parallel size")
    p.add_argument("--num_seqs", type=int, default=256)
    p.add_argument("--max_input_len", type=int, default=4096)
    p.add_argument("--min_input_len", type=int, default=100)
    p.add_argument("--max_output_len", type=int, default=256)
    p.add_argument("--temperature", type=float, default=0.7)
    p.add_argument("--max_model_len", type=int, default=4096)
    p.add_argument("--vocab", type=int, default=10000,
                   help="toy vocab size for synthetic token IDs")
    p.add_argument("--seed", type=int, default=0)
    p.add_argument("--gpu_mem_util", type=float, default=0.9,
                   help="gpu_memory_utilization for KV cache planning")
    p.add_argument("--enforce_eager", action="store_true",
                   help="disable CUDA graph (debug/compat)")
    return p.parse_args()

def main():
    args = parse_args()
    seed(args.seed)

    t0 = time.time()
    llm = LLM(
        args.model,
        tensor_parallel_size=args.tp,
        enforce_eager=args.enforce_eager,
        max_model_len=args.max_model_len,
        gpu_memory_utilization=args.gpu_mem_util,
    )
    init_time = time.time() - t0
    print(f"[init] model={args.model} tp={args.tp} init_time={init_time:.3f}s")

    prompt_token_ids = build_prompts(
        num_seqs=args.num_seqs,
        max_input_len=args.max_input_len,
        min_input_len=args.min_input_len,
        vocab=args.vocab,
    )
    sampling_params = [
        SamplingParams(
            temperature=args.temperature,
            max_tokens=args.max_output_len,
            ignore_eos=True,  # force exactly max_output_len decode steps per sequence
        ) for _ in range(args.num_seqs)
    ]

    t0 = time.time()
    llm.generate(prompt_token_ids, sampling_params, use_tqdm=False)
    t1 = time.time()

    total_decode_tokens = args.num_seqs * args.max_output_len
    total_time = t1 - t0
    tokps = total_decode_tokens / total_time if total_time > 0 else float("inf")
    print(f"[throughput] num_seqs={args.num_seqs} "
          f"decode_tokens={total_decode_tokens} time={total_time:.3f}s "
          f"throughput={tokps:.2f} tok/s")

if __name__ == "__main__":
    main()

And then:

python bench_tp.py \
  --num_seqs 256 \
  --tp 2 \
  --max_input_len 2048 \
  --max_output_len 256 \
  --temperature 0.7 \
  --max_model_len 4096

With these settings the run crashes with the AttributeError above once a sequence is preempted and later re-enters prefill.

Following the nanovllm style, I aimed for a minimal, targeted fix; happy to iterate based on your feedback.
