
Conversation

@Cecilwang (Contributor) commented Mar 14, 2025

When MambaCacheManager is full, the following error may occur at this line:

IndexError: pop from empty list

This happens because when a seq_group reaches max_model_len, it is moved to scheduler._async_stopped instead of scheduler._finished_requests_ids, as seen in this line. As a result, MambaCacheManager cannot immediately release the seq_group in scheduler._async_stopped at the current step, leading to this issue.

However, this does not cause a memory leak, as scheduler.free_finished_seq_groups will eventually move _async_stopped to _finished_requests_ids after model execution, allowing it to be freed in the next step.
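To make this concrete, here is a minimal, self-contained sketch (plain Python with made-up names such as ToyCacheManager; it is not vLLM code) of a fixed-size slot pool that is only told about a stopped request one step late:

# Toy illustration only, not vLLM code. It mimics a cache manager that hands
# out a fixed number of slots and recycles them based on a list of finished
# request ids reported by the scheduler.
class ToyCacheManager:
    def __init__(self, num_slots: int):
        self.free_slots = list(range(num_slots))
        self.assigned: dict[str, int] = {}

    def allocate(self, request_id: str, finished_ids: list[str]) -> int:
        # Recycle the slots of requests reported as finished at this step.
        for rid in finished_ids:
            if rid in self.assigned:
                self.free_slots.append(self.assigned.pop(rid))
        slot = self.free_slots.pop()  # IndexError: pop from empty list
        self.assigned[request_id] = slot
        return slot


manager = ToyCacheManager(num_slots=1)
manager.allocate("req-0", finished_ids=[])

# "req-0" hits max_model_len and stops, but (like a seq_group parked in
# scheduler._async_stopped) it is not yet in the finished-ids list at this
# step, so its slot has not been returned:
try:
    manager.allocate("req-1", finished_ids=[])
except IndexError as exc:
    print(exc)  # pop from empty list

# One step later the scheduler reports it (analogous to
# free_finished_seq_groups eventually moving it to _finished_requests_ids),
# the slot is recycled, and there is no permanent leak:
print(manager.allocate("req-1", finished_ids=["req-0"]))  # 0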

FIX #10693, #13129

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@youkaichao requested a review from @tlrmchlsmth on March 22, 2025 02:38
@youkaichao (Member) commented

@tlrmchlsmth is the expert on this.

@Cecilwang changed the title from "[Bugfix][Mamba] Fix IndexError When MambaCacheManager is Full" to "[Bugfix][Mamba] Fix MambaCache leak" on Apr 15, 2025
@tlrmchlsmth (Member) left a comment


Please merge in the latest main. I think the bug in async_llm_engine.py was fixed in #13454.

Could you take a look to see whether that fixes the issue? If so, let's apply a consistent fix to llm_engine.py.

@Cecilwang (Contributor Author) commented

Please merge in the latest main. I think the bug in async_llm_engine.py was fixed in #13454.

Could you take a look to see whether that fixes the issue? If so, let's apply a consistent fix to llm_engine.py.

@tlrmchlsmth I confirmed that this leak has been fixed in the async engine in the latest main. Let's apply the same fix to the LLM engine.
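The general shape of such a fix, sketched at the toy level (the class and method names below are hypothetical; the actual llm_engine.py / #13454 change may look different), is to report async-stopped requests to the cache manager in the same step they stop instead of waiting until after model execution:

# Hypothetical sketch, not the actual vLLM change: when handing the cache
# manager the ids whose slots can be recycled this step, include the
# requests that were stopped asynchronously as well.
class ToyScheduler:
    def __init__(self) -> None:
        self._finished_requests_ids: list[str] = []
        self._async_stopped_ids: list[str] = []

    def collect_freeable_request_ids(self) -> list[str]:
        # Finished requests plus async-stopped ones, so the cache manager can
        # free their slots before allocating slots for newly scheduled
        # requests in the same step.
        ids = self._finished_requests_ids + self._async_stopped_ids
        self._finished_requests_ids = []
        # The async-stopped bookkeeping itself is still cleaned up later by
        # the scheduler after model execution, as in the original flow.
        return ids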

@Cecilwang (Contributor Author) commented Apr 17, 2025

@tlrmchlsmth, I forgot to test with --enable-chunked-prefill. Issue #13129 has been fixed both with and without --enable-chunked-prefill. However, the bug described in this PR has not been fixed yet.

Here is a test script to reproduce the issue.
cmd:

python debug_mamba_cache.py --model state-spaces/mamba-130m-hf --enable-chunked-prefill --gpu-memory-utilization 0.1 --max-model-len 128 --max-num-seqs 2

code debug_mamba_cache.py:

"""Benchmark offline inference throughput."""
import argparse
from typing import Any, Optional, Union

import uvloop
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
from vllm.entrypoints.openai.api_server import (
    build_async_engine_client_from_engine_args,
)
from vllm.inputs import TokensPrompt
from vllm.utils import FlexibleArgumentParser, merge_async_iterators


async def run_vllm_async(
    engine_args: AsyncEngineArgs,
):
    from vllm import SamplingParams

    async with build_async_engine_client_from_engine_args(
        engine_args, True
    ) as llm:
        # Add the requests to the engine.
        prompts: list[TokensPrompt] = [
            TokensPrompt(prompt_token_ids=[0 for i in range(10)]),
            TokensPrompt(prompt_token_ids=[0 for i in range(110)]),
            TokensPrompt(prompt_token_ids=[0 for i in range(10)]),
        ]
        sampling_params: SamplingParams = SamplingParams(
            n=1,
            temperature=1.0,
            top_p=1.0,
            ignore_eos=True,
            max_tokens=100,
        )

        generators = []
        for i, prompt in enumerate(prompts):
            generator = llm.generate(
                prompt, sampling_params, request_id=f"test{i}"
            )
            generators.append(generator)
        all_gens = merge_async_iterators(*generators)
        async for i, res in all_gens:
            pass


def main(args: argparse.Namespace):
    uvloop.run(
        run_vllm_async(
            AsyncEngineArgs.from_cli_args(args),
        )
    )


if __name__ == "__main__":
    parser = FlexibleArgumentParser(description="Benchmark the throughput.")
    parser = AsyncEngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    main(args)

@hmellor (Member) commented Jul 16, 2025

Since V0 is deprecated and this PR is for the V0 engine, I'm going to close this.

If this change should also be made to V1, feel free to update and re-open this PR, or create a new one for V1.

@hmellor closed this on Jul 16, 2025

Successfully merging this pull request may close these issues.

[Bug]: MambaCacheManager Can Possibly Run Out of Free Slots
