
[Misc]: Curious why this is happening: Running phi-3-vision on an RTX 3070 (8 GB VRAM) works with transformers but not with vLLM (goes out of memory) #5883

@chandeldivyam

Description


Anything you want to discuss about vllm.

I was wondering why this happens. I am new to this space and have been playing around with different machines, models, and frameworks.

I am able to run inference on a single image (on the RTX 3070) in around 70 s using the Hugging Face transformers snippet below. When I tried the same thing with vLLM (current main branch), it ran out of memory, which got me curious.

Transformers

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "microsoft/Phi-3-vision-128k-instruct"
device = "cuda:0"

model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir="/content/my_models/phi_3_vision",
                                             device_map="cuda",
                                             trust_remote_code=True,
                                             torch_dtype="auto",
                                             _attn_implementation="eager")

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def process_image(image_path):
    """Processes a single image and returns the model's response."""
    messages = [
        {
            "role": "user",
            "content": "<|image_1|>\nWhat is the destination address?",
        }
    ]

    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image = Image.open(image_path)

    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

    generation_args = {
        "max_new_tokens": 500,
        "temperature": 0.0,
        "do_sample": False,
    }

    # Run generation with the combined text + image inputs
    generate_ids = model.generate(
        **inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args
    )

    # Drop the prompt tokens so only the newly generated tokens are decoded
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1] :]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return response
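For reference, this is roughly how I call it (the image path is just an example of where I keep my test files):

print(process_image("images/image2.png"))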

vLLM

import os

from PIL import Image

from vllm import LLM, SamplingParams
from vllm.multimodal.image import ImagePixelData


def run_phi3v():
    model_path = "microsoft/Phi-3-vision-128k-instruct"
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        image_input_type="pixel_values",
        image_token_id=32044,
        image_input_shape="1,3,1008,1344",
        image_feature_size=1921,
        disable_image_processor=False,
        gpu_memory_utilization=0.7,
    )

    image = Image.open("images/image2.png")

    # Single-image prompt: expand <|image_1|> into the 1921 image placeholder tokens
    prompt = "<|user|>\n<|image_1|>\nWhat is the destination address?<|end|>\n<|assistant|>\n"  # noqa: E501
    prompt = prompt.replace("<|image_1|>", "<|image|>" * 1921 + "<s>")

    sampling_params = SamplingParams(temperature=0, max_tokens=64)

    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": ImagePixelData(image),
        },
        sampling_params=sampling_params)
    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)


if __name__ == "__main__":
    local_directory = "images"

    # Make sure the local directory exists or create it
    os.makedirs(local_directory, exist_ok=True)

    run_phi3v()
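One thing I have not verified yet: whether capping the context length helps, since (as far as I understand) vLLM needs to fit KV-cache space for the full 128k window when it profiles memory at startup. A minimal sketch of what I mean, reusing the same constructor arguments as above; max_model_len=4096 is an arbitrary value for illustration, and I don't know whether this actually fits in 8 GB:

def run_phi3v_low_mem():
    # Same setup as run_phi3v(), but cap the context length so vLLM does not
    # try to reserve KV-cache space for the full 128k window at startup.
    # max_model_len=4096 is a guess for illustration only.
    return LLM(
        model="microsoft/Phi-3-vision-128k-instruct",
        trust_remote_code=True,
        image_input_type="pixel_values",
        image_token_id=32044,
        image_input_shape="1,3,1008,1344",
        image_feature_size=1921,
        gpu_memory_utilization=0.7,
        max_model_len=4096,
    )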
