Motivation.
Modern models no longer target only text generation: they also generate images from text or image input. This RFC aims to open the stage towards supporting models that generate not only text but also images, either as a single output modality or as part of multi-modal output. One example of great interest to us is a family of models developed in cooperation with NASA for Earth observation (https://huggingface.co/ibm-nasa-geospatial/Prithvi-100M). These models work on satellite images and can be fine-tuned for several tasks, including flood forecasting, crop classification, etc.
This example model takes a fixed-size image as input and generates an image of the same size as output. Specifically, input images in GeoTIFF format are split into patches of 224×224, and each patch is passed through the model, which produces a tensor of the same size as its input. This resembles an autoregressive process, with the difference that at every iteration the data passed to the model is different and there is no dependency between subsequent patches. All the output patches are then “re-assembled” into a GeoTIFF image.
The goal of this RFC is to enable non-text output and to demonstrate it with the above-mentioned model.
Why support models in vLLM that do not generate text? Because consolidating on a single serving platform simplifies the software stack for those dealing with multiple types of models. Also, over time, models not targeting text could benefit from optimizations introduced by the vLLM community, similarly to what has happened for Transformer-based causal models.
Proposed Change.
I propose a two-phase approach. In the first phase, the model is integrated as a pooling model and input/output data is pre/post-processed outside of vLLM. In the second phase, the model is properly integrated, also taking care of processing the input image and generating the output image.
Phase 1: Basic enablement of Geospatial model in vLLM
Pre/post-processing of the image is done outside of vLLM. The input image is broken down into patches (generic tensors) and all patches are fed into vLLM. The output tensors are then collected, and post-processing re-creates the output image.
For this phase we could piggyback on the existing support for pooling models (thanks @Dar for the suggestion), where the hidden states of the model are returned as output.
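To make the split/run/re-assemble flow concrete, here is a minimal sketch of the pre/post-processing that would live outside of vLLM in this phase. The helper names and the fixed 224×224 patch size are illustrative only, and GeoTIFF reading/writing (e.g. via rasterio) is intentionally omitted.

```python
# Illustrative out-of-vLLM pre/post-processing; helper names are hypothetical
# and GeoTIFF I/O is left to the caller.
from typing import List

import torch

PATCH = 224  # patch side length used by the example model


def split_into_patches(image: torch.Tensor) -> List[torch.Tensor]:
    """Split a (C, H, W) tensor into PATCH x PATCH tiles (H and W assumed divisible)."""
    _, h, w = image.shape
    return [
        image[:, top:top + PATCH, left:left + PATCH]
        for top in range(0, h, PATCH)
        for left in range(0, w, PATCH)
    ]


def reassemble(patches: List[torch.Tensor], h: int, w: int) -> torch.Tensor:
    """Inverse of split_into_patches: tile the model outputs back into one image."""
    out = torch.empty(patches[0].shape[0], h, w, dtype=patches[0].dtype)
    it = iter(patches)
    for top in range(0, h, PATCH):
        for left in range(0, w, PATCH):
            out[:, top:top + PATCH, left:left + PATCH] = next(it)
    return out
```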
Changes for phase 1:
Step 1
Extend the output type for pooling models, which currently only targets embeddings, to also support a generic output type. This output would then be post-processed outside of vLLM.
```python
from dataclasses import dataclass
from typing import List, Union

import torch


@dataclass
class PoolingOutput:
    """The output data of one pooling output of a request.

    Args:
        outputs: Either a list of floats (an embedding vector, whose length
            depends on the model as listed in the embedding guide) or a
            generic list of tensors defined by the model.
    """
    outputs: Union[List[float], List[torch.Tensor]]
```
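To illustrate why the widened Union type is useful on the consumer side, here is a hedged sketch of how a caller outside of vLLM might branch on the two cases; the handle helper is hypothetical and not part of any vLLM API.

```python
# Hypothetical consumer-side sketch (not vLLM API): branch on the two possible
# payload types of the widened PoolingOutput.outputs field.
from typing import List, Union

import torch


def handle(outputs: Union[List[float], List[torch.Tensor]]) -> torch.Tensor:
    if outputs and isinstance(outputs[0], torch.Tensor):
        # Generic tensor output, e.g. one predicted image patch per tensor.
        return torch.stack(outputs)
    # Classic embedding vector returned by embedding models.
    return torch.tensor(outputs)
```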
Step 2
Right now, the only two possible methods for pooling models are encode and score. Would it make sense to define a third one, e.g. transform? This would just be for the sake of not using encode. Also, would it make sense to create a new entrypoint class in addition to LLM, something like VisionModel or similar? This is again for the sake of completeness, since this is not a language model.
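As a rough illustration of the naming question, a hypothetical sketch follows. Neither VisionModel nor transform() exists in vLLM today; transform() would simply delegate to the existing pooling path.

```python
# Hypothetical sketch only: neither VisionModel nor transform() exists in vLLM.
from vllm import LLM


class VisionModel:
    """Thin wrapper giving pooling-style inference a non-text-specific name."""

    def __init__(self, model: str, **kwargs):
        self._llm = LLM(model=model, **kwargs)

    def transform(self, inputs):
        # Delegate to the existing pooling entrypoint; "transform" is only an
        # alias chosen to avoid implying that the model returns embeddings.
        return self._llm.encode(inputs)
```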
Step 3
Exploit the batching capabilities of vLLM and present all the image patches to the vLLM entrypoint as a list of generic tensors, similar to what is done now when presenting multiple prompts at a time.
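For comparison, a minimal sketch of the intended call shape: today a list of prompts can be passed to the entrypoint in one call, and the proposal is to pass a list of patches the same way. The tensor input structure shown in the comment is hypothetical and not part of the current API.

```python
# Existing batched usage with a pooling model; the model name is only a
# placeholder and may require extra configuration flags.
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct")
outputs = llm.encode(["first prompt", "second prompt"])  # one output per prompt

# Proposed (not implemented): the same call shape, one entry per image patch.
# outputs = llm.encode([{"patch": patch} for patch in patches])
```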
Phase 2: Optimized integration of the model
Embed pre/post-processing of images into vLLM and handle the recursive pattern for processing larger images inside vLLM.
(This phase might need to be updated/changed depending on the outcome of Phase 1)
Step 1
Integrate processing of the input with the already available multimodal input support. Among the things to be considered here is that an input image could be presented encoded as a string (e.g. base64) instead of being stored in a file.
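A small sketch of how such an input could be normalized before reaching the multimodal machinery; the helper is hypothetical, assumes the string is either a path or a base64 payload, and leaves GeoTIFF-specific decoding out.

```python
# Hypothetical helper (not vLLM API): accept either a file path or a
# base64-encoded payload and return a decoded image. GeoTIFF-specific decoding
# (e.g. via rasterio) is omitted for brevity.
import base64
import io

from PIL import Image


def load_image(source: str) -> Image.Image:
    try:
        return Image.open(source)           # treat the string as a path first
    except (FileNotFoundError, OSError):
        raw = base64.b64decode(source)      # otherwise assume a base64 payload
        return Image.open(io.BytesIO(raw))
```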
Step 2
Introduce the possibility of “installing” an output processor that generates images in the required format, in the same spirit as what is done for input processors:
```python
@INPUT_REGISTRY.register_input_processor()
```
The idea would be to create an output registry and enable models to register an output processor, so that all the output generated for a sequence can be converted into the proper image format for the specific model.
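As a hypothetical sketch of what such a registry could look like (none of these names exist in vLLM today; it only mirrors the decorator pattern used for inputs, and tensor_to_geotiff is an assumed helper):

```python
# Hypothetical sketch: OUTPUT_REGISTRY and register_output_processor do not
# exist in vLLM; this only mirrors the decorator pattern used for inputs.
from typing import Callable, Dict, Type

import torch


class OutputRegistry:
    def __init__(self) -> None:
        self._processors: Dict[Type, Callable] = {}

    def register_output_processor(self, processor: Callable):
        """Class decorator binding an output processor to a model class."""
        def wrapper(model_cls: Type) -> Type:
            self._processors[model_cls] = processor
            return model_cls
        return wrapper

    def process(self, model_cls: Type, raw_output: torch.Tensor):
        # Fall back to the identity function if the model registered nothing.
        return self._processors.get(model_cls, lambda x: x)(raw_output)


OUTPUT_REGISTRY = OutputRegistry()

# Illustrative usage on a model definition:
# @OUTPUT_REGISTRY.register_output_processor(tensor_to_geotiff)
# class PrithviGeoSpatialMAE(nn.Module): ...
```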
Step 3
Create new output classes that allow the output to be presented in the form of an image; we could call them ImageOutput and ImageRequestOutput. Users would be able to either post-process the model output and return a string pointing to the generated file, or return the raw image output for post-processing outside of vLLM:
```python
from typing import Union

import torch


class ImageOutput:
    # The generated image, either as a string (e.g. a path to the written file)
    # or as the raw output tensor.
    image_out: Union[str, torch.Tensor]


class ImageRequestOutput:
    def __init__(self, request_id: str, outputs: "ImageOutput",
                 finished: bool):
        self.request_id = request_id
        self.finished = finished
        self.outputs = outputs
```
Step 4
Handle recursive processing of image patches within vLLM. Each image is fed to vLLM, pre-processed and split into patches. All patches are run through the model and all the output patches are handed to the output processor. Could we re-use some of the logic used for handling autoregressive queries? In this case we would already know how many times model inference should be executed (the number of image patches), and there is no need to append the output of one iteration to the input of the next; we just feed the next patch, and so on.
The output of the request in this case will still be of type ImageRequestOutput, with the image_data field optional and image_path populated with the path to the image generated during post-processing.
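To make the control flow concrete, here is a minimal sketch of the per-request driver loop under the assumptions above; the function names are placeholders and do not correspond to existing vLLM internals.

```python
# Placeholder sketch of the patch-driven request loop (not existing vLLM code).
from typing import Callable, List

import torch


def run_patched_request(
    patches: List[torch.Tensor],
    run_model: Callable[[torch.Tensor], torch.Tensor],
    output_processor: Callable[[List[torch.Tensor]], torch.Tensor],
) -> torch.Tensor:
    # Unlike autoregressive decoding, the number of iterations is known upfront
    # (one per patch) and the iterations are independent of each other.
    output_patches = [run_model(patch) for patch in patches]
    # The output processor re-assembles the patches into the final image, which
    # would then be wrapped in an ImageRequestOutput.
    return output_processor(output_patches)
```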
Feedback Period.
2 weeks
CC List.
@njhill @ywang96 @DarkLight1337 @robertgshaw2-neuralmagic
Any Other Things.
No response