diff --git a/docs/source/serving/deploying_with_cerebrium.rst b/docs/source/serving/deploying_with_cerebrium.rst
new file mode 100644
index 000000000000..ff0ac911108c
--- /dev/null
+++ b/docs/source/serving/deploying_with_cerebrium.rst
@@ -0,0 +1,109 @@
+.. _deploying_with_cerebrium:
+
+Deploying with Cerebrium
+============================
+
+.. raw:: html
+
+    vLLM_plus_cerebrium
+
+vLLM can be run on a cloud-based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
+
+To install the Cerebrium client, run:
+
+.. code-block:: console
+
+    $ pip install cerebrium
+    $ cerebrium login
+
+Next, create your Cerebrium project by running:
+
+.. code-block:: console
+
+    $ cerebrium init vllm-project
+
+Next, to install the required packages, add the following to your `cerebrium.toml`:
+
+.. code-block:: toml
+
+    [cerebrium.dependencies.pip]
+    vllm = "latest"
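+
+If you prefer reproducible deployments, you can pin vLLM instead of tracking `latest`. A minimal sketch, assuming the pip dependency table accepts standard pip version specifiers (the version shown is only an illustration):
+
+.. code-block:: toml
+
+    [cerebrium.dependencies.pip]
+    # Illustrative pin only -- use the version you have tested against.
+    vllm = "==0.4.2"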
+
+Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example). Add the following code to your `main.py`:
+
+.. code-block:: python
+
+    from vllm import LLM, SamplingParams
+
+    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+
+    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
+        outputs = llm.generate(prompts, sampling_params)
+
+        # Collect the prompt and generated text for each output.
+        results = []
+        for output in outputs:
+            prompt = output.prompt
+            generated_text = output.outputs[0].text
+            results.append({"prompt": prompt, "generated_text": generated_text})
+
+        return {"results": results}
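+
+Before deploying, you can smoke-test `run` locally. A minimal sketch, assuming a machine with a GPU large enough to load the model (the prompt is arbitrary):
+
+.. code-block:: python
+
+    if __name__ == "__main__":
+        # Local sanity check only; Cerebrium invokes run() directly,
+        # so this guard is not needed for the deployment itself.
+        print(run(["The capital of France is"]))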
+
+Then, run the following command to deploy it to the cloud:
+
+.. code-block:: console
+
+    $ cerebrium deploy
+
+If the deployment succeeds, you should be returned a CURL command that you can use to run inference. Just remember to end the URL with the name of the function you are calling (in our case `/run`):
+
+.. code-block:: console
+
+    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
+        -H 'Content-Type: application/json' \
+        -H 'Authorization: <JWT TOKEN>' \
+        --data '{
+            "prompts": [
+                "Hello, my name is",
+                "The president of the United States is",
+                "The capital of France is",
+                "The future of AI is"
+            ]
+        }'
+
+You should get a response like:
+
+.. code-block:: json
+
+    {
+        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
+        "result": {
+            "results": [
+                {
+                    "prompt": "Hello, my name is",
+                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
+                },
+                {
+                    "prompt": "The president of the United States is",
+                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
+                },
+                {
+                    "prompt": "The capital of France is",
+                    "generated_text": " Paris.\n"
+                },
+                {
+                    "prompt": "The future of AI is",
+                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
+                }
+            ]
+        },
+        "run_time_ms": 152.53663063049316
+    }
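+
+The same request can also be issued from Python rather than CURL. A minimal sketch using the `requests` library; the URL and token are the placeholders from the example above and must be replaced with your own values:
+
+.. code-block:: python
+
+    import requests
+
+    # Placeholder endpoint and token copied from the CURL example above.
+    url = "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run"
+    headers = {
+        "Authorization": "<JWT TOKEN>",
+        "Content-Type": "application/json",
+    }
+    payload = {"prompts": ["The capital of France is"], "temperature": 0.8}
+
+    response = requests.post(url, headers=headers, json=payload)
+    response.raise_for_status()
+    # The response shape follows the sample output shown above.
+    for item in response.json()["result"]["results"]:
+        print(item["prompt"], "->", item["generated_text"])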
+
+You now have an autoscaling endpoint where you only pay for the compute you use!
+
diff --git a/docs/source/serving/integrations.rst b/docs/source/serving/integrations.rst
index 83a8b5a88bd3..680ea523dfe9 100644
--- a/docs/source/serving/integrations.rst
+++ b/docs/source/serving/integrations.rst
@@ -8,6 +8,7 @@ Integrations
   deploying_with_kserve
   deploying_with_triton
   deploying_with_bentoml
+  deploying_with_cerebrium
   deploying_with_lws
   deploying_with_dstack
   serving_with_langchain