diff --git a/docs/source/serving/deploying_with_cerebrium.rst b/docs/source/serving/deploying_with_cerebrium.rst
new file mode 100644
index 000000000000..ff0ac911108c
--- /dev/null
+++ b/docs/source/serving/deploying_with_cerebrium.rst
@@ -0,0 +1,148 @@
+.. _deploying_with_cerebrium:
+
+Deploying with Cerebrium
+============================
+
+vLLM can be run on a cloud-based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
+
+To install the Cerebrium client, run:
+
+.. code-block:: console
+
+ $ pip install cerebrium
+ $ cerebrium login
+
+Next, create your Cerebrium project by running:
+
+.. code-block:: console
+
+ $ cerebrium init vllm-project
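+
+The init command scaffolds a new project directory. At the time of writing, it contains the ``main.py`` and ``cerebrium.toml`` files that the next steps edit (layout assumed; check the command's output):
+
+.. code-block:: console
+
+ $ ls vllm-project
+ cerebrium.toml  main.py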
+
+Next, to install the required packages, add the following to your ``cerebrium.toml``:
+
+.. code-block:: toml
+
+ [cerebrium.dependencies.pip]
+ vllm = "latest"
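+
+Depending on the model, you will likely also want to request a GPU for the deployment. Cerebrium's ``cerebrium.toml`` accepts a hardware section for this; the exact field names below are an assumption based on Cerebrium's documentation, so verify them against the CLI version you installed:
+
+.. code-block:: toml
+
+ # Field names assumed; check Cerebrium's docs for your release.
+ [cerebrium.hardware]
+ compute = "AMPERE_A10"  # a GPU with enough memory for the chosen model
+ gpu_count = 1
+ cpu = 2
+ memory = 16.0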
+
+Next, let us add the code that handles inference for the LLM of your choice (``mistralai/Mistral-7B-Instruct-v0.1`` for this example). Add the following to your ``main.py``:
+
+.. code-block:: python
+
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
+
+ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
+     sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
+     outputs = llm.generate(prompts, sampling_params)
+
+     # Collect each prompt together with its generated completion.
+     results = []
+     for output in outputs:
+         prompt = output.prompt
+         generated_text = output.outputs[0].text
+         results.append({"prompt": prompt, "generated_text": generated_text})
+
+     return {"results": results}
+
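+Before deploying, you can sanity-check the handler locally with a plain Python call to the ``run`` function defined above (this requires a GPU and downloads the model weights on first use):
+
+.. code-block:: python
+
+ # Quick local smoke test: python main.py
+ if __name__ == "__main__":
+     print(run(["The capital of France is"]))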
+
+Then, run the following command to deploy your app to the cloud:
+
+.. code-block:: console
+
+ $ cerebrium deploy
+
+If successful, you should be returned a curl command that you can use to call your inference endpoint. Just remember to end the URL with the function name you are calling (in our case ``/run``):
+
+.. code-block:: console
+
+ curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
+   -H 'Content-Type: application/json' \
+   -H 'Authorization: <JWT TOKEN>' \
+   --data '{
+     "prompts": [
+       "Hello, my name is",
+       "The president of the United States is",
+       "The capital of France is",
+       "The future of AI is"
+     ]
+   }'
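+
+Equivalently, you can call the endpoint from Python. The sketch below is a minimal example using the ``requests`` package; the URL and token are placeholders for the values printed by ``cerebrium deploy``:
+
+.. code-block:: python
+
+ import requests
+
+ # Substitute the endpoint URL and auth token from the deploy output.
+ url = "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run"
+ headers = {"Authorization": "<JWT TOKEN>"}
+ payload = {"prompts": ["The capital of France is"]}
+
+ response = requests.post(url, headers=headers, json=payload)
+ print(response.json())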
+
+You should get a response like:
+
+.. code-block:: json
+
+ {
+   "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
+   "result": {
+     "results": [
+       {
+         "prompt": "Hello, my name is",
+         "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
+       },
+       {
+         "prompt": "The president of the United States is",
+         "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
+       },
+       {
+         "prompt": "The capital of France is",
+         "generated_text": " Paris.\n"
+       },
+       {
+         "prompt": "The future of AI is",
+         "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
+       }
+     ]
+   },
+   "run_time_ms": 152.53663063049316
+ }
+
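+Each entry under ``results`` pairs a prompt with its completion. Continuing from the ``requests`` sketch above, you could print just the generated text like this:
+
+.. code-block:: python
+
+ for item in response.json()["result"]["results"]:
+     print(item["generated_text"])
+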
+You now have an autoscaling endpoint where you only pay for the compute you use!
+
diff --git a/docs/source/serving/integrations.rst b/docs/source/serving/integrations.rst
index 83a8b5a88bd3..680ea523dfe9 100644
--- a/docs/source/serving/integrations.rst
+++ b/docs/source/serving/integrations.rst
@@ -8,6 +8,7 @@ Integrations
deploying_with_kserve
deploying_with_triton
deploying_with_bentoml
+ deploying_with_cerebrium
deploying_with_lws
deploying_with_dstack
serving_with_langchain