From c37c8af2fb42b64ad57e062059eafe945d386723 Mon Sep 17 00:00:00 2001 From: ixlmar <206748156+ixlmar@users.noreply.github.com> Date: Fri, 27 Jun 2025 16:09:41 +0200 Subject: [PATCH] [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions (#5490) Cherry-picking abb7357f25858de38feb7eec3fd515c77e92bc21 (#5490) Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com> --- docker/Makefile | 13 ++++- docker/README.md | 44 ++++++++++++++--- docker/develop.md | 16 +++---- docker/release.md | 10 ++-- docs/requirements.txt | 1 + docs/source/conf.py | 25 +++++++++- docs/source/index.rst | 2 +- .../installation/build-from-source-linux.md | 23 +++++++-- docs/source/installation/containers.md | 10 ++++ docs/source/installation/grace-hopper.md | 20 -------- docs/source/installation/linux.md | 47 ++++++++++--------- docs/source/quick-start-guide.md | 26 ++++++++-- 12 files changed, 168 insertions(+), 69 deletions(-) create mode 100644 docs/source/installation/containers.md delete mode 100644 docs/source/installation/grace-hopper.md diff --git a/docker/Makefile b/docker/Makefile index 658aebfbfcc..6af004ef9cd 100644 --- a/docker/Makefile +++ b/docker/Makefile @@ -5,12 +5,19 @@ BASE_TAG ?= $(shell grep 'ARG BASE_TAG=' Dockerfile.multi | grep -o '= IMAGE_NAME ?= tensorrt_llm IMAGE_TAG ?= latest +# Used to share .cache when LOCAL_USER=1. Possibility of override is +# helpful, e.g., for use with Docker rootless mode. +HOME_DIR ?= $(HOME) + # Local user information USER_ID ?= $(shell id --user) USER_NAME ?= $(shell id --user --name) GROUP_ID ?= $(shell id --group) GROUP_NAME ?= $(shell id --group --name) +# Try to detect Docker rootless mode +IS_ROOTLESS ?= $(shell if [ "$$(docker context inspect --format '{{.Endpoints.docker.Host}}' "$$(docker context show)")" = "unix:///run/user/$(USER_ID)/docker.sock" ]; then echo 1; else echo 0; fi) + # Set this to 1 to add the current user to the docker image and run the container with the user LOCAL_USER ?= 0 ifeq ($(LOCAL_USER),1) @@ -108,7 +115,7 @@ endef @echo "Pulling docker image: $(IMAGE_WITH_TAG)" docker pull $(IMAGE_WITH_TAG) -DOCKER_RUN_OPTS ?= --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 +DOCKER_RUN_OPTS ?= --rm -it --ipc=host --ulimit stack=67108864 $(if $(filter 0,$(IS_ROOTLESS)),--ulimit memlock=-1) DOCKER_RUN_ARGS ?= # Check if NVIDIA_VISIBLE_DEVICES is set and not empty NVIDIA_VISIBLE_DEVICES_VAL = $(shell echo $$NVIDIA_VISIBLE_DEVICES) @@ -129,6 +136,9 @@ WORK_DIR ?= $(CODE_DIR) DOCKER_PULL ?= 0 %_run: +ifeq ($(IS_ROOTLESS),1) + @echo "Assuming Docker rootless mode." 
+endif ifeq ($(DOCKER_PULL),1) @$(MAKE) --no-print-directory $*_pull endif @@ -138,6 +148,7 @@ endif docker run $(DOCKER_RUN_OPTS) $(DOCKER_RUN_ARGS) \ $(GPU_OPTS) \ --volume $(SOURCE_DIR):$(CODE_DIR) \ + $(if $(filter 1,$(LOCAL_USER)),--volume ${HOME_DIR}/.cache:/home/${USER_NAME}/.cache:rw) \ --env "CCACHE_DIR=${CCACHE_DIR}" \ --env "CCACHE_BASEDIR=${CODE_DIR}" \ --env "CONAN_HOME=${CONAN_DIR}" \ diff --git a/docker/README.md b/docker/README.md index d986b8c8493..3bfac62a2c4 100644 --- a/docker/README.md +++ b/docker/README.md @@ -44,12 +44,17 @@ Containers can be started with the local user instead of `root` by appending `LO make -C docker devel_run LOCAL_USER=1 ``` -Specific CUDA architectures supported by the `wheel` can be specified WITH `CUDA_ARCHS`: +Specific CUDA architectures supported by the `wheel` can be specified with `CUDA_ARCHS`: ```bash make -C docker release_build CUDA_ARCHS="80-real;90-real" ``` +The `run` action maps the locally checked out source code into the `/code/tensorrt_llm` directory within the container. + +The `DOCKER_RUN_ARGS` option can be used to pass additional options to Docker, +e.g., in order to mount additional volumes into the container. + For more build options, see the variables defined in [`Makefile`](Makefile). ### NGC Integration @@ -62,8 +67,7 @@ make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 ``` As before, specifying `LOCAL_USER=1` will run the container with the local user's identity. Specifying `DOCKER_PULL=1` -is optional, but it will pull the latest image from the NGC Catalog. This will map the source code into the container -in the directory `/code/tensorrt_llm`. +is optional, but it will pull the latest image from the NGC Catalog. We also provide an image with pre-installed binaries for release. This can be used like so: @@ -72,7 +76,15 @@ make -C docker ngc-release_run LOCAL_USER=1 DOCKER_PULL=1 ``` If you want to deploy a specific version of TensorRT-LLM, you can specify the version with -`TRT_LLM_VERSION=`. The application examples and benchmarks are installed in `/app/tensorrt_llm`. +`IMAGE_TAG=` (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). The application examples and benchmarks are installed +in `/app/tensorrt_llm`. + +See the description of the `_run` make target in +[Building and Running Options](#building-and-running-options) for additional information and +running options. + +If you cannot access the NGC container images, you can instead locally build and use +equivalent containers as [described above](#building-docker-images-with-gnu-make). ### Jenkins Integration @@ -91,13 +103,21 @@ Start a new container using the same image as Jenkins using your local user acco make -C docker jenkins_run LOCAL_USER=1 ``` +If you do not have access to the [internal artifact repository](https://urm.nvidia.com/artifactory/sw-tensorrt-docker/tensorrt-llm/), you can instead either use the [NGC Develop +image](#ngc-integration) or [build an image locally](#building-docker-images-with-gnu-make). 
+
+#### Release images based on Jenkins image
+
 One may also build a release image based on the Jenkins development image:
 
 ```bash
 make -C docker trtllm_build CUDA_ARCHS="80-real;90-real"
 ```
 
-These images can be pushed to
+Note that the above requires access to the Jenkins development image from the
+[internal artifact repository](https://urm.nvidia.com/artifactory/sw-tensorrt-docker/tensorrt-llm/).
+
+The resulting images can be pushed to the
 [internal artifact repository](https://urm.nvidia.com/artifactory/sw-tensorrt-docker/tensorrt-llm-staging/release/):
 
 ```bash
@@ -112,4 +132,16 @@ make -C docker trtllm_run LOCAL_USER=1 DOCKER_PULL=1
 ```
 
 The argument `DOCKER_PULL=1` instructs `make` to pull the latest version of the image before deploying it in the container.
-By default, images are tagged by their `git` branch name and may be frequently updated.
+By default, the release images built in the above manner are tagged by their `git` branch name and may be frequently updated.
+
+### Docker rootless
+
+Some aspects require special treatment when using [Docker rootless mode](https://docs.docker.com/engine/security/rootless/). The `docker/Makefile` contains heuristics to detect Docker rootless mode. When rootless
+mode is assumed, the `%_run` targets in `docker/Makefile` print
+a corresponding message. The heuristics can be overridden by explicitly setting
+`IS_ROOTLESS=0` or `IS_ROOTLESS=1`.
+
+Docker rootless mode remaps UIDs and GIDs, and the remapped IDs (typically
+configured in `/etc/subuid` and `/etc/subgid`) generally do not coincide
+with the local UID/GID. To smoothly share a local working directory with containers started with `LOCAL_USER=1`, both the UID and the GID therefore need to be translated using a tool such as `bindfs`. In this
+case, set the `SOURCE_DIR` and `HOME_DIR` Makefile variables to the locations of the translated versions of the TensorRT-LLM working copy and the user home directory, respectively.
diff --git a/docker/develop.md b/docker/develop.md
index b8bead93e66..2e1884b5cc4 100644
--- a/docker/develop.md
+++ b/docker/develop.md
@@ -1,8 +1,8 @@
 # Description
 
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support
+TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
 state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
-create Python and C++ runtimes that orchestrate the inference execution in performant way.
+create Python and C++ runtimes that orchestrate the inference execution in a performant way.
 
 # Overview
 
@@ -22,15 +22,15 @@ With the top-level directory of the TensorRT-LLM repository cloned to your local
 command to start the development container:
 
 ```bash
-make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.xx.x
+make -C docker ngc-devel_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z
 ```
 
-where `x.xx.x` is the version of the TensorRT-LLM container to use. This command pulls the specified container from the
+where `x.y.z` is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel/tags)). This command pulls the specified container from the
 NVIDIA NGC registry, sets up the local user's account within the container, and launches it with full GPU support. The
 local source code of TensorRT-LLM will be mounted inside the container at the path `/code/tensorrt_llm` for seamless
-integration. Ensure that the image version matches the version of TensorRT-LLM in your current local git branch. Not
-specifying an `IMAGE_TAG` will attempt to resolve this automatically, but not every intermediate release might be
-accompanied by development container. In that case, use the latest version preceding the version of your development
+integration. Ensure that the image version matches the version of TensorRT-LLM in your currently checked out local git branch. Not
+specifying an `IMAGE_TAG` will attempt to resolve this automatically, but not every intermediate release is
+accompanied by a development container. In that case, use the latest version preceding the version of your development
 branch. If you prefer launching the container directly with `docker`, you can use the following command:
@@ -44,7 +44,7 @@ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
     --workdir /code/tensorrt_llm \
     --tmpfs /tmp:exec \
     --volume .:/code/tensorrt_llm \
-    nvcr.io/nvidia/tensorrt-llm/devel:x.xx.x
+    nvcr.io/nvidia/tensorrt-llm/devel:x.y.z
 ```
 
 Note that this will start the container with the user `root`, which may leave files with root ownership in your local
diff --git a/docker/release.md b/docker/release.md
index 30c07774fa8..b016a0b204e 100644
--- a/docker/release.md
+++ b/docker/release.md
@@ -1,8 +1,8 @@
 # Description
 
-TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support
+TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports
 state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to
-create Python and C++ runtimes that orchestrate the inference execution in performant way.
+create Python and C++ runtimes that orchestrate the inference execution in a performant way.
 
 # Overview
 
@@ -18,10 +18,10 @@ A typical command to launch the container is:
 
 ```bash
 docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
-    nvcr.io/nvidia/tensorrt-llm/release:x.xx.x
+    nvcr.io/nvidia/tensorrt-llm/release:x.y.z
 ```
 
-where x.xx.x is the version of the TensorRT-LLM container to use. To sanity check, run the following command:
+where x.y.z is the version of the TensorRT-LLM container to use (cf. [release history on GitHub](https://github.com/NVIDIA/TensorRT-LLM/releases) and [tags in NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)). To sanity check, run the following command:
 
 ```bash
 python3 -c "import tensorrt_llm"
@@ -34,7 +34,7 @@ Alternatively, if you have already cloned the TensorRT-LLM repository, you can u
 run the container:
 
 ```bash
-make -C docker ngc-release_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.xx.x
+make -C docker ngc-release_run LOCAL_USER=1 DOCKER_PULL=1 IMAGE_TAG=x.y.z
 ```
 
 This command pulls the specified container from the NVIDIA NGC registry, sets up the local user's account within the
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 4fa868db4d9..3255ae5e0cb 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -7,3 +7,4 @@ breathe
 pygit2
 sphinx_copybutton
 autodoc_pydantic
+sphinx-togglebutton
diff --git a/docs/source/conf.py b/docs/source/conf.py
index dcd043a7534..e3f05a859ab 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -56,7 +56,8 @@
     'sphinxarg.ext',
     'sphinx_click',
     'sphinx_copybutton',
-    'sphinxcontrib.autodoc_pydantic'
+    'sphinxcontrib.autodoc_pydantic',
+    'sphinx_togglebutton',
 ]
 
 autodoc_pydantic_model_show_json = True
@@ -77,8 +78,30 @@
 
 myst_enable_extensions = [
     "deflist",
+    "substitution",
 ]
 
+myst_substitutions = {
+    "version":
+    version,
+    "version_quote":
+    f"`{version}`",
+    "container_tag_admonition":
+    r"""
+```{admonition} Container image tags
+:class: dropdown note
+In the example shell commands, `x.y.z` corresponds to the TensorRT-LLM container
+version to use. If omitted, `IMAGE_TAG` will default to `tensorrt_llm.__version__`
+(e.g., this documentation was generated from the {{version_quote}} source tree).
+If this does not work, e.g., because a container for the version you are
+currently working with has not been released yet, you can try using a
+container published for a previous
+[GitHub pre-release or release](https://github.com/NVIDIA/TensorRT-LLM/releases)
+(see also [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)).
+```
+    """,
+}
+
 autosummary_generate = True
 copybutton_exclude = '.linenos, .gp, .go'
 copybutton_prompt_text = ">>> |$ |# "
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 313d799432b..405527b5f5f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -25,9 +25,9 @@ Welcome to TensorRT-LLM's Documentation!
 ..
    installation/overview.md
 
+   installation/containers.md
    installation/linux.md
    installation/build-from-source-linux.md
-   installation/grace-hopper.md
 
 
 .. toctree::
diff --git a/docs/source/installation/build-from-source-linux.md b/docs/source/installation/build-from-source-linux.md
index 9ed9ec0b21e..bf06b3b38f1 100644
--- a/docs/source/installation/build-from-source-linux.md
+++ b/docs/source/installation/build-from-source-linux.md
@@ -9,6 +9,8 @@ This document provides instructions for building TensorRT-LLM from source code o
 
 Use [Docker](https://www.docker.com) to build and run TensorRT-LLM. Instructions to install an environment to run Docker containers for the NVIDIA platform can be found [here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
 
+If you intend to build any TensorRT-LLM artifacts, such as any of the container images (note that there exist pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) container images in NGC), or the TensorRT-LLM Python wheel, you first need to clone the TensorRT-LLM repository:
+
 ```bash
 # TensorRT-LLM uses git-lfs, which needs to be installed in advance.
 apt-get update && apt-get -y install git git-lfs
@@ -26,6 +28,11 @@ There are two options to create a TensorRT-LLM Docker image. The approximate dis
 
 ### Option 1: Build TensorRT-LLM in One Step
 
+```{tip}
+:name: build-from-source-tip-release-container
+If you just want to run TensorRT-LLM, you can instead [use the pre-built TensorRT-LLM Release container images](containers).
+```
+
 TensorRT-LLM contains a simple command to create a Docker image. Note that if you plan to develop on TensorRT-LLM, we recommend using [Option 2: Build TensorRT-LLM Step-By-Step](#option-2-build-tensorrt-llm-step-by-step).
 
 ```bash
@@ -49,11 +56,16 @@ The `make` command supports the `LOCAL_USER=1` argument to switch to the local u
 
 Since TensorRT-LLM has been built and installed, you can skip the remaining steps.
 
-### Option 2: Build TensorRT-LLM Step-by-Step
+### Option 2: Container for building TensorRT-LLM Step-by-Step
 
 If you are looking for more flexibility, TensorRT-LLM has commands to create and run a development container in which TensorRT-LLM can be built.
 
-#### Create the Container
+```{tip}
+:name: build-from-source-tip-develop-container
+As an alternative to building the container image following the instructions below,
+you can pull a pre-built [TensorRT-LLM Develop container image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel) from NGC (see [here](containers) for information on container tags).
+Follow the linked catalog entry to enter a new container based on the pre-built container image, with the TensorRT-LLM source repository mounted into it. You can then skip this section and continue straight to [building TensorRT-LLM](#build-tensorrt-llm).
+```
 
 **On systems with GNU `make`**
 
@@ -100,6 +112,11 @@ If you are looking for more flexibility, TensorRT-LLM has commands to create and
 
 Once inside the container, follow the next steps to build TensorRT-LLM from source.
 
+### Advanced topics
+
+For more information on building and running various TensorRT-LLM container images,
+check <project:../../../docker/README.md>.
+
 ## Build TensorRT-LLM
 
 ### Option 1: Full Build with C++ Compilation
@@ -207,7 +224,7 @@ Alternatively, you can use editable installation for convenience during Python d
 TRTLLM_USE_PRECOMPILED=1 pip install -e .
 ```
 
-Setting `TRTLLM_USE_PRECOMPILED=1` enables downloading a prebuilt wheel of the version specified in `tensorrt_llm/version.py`, extracting compiled libraries into your current directory, thus skipping C++ compilation.
+Setting `TRTLLM_USE_PRECOMPILED=1` enables downloading a prebuilt wheel of the version specified in `tensorrt_llm/version.py`, extracting compiled libraries into your current directory, thus skipping C++ compilation. This version can be overridden by specifying `TRTLLM_USE_PRECOMPILED=x.y.z`.
 
 You can specify a custom URL or local path for downloading using `TRTLLM_PRECOMPILED_LOCATION`. For example, to use version 0.16.0 from PyPI:
diff --git a/docs/source/installation/containers.md b/docs/source/installation/containers.md
new file mode 100644
index 00000000000..2c49d3ce374
--- /dev/null
+++ b/docs/source/installation/containers.md
@@ -0,0 +1,10 @@
+# Pre-built release container images on NGC
+
+Pre-built TensorRT-LLM releases are made available as container images
+on NGC. This is likely the simplest way to obtain TensorRT-LLM. Please refer to the [documentation in NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) for usage instructions.
+
+{{container_tag_admonition}}
+
+Containers can also be built locally, see
+<project:../../../docker/README.md>
+for all related options.
diff --git a/docs/source/installation/grace-hopper.md b/docs/source/installation/grace-hopper.md
deleted file mode 100644
index cddaea932dd..00000000000
--- a/docs/source/installation/grace-hopper.md
+++ /dev/null
@@ -1,20 +0,0 @@
-(grace-hopper)=
-
-# Installing on Grace Hopper
-
-1. Install TensorRT-LLM (tested on Ubuntu 24.04).
-
-   ```bash
-   pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
-
-   sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
-   ```
-
-   If using the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) image, the prerequisite steps for installing CUDA-enabled PyTorch package and `libopenmpi-dev` are not required.
-
-2. Sanity check the installation by running the following in Python (tested on Python 3.12):
-
-   ```{literalinclude} ../../../examples/llm-api/quickstart_example.py
-   :language: python
-   :linenos:
-   ```
diff --git a/docs/source/installation/linux.md b/docs/source/installation/linux.md
index 5c7d38b0f44..6f1383f3ef8 100644
--- a/docs/source/installation/linux.md
+++ b/docs/source/installation/linux.md
@@ -1,18 +1,37 @@
 (linux)=
 
-# Installing on Linux
+# Installing on Linux via `pip`
 
 1. Install TensorRT-LLM (tested on Ubuntu 24.04).
 
-   ```bash
-   (Optional) pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+   ### Install prerequisites
 
-   sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
-   ```
+   Before the pre-built Python wheel can be installed via `pip`, a few
+   prerequisites must be put into place:
+
+   ```bash
+   # Optional step: Only required for Blackwell and Grace Hopper
+   pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+
+   sudo apt-get -y install libopenmpi-dev
+   ```
 
-   PyTorch CUDA 12.8 package is required for supporting NVIDIA Blackwell GPUs. On prior GPUs, this extra installation is not required.
+   PyTorch CUDA 12.8 package is required for supporting NVIDIA Blackwell and Grace Hopper GPUs. On prior GPUs, this extra installation is not required.
 
-   If using the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) image, the prerequisite steps for installing NVIDIA Blackwell-enabled PyTorch package and `libopenmpi-dev` are not required.
+   ```{tip}
+   Instead of manually installing the prerequisites as described
+   above, it is also possible to use the pre-built [TensorRT-LLM Develop container
+   image hosted on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel)
+   (see [here](containers) for information on container tags).
+   ```
+
+   ### Install pre-built TensorRT-LLM wheel
+
+   Once all prerequisites are in place, TensorRT-LLM can be installed as follows:
+
+   ```bash
+   pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
+   ```
 
 2. Sanity check the installation by running the following in Python (tested on Python 3.12):
 
@@ -48,17 +67,3 @@ There are some known limitations when you pip install pre-built TensorRT-LLM whe
 Unable to load extension modelopt_cuda_ext and falling back to CPU version.
 ```
 The installation of CUDA toolkit can be found in [CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/).
-
-3. Install inside the PyTorch NGC Container
-
-   The PyTorch NGC Container may lock Python package versions via the `/etc/pip/constraint.txt` file. When installing the pre-built TensorRT-LLM wheel inside the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch), you need to clear this file first.
-
-   ```bash
-   [ -f /etc/pip/constraint.txt ] && : > /etc/pip/constraint.txt
-   ```
-
-   PyTorch NGC Container typically includes a pre-installed `tensorrt` Python package. If there is a version mismatch between this pre-installed package and the version required by the TensorRT-LLM wheel, you will need to uninstall the existing `tensorrt` package before installing TensorRT-LLM.
-
-   ```bash
-   pip uninstall -y tensorrt
-   ```
diff --git a/docs/source/quick-start-guide.md b/docs/source/quick-start-guide.md
index 79517416418..e2b0582e8ca 100644
--- a/docs/source/quick-start-guide.md
+++ b/docs/source/quick-start-guide.md
@@ -2,9 +2,20 @@
 
 # Quick Start Guide
 
-This is the starting point to try out TensorRT-LLM. Specifically, this Quick Start Guide enables you to quickly get setup and send HTTP requests using TensorRT-LLM.
+This is the starting point to try out TensorRT-LLM. Specifically, this Quick Start Guide enables you to quickly get set up and send HTTP requests using TensorRT-LLM.
+
+## Installation
+
+There are multiple ways to install and run TensorRT-LLM. For most users, the options below are ordered from the simplest to the most involved. The approaches are equivalent in terms of the supported features.
+
+1. [](installation/containers)
+
+1. Pre-built release wheels on [PyPI](https://pypi.org/project/tensorrt-llm) (see [](installation/linux))
+
+1. [Building from source](installation/build-from-source-linux)
 
 ## LLM API
+
 The LLM API is a Python API designed to facilitate setup and inference with TensorRT-LLM directly within Python. It enables model optimization by simply specifying a HuggingFace repository name or a model checkpoint. The LLM API streamlines the process by managing checkpoint conversion, engine building, engine loading, and model inference, all through a single Python object.
 
 Here is a simple example to show how to use the LLM API with TinyLlama.
@@ -72,8 +83,15 @@ _Example Output_
 }
 ```
 
-For examples and command syntax, refer to the [trtllm-serve](commands/trtllm-serve.rst) section.
+For detailed examples and command syntax, refer to the [trtllm-serve](commands/trtllm-serve.rst) section. If you are running `trtllm-serve` inside a Docker container, you have two options for sending API requests:
+1. Expose port `8000` to access the server from outside the container.
+
+2. Open a new terminal and use the following command to directly attach to the running container:
+
+```bash
+docker exec -it <container_id> bash
+```
 
 ## Model Definition API
 
@@ -97,7 +115,7 @@ The model definition is a minimal example that shows some of the optimizations a
 
 ```console
 # From the root of the cloned repository, start the TensorRT-LLM container
-make -C docker release_run LOCAL_USER=1
+make -C docker ngc-release_run LOCAL_USER=1 IMAGE_TAG=x.y.z
 
 # Log in to huggingface-cli
 # You can get your token from huggingface.co/settings/token
@@ -115,6 +133,8 @@ trtllm-build --checkpoint_dir llama-3.1-8b-ckpt \
     --output_dir ./llama-3.1-8b-engine
 ```
 
+{{container_tag_admonition}}
+
 When you create a model definition with the TensorRT-LLM API, you build a graph of operations from [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) primitives that form the layers of your neural network.
These operations map to specific kernels; prewritten programs for the GPU. In this example, we included the `gpt_attention` plugin, which implements a FlashAttention-like fused attention kernel, and the `gemm` plugin, that performs matrix multiplication with FP32 accumulation. We also called out the desired precision for the full model as FP16, matching the default precision of the weights that you downloaded from Hugging Face. For more information about plugins and quantizations, refer to the [Llama example](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama) and {ref}`precision` section.
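For reference, the LLM API example that the Quick Start Guide refers to ("a simple example to show how to use the LLM API with TinyLlama", shipped as `examples/llm-api/quickstart_example.py` and included by the installation pages via `literalinclude`) follows the pattern sketched below. This is an illustrative sketch only, assuming the public `tensorrt_llm.LLM` and `SamplingParams` interfaces; the exact contents of the shipped script may differ between releases.

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Prompts to complete; any Hugging Face model ID or local checkpoint path
    # can be passed to LLM() below.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Downloads the checkpoint and prepares the runtime on first use.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```

Running this inside one of the containers or pip-based environments described above serves as the same sanity check that the installation pages perform.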