[Doc] Update LLM tutorials and docker build version for IPEX 2.7 (#5506)
* update release link version in doc
* update llm readme with prebuilt wheels
* update llm env_activate with training
* remove training in env_activate.bat
---------
Co-authored-by: Ye Ting <[email protected]>
docs/tutorials/llm/int4_weight_only_quantization.md (+3 −3)
@@ -129,9 +129,9 @@ Intel® Extension for PyTorch\* implements Weight-Only Quantization for Intel®
 ### Environment Setup
 
-Please refer to the [env setup](https://github.com/intel/intel-extension-for-pytorch/blob/v2.6.10%2Bxpu/examples/gpu/llm/inference/README.md).
+Please refer to the [env setup](https://github.com/intel/intel-extension-for-pytorch/blob/v2.7.10%2Bxpu/examples/gpu/llm/inference/README.md).
 
-Example can be found at [Learn WOQ](https://github.com/intel/intel-extension-for-pytorch/tree/v2.6.10%2Bxpu/examples/gpu/llm/inference#learn-to-quantize-llm-and-run-inference).
+Example can be found at [Learn WOQ](https://github.com/intel/intel-extension-for-pytorch/tree/v2.7.10%2Bxpu/examples/gpu/llm/inference#learn-to-quantize-llm-and-run-inference).
 
 ### Run Weight-Only Quantization LLM on Intel® GPU
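For orientation, the flow behind the updated "Learn WOQ" link is sketched below. This is a minimal, non-authoritative sketch: the model id is a placeholder, and the qconfig helper (`get_weight_only_quant_qconfig_mapping` with `WoqWeightDtype.INT4`) is the one documented for IPEX's CPU WOQ path, assumed here to carry over; follow the linked v2.7.10+xpu example for the exact XPU flow.

```python
# Minimal INT4 weight-only quantization sketch around ipex.llm.optimize.
# Placeholder model id; the qconfig helper is an assumption borrowed from
# the CPU WOQ documentation. See the linked example for the validated flow.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).eval()

# INT4 weight-only quantization recipe (assumed helper and enum).
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping(
    weight_dtype=ipex.quantization.WoqWeightDtype.INT4
)

# Quantize the weights and apply LLM-specific optimizations for the GPU.
model = ipex.llm.optimize(model, quantization_config=qconfig, device="xpu")

inputs = tokenizer("What is weight-only quantization?", return_tensors="pt").to("xpu")
with torch.no_grad():
    output = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```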
docs/tutorials/llm/llm_optimize_transformers.md (+2 −2)
@@ -9,7 +9,7 @@ API documentation is available at [API Docs page](../api_doc.html#ipex.llm.optim
 ## Pseudocode of Common Usage Scenarios
 
-The following sections show pseudocode snippets to invoke Intel® Extension for PyTorch\* APIs to work with LLMs. Complete examples can be found at [the Example directory](https://github.com/intel/intel-extension-for-pytorch/tree/v2.6.10%2Bxpu/examples/gpu/llm/inference).
+The following sections show pseudocode snippets to invoke Intel® Extension for PyTorch\* APIs to work with LLMs. Complete examples can be found at [the Example directory](https://github.com/intel/intel-extension-for-pytorch/tree/v2.7.10%2Bxpu/examples/gpu/llm/inference).
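As a concrete anchor for that pseudocode, the single-instance FP16 flow looks roughly like the sketch below; the model id and generation arguments are placeholders, and the linked example directory remains the authoritative reference.

```python
# Minimal single-instance FP16 sketch of the ipex.llm.optimize flow.
# Placeholder model id; check the linked examples for exact usage.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.eval().to("xpu")

# Apply LLM-specific optimizations (operator fusion, optimized attention).
model = ipex.llm.optimize(model, dtype=torch.float16, device="xpu")

inputs = tokenizer("Once upon a time,", return_tensors="pt").to("xpu")
with torch.no_grad(), torch.xpu.amp.autocast(enabled=True, dtype=torch.float16):
    output = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```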
@@ … @@
 Distributed inference can be performed with `DeepSpeed`. Based on original Intel® Extension for PyTorch\* scripts, the following code changes are required.
 
-Check Distributed Examples in [LLM example](https://github.com/intel/intel-extension-for-pytorch/tree/v2.6.10%2Bxpu/examples/gpu/llm/inference) for complete codes.
+Check Distributed Examples in [LLM example](https://github.com/intel/intel-extension-for-pytorch/tree/v2.7.10%2Bxpu/examples/gpu/llm/inference) for complete codes.
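The distributed path layers DeepSpeed tensor parallelism under the same optimize call. A hedged sketch follows; the launcher, environment handling, and `init_inference` keyword arguments are assumptions that depend on the installed DeepSpeed version, so treat the linked Distributed Examples as the source of truth.

```python
# Hedged sketch: DeepSpeed AutoTP sharding combined with ipex.llm.optimize.
# Kwargs and env handling are assumptions; see the linked example for the
# complete, validated script and the matching launch command.
import os
import torch
import deepspeed
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model_id = "facebook/opt-1.3b"  # placeholder
world_size = int(os.environ.get("WORLD_SIZE", "1"))

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).eval()

# Shard the model across ranks; keep kernel injection off so IPEX kernels apply.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=False,
)

# Apply IPEX LLM optimizations to the sharded module on each rank.
model = ipex.llm.optimize(model.module, dtype=torch.float16, device="xpu")
```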
examples/gpu/llm/README.md (+71 −9)
@@ -1,34 +1,92 @@
 # LLM Optimization Overview
 
-Here you can find benchmarking scripts for large language models (LLM) text generation. These scripts:
+Here you can find examples for large language models (LLM) text generation. These scripts:
 
-- Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other models such as Baichuan2-13B and Phi3-mini.
+> [!NOTE]
+> New Llama models like Llama3.2-1B, Llama3.2-3B and Llama3.3-7B are also supported from release v2.7.10+xpu.
+
+- Include both inference/finetuning(lora)/bitsandbytes(qlora-finetuning)/training.
 - Include both single instance and distributed (DeepSpeed) use cases for FP16 optimization.
+- Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other models such as Baichuan2-13B and Phi3-mini.
 - Cover model generation inference with low precision cases for different models with best performance and accuracy (fp16 AMP and weight only quantization)
 
 ## Environment Setup
 
-### [Recommended] Docker-based environment setup with compilation from source
+### [Recommended] Docker-based environment setup with prebuilt wheel files
+
+```bash
+# Get the Intel® Extension for PyTorch* source code
@@ … @@
 ### Conda-based environment setup with prebuilt wheel files
+
+Make sure the driver packages are installed. Refer to [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/#installation?platform=gpu&version=v2.7.10%2Bxpu&os=linux%2Fwsl2&package=pip).
+
+```bash
+# Get the Intel® Extension for PyTorch* source code
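Whichever setup path is used, a quick sanity check that the wheels and driver stack are visible can be run before the LLM scripts; a minimal sketch:

```python
# Post-setup sanity check (illustrative): confirm the XPU build is usable.
import torch
import intel_extension_for_pytorch as ipex

print(torch.__version__)         # torch build matching the IPEX release
print(ipex.__version__)          # expect 2.7.10+xpu after this update
print(torch.xpu.is_available())  # True once the GPU driver stack is set up
```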
examples/gpu/llm/inference/README.md (+2 −2)
@@ -1,6 +1,6 @@
 # LLM Inference Overview
 
-Here you can find the inference benchmarking scripts for large language models (LLM) text generation. These scripts:
+Here you can find the inference examples for large language models (LLM) text generation. These scripts:
 
 - Support Llama, GPT-J, Qwen, OPT, Bloom model families and some other Chinese models such as GLM4-9B, Baichuan2-13B and Phi3-mini.
 - Include both single instance and distributed (DeepSpeed) use cases for FP16 optimization.
@@ -9,7 +9,7 @@ Here you can find the inference benchmarking scripts for large language models (
 ## Validated Models
 
-Currently, only support Transformers 4.44.2. Support for newer versions of Transformers and more models will be available in the future.
+Currently, only support Transformers 4.48.3. Support for newer versions of Transformers and more models will be available in the future.
 
 | MODEL FAMILY | Verified < MODEL ID > (Huggingface hub)| FP16 | Weight only quantization INT4 | Optimized on Intel® Data Center GPU Max Series (1550/1100) | Optimized on Intel® Core™ Ultra Processors with Intel® Arc™ Graphics | Optimized on Intel® Arc™ B-Series Graphics (B580) |
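Because the scripts are validated against a single pinned Transformers release, an explicit guard can fail fast on mismatched environments; a minimal sketch:

```python
# Illustrative guard for the pinned Transformers version noted above.
import transformers

EXPECTED = "4.48.3"
if transformers.__version__ != EXPECTED:
    raise RuntimeError(
        f"Validated against transformers=={EXPECTED}, found {transformers.__version__}"
    )
```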