[LLM] fix doc test for Working with LLMs guide #55917
Merged: kouroshHakha merged 29 commits into ray-project:master from nrghosh:nrghosh/llms-doctest on Sep 24, 2025.
Changes from all commits (29 commits)
53de11a  [LLM] re-enable doc test for Working with LLMs guide #55796 (nrghosh)
3920f45  remove llm example from exclusion list (nrghosh)
70db35f  Refactor LLM examples to use external files (nrghosh)
be7cd15  fix lint (nrghosh)
bf73dcb  more lint (nrghosh)
cb7fc7b  more lint (nrghosh)
090e764  wip (nrghosh)
f71704a  doc lint wip (nrghosh)
897ecc4  Replace explicit line refs with semantic tags (nrghosh)
2f8d2d1  fix code snippet separation and doc comments (nrghosh)
4e685ad  improve tag / code blocks and explanation (nrghosh)
240738c  Merge remote-tracking branch 'origin/master' into nrghosh/llms-doctest (nrghosh)
15d157a  wip (nrghosh)
6c8761d  wip - vlm working (nrghosh)
ae20fb1  wip - basic llm working (nrghosh)
ac120b9  wip - basic llm working (nrghosh)
cd1f5bc  wip - formatting - all 3 examples working (nrghosh)
8c01472  Merge branch 'master' into nrghosh/llms-doctest (nrghosh)
dbbf6bf  wip lint (nrghosh)
525177e  wip (nrghosh)
6d7b188  wip lint fix ci (nrghosh)
d499f5c  wip (nrghosh)
e39006a  wip lint (nrghosh)
9721cf9  wip lint (nrghosh)
b4c1198  wip lint ci (nrghosh)
5d11faf  wip imports lint (nrghosh)
a87af3e  wip - gpu (nrghosh)
36dd406  gpu in ci (nrghosh)
057127f  ci + refactor embedding example out (nrghosh)
doc/source/data/doc_code/working-with-llms/basic_llm_example.py (200 additions, 0 deletions)
"""
This file serves as a documentation example and CI test for basic LLM batch inference.
"""

# Dependency setup
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "ray[llm]"])
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "--upgrade", "transformers"]
)
subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy==1.26.4"])


# __basic_llm_example_start__
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# __basic_config_example_start__
# Basic vLLM configuration
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "enable_chunked_prefill": True,
        "max_num_batched_tokens": 4096,  # Reduce if CUDA OOM occurs
        "max_model_len": 16384,
    },
    concurrency=1,
    batch_size=64,
)
# __basic_config_example_end__

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "You are a bot that responds with haikus."},
            {"role": "user", "content": row["item"]},
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=250,
        ),
    ),
    postprocess=lambda row: dict(
        answer=row["generated_text"],
        **row,  # This will return all the original columns in the dataset.
    ),
)

ds = ray.data.from_items(["Start of the haiku is: Complete this for me..."])

if __name__ == "__main__":
    try:
        import torch

        if torch.cuda.is_available():
            ds = processor(ds)
            ds.show(limit=1)
        else:
            print("Skipping basic LLM run (no GPU available)")
    except Exception as e:
        print(f"Skipping basic LLM run due to environment error: {e}")

# __hf_token_config_example_start__
# Configuration with Hugging Face token
config_with_token = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    runtime_env={"env_vars": {"HF_TOKEN": "your_huggingface_token"}},
    concurrency=1,
    batch_size=64,
)
# __hf_token_config_example_end__

# __parallel_config_example_start__
# Model parallelism configuration for larger models
# tensor_parallel_size=2: Split model across 2 GPUs for tensor parallelism
# pipeline_parallel_size=2: Use 2 pipeline stages (total 4 GPUs needed)
# Total GPUs required = tensor_parallel_size * pipeline_parallel_size = 4
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "max_model_len": 16384,
        "tensor_parallel_size": 2,
        "pipeline_parallel_size": 2,
        "enable_chunked_prefill": True,
        "max_num_batched_tokens": 2048,
    },
    concurrency=1,
    batch_size=32,
    accelerator_type="L4",
)
# __parallel_config_example_end__

# __runai_config_example_start__
# RunAI streamer configuration for optimized model loading
# Note: Install vLLM with runai dependencies: pip install -U "vllm[runai]>=0.10.1"
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "load_format": "runai_streamer",
        "max_model_len": 16384,
    },
    concurrency=1,
    batch_size=64,
)
# __runai_config_example_end__

# __lora_config_example_start__
# Multi-LoRA configuration
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "enable_lora": True,
        "max_lora_rank": 32,
        "max_loras": 1,
        "max_model_len": 16384,
    },
    concurrency=1,
    batch_size=32,
)
# __lora_config_example_end__

# __s3_config_example_start__
# S3 hosted model configuration
s3_config = vLLMEngineProcessorConfig(
    model_source="s3://your-bucket/your-model-path/",
    engine_kwargs={
        "load_format": "runai_streamer",
        "max_model_len": 16384,
    },
    concurrency=1,
    batch_size=64,
)
# __s3_config_example_end__

# __gpu_memory_config_example_start__
# GPU memory management configuration
# If you encounter CUDA out of memory errors, try these optimizations:
config_memory_optimized = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "max_model_len": 8192,
        "max_num_batched_tokens": 2048,
        "enable_chunked_prefill": True,
        "gpu_memory_utilization": 0.85,
        "block_size": 16,
    },
    concurrency=1,
    batch_size=16,
)

# For very large models or limited GPU memory:
config_minimal_memory = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "max_model_len": 4096,
        "max_num_batched_tokens": 1024,
        "enable_chunked_prefill": True,
        "gpu_memory_utilization": 0.75,
    },
    concurrency=1,
    batch_size=8,
)
# __gpu_memory_config_example_end__

# __embedding_config_example_start__
# Embedding model configuration
embedding_config = vLLMEngineProcessorConfig(
    model_source="sentence-transformers/all-MiniLM-L6-v2",
    task_type="embed",
    engine_kwargs=dict(
        enable_prefix_caching=False,
        enable_chunked_prefill=False,
        max_model_len=256,
        enforce_eager=True,
    ),
    batch_size=32,
    concurrency=1,
    apply_chat_template=False,
    detokenize=False,
)


# Example usage for embeddings
def create_embedding_processor():
    return build_llm_processor(
        embedding_config,
        preprocess=lambda row: dict(prompt=row["text"]),
        postprocess=lambda row: {
            "text": row["prompt"],
            "embedding": row["embeddings"],
        },
    )


# __embedding_config_example_end__

# __basic_llm_example_end__
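The alternative configurations in this file (Hugging Face token, parallelism, RunAI streamer, multi-LoRA, S3, and memory-tuned) all plug into the same build_llm_processor pattern shown at the top. Below is a minimal sketch of wiring one of them up end to end, reusing config_memory_optimized from above; the prompt text and the /tmp/haiku_output path are illustrative placeholders, not part of the merged example.

# Sketch only: reuses config_memory_optimized defined above. The prompt and
# output path are placeholders chosen for illustration.
import ray
from ray.data.llm import build_llm_processor

sketch_processor = build_llm_processor(
    config_memory_optimized,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["item"]}],
        sampling_params=dict(temperature=0.3, max_tokens=250),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

prompts = ray.data.from_items(["Write a haiku about CI pipelines."])
results = sketch_processor(prompts)
results.write_parquet("/tmp/haiku_output")  # persist results instead of .show()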
doc/source/data/doc_code/working-with-llms/embedding_example.py (63 additions, 0 deletions)
"""
Documentation example and test for embedding model batch inference.
"""

import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", "ray[llm]"])
subprocess.check_call([sys.executable, "-m", "pip", "install", "numpy==1.26.4"])


def run_embedding_example():
    # __embedding_example_start__
    import ray
    from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

    embedding_config = vLLMEngineProcessorConfig(
        model_source="sentence-transformers/all-MiniLM-L6-v2",
        task_type="embed",
        engine_kwargs=dict(
            enable_prefix_caching=False,
            enable_chunked_prefill=False,
            max_model_len=256,
            enforce_eager=True,
        ),
        batch_size=32,
        concurrency=1,
        apply_chat_template=False,
        detokenize=False,
    )

    embedding_processor = build_llm_processor(
        embedding_config,
        preprocess=lambda row: dict(prompt=row["text"]),
        postprocess=lambda row: {
            "text": row["prompt"],
            "embedding": row["embeddings"],
        },
    )

    texts = [
        "Hello world",
        "This is a test sentence",
        "Embedding models convert text to vectors",
    ]
    ds = ray.data.from_items([{"text": text} for text in texts])

    embedded_ds = embedding_processor(ds)
    embedded_ds.show(limit=1)
    # __embedding_example_end__


if __name__ == "__main__":
    try:
        import torch

        if torch.cuda.is_available():
            run_embedding_example()
        else:
            print("Skipping embedding example (no GPU available)")
    except Exception as e:
        print(f"Skipping embedding example: {e}")
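Each output row from the embedding processor carries an embedding vector alongside the original text column. The short sketch below shows one way to consume those vectors with plain NumPy; the take_all() materialization and the cosine-similarity loop are illustrative additions and assume run_embedding_example() is adjusted to return embedded_ds.

# Sketch only: assumes embedded_ds is returned from run_embedding_example().
import numpy as np

rows = embedded_ds.take_all()  # materialize the embedded rows on the driver
vectors = np.array([row["embedding"] for row in rows])

# Cosine similarity of every sentence against the first one.
query = vectors[0]
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
for row, score in zip(rows, scores):
    print(f"{score:.3f}  {row['text']}")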