diff --git a/doc/_static/custom.css b/doc/_static/custom.css
index 4fa4614e..0badd741 100644
--- a/doc/_static/custom.css
+++ b/doc/_static/custom.css
@@ -48,6 +48,6 @@ h3 {
 }
 
 p {
-  font-size: 0.875rem;
+  font-size: 1.0rem;
   color: #4b5563;
 }
diff --git a/doc/getting_started.md b/doc/getting_started.md
index 1ba925ce..c9af4649 100644
--- a/doc/getting_started.md
+++ b/doc/getting_started.md
@@ -4,9 +4,7 @@
 
 This guide will help you get started with the SageMaker HyperPod CLI and SDK to perform basic operations.
 
-## Cluster Management
-
-### List Available Clusters
+## List Available Clusters
 
 List all available SageMaker HyperPod clusters in your account:
 
@@ -19,20 +17,14 @@ hyp list-cluster [--region <region>] [--namespace <namespace>] [--output <json|table>]
 hyp set-cluster-context --cluster-name <cluster-name> [--namespace <namespace>]
 ```
 ````
 
 ````{tab-item} SDK
 ```python
-from sagemaker.hyperpod.hyperpod_manager import HyperPodManager
+from sagemaker.hyperpod import set_cluster_context
 
-HyperPodManager.set_context('<cluster-name>', region='us-east-2')
+set_cluster_context('<cluster-name>', region='us-east-2')
 ```
 ````
 `````
 
-**Parameters:**
-- `cluster-name` (string) - Required. The SageMaker HyperPod cluster name to configure with.
-- `namespace` (string) - Optional. The namespace to connect to. If not specified, the CLI will automatically discover accessible namespaces.
-
-### Get Current Cluster Context
+## Get Current Cluster Context
 
 View information about the currently configured cluster context:
 
@@ -69,61 +57,13 @@ hyp get-cluster-context
 
 ````{tab-item} SDK
 ```python
-from sagemaker.hyperpod.hyperpod_manager import HyperPodManager
-
-# Get current context information
-context = HyperPodManager.get_context()
-print(context)
-```
-````
-`````
-
-## Job Management
-
-### List Pods for a Training Job
+from sagemaker.hyperpod import get_cluster_context
 
-View all pods associated with a specific training job:
-
-`````{tab-set}
-````{tab-item} CLI
-```bash
-hyp list-pods hyp-pytorch-job --job-name <job-name>
-```
-````
-
-````{tab-item} SDK
-```python
-# List all pods created for this job
-pytorch_job.list_pods()
-```
-````
-`````
-
-**Parameters:**
-- `job-name` (string) - Required. The name of the job to list pods for.
-
-### Access Pod Logs
-
-View logs for a specific pod within a training job:
-
-`````{tab-set}
-````{tab-item} CLI
-```bash
-hyp get-logs hyp-pytorch-job --pod-name <pod-name> --job-name <job-name>
-```
-````
-
-````{tab-item} SDK
-```python
-# Check the logs from pod0
-pytorch_job.get_logs_from_pod("demo-pod-0")
+get_cluster_context()
 ```
 ````
 `````
 
-**Parameters:**
-- `job-name` (string) - Required. The name of the job to get logs for.
-- `pod-name` (string) - Required. The name of the pod to get logs from.
 
 ## Next Steps
diff --git a/doc/index.md b/doc/index.md
index d39c7541..eda4c433 100644
--- a/doc/index.md
+++ b/doc/index.md
@@ -14,7 +14,8 @@
 Example Notebooks
 API reference <_apidoc/modules>
 ```
 
-SageMaker HyperPod CLI and SDK provide a seamless way to manage distributed training and inference workloads on EKS-hosted SageMaker HyperPod clusters—without needing Kubernetes expertise. Use the powerful CLI to launch and monitor training jobs and endpoints, or leverage the Python SDK to do the same programmatically with minimal code, including support for JumpStart models, custom endpoints, and built-in monitoring.
+SageMaker HyperPod Command Line Interface (CLI) and Software Development Kit (SDK) provide a seamless way to manage distributed training and inference workloads on EKS-hosted SageMaker HyperPod clusters—without needing Kubernetes expertise.
+Use the powerful CLI to launch and monitor training jobs and endpoints, or leverage the Python SDK to do the same programmatically with minimal code, including support for JumpStart models, custom endpoints, and built-in monitoring.
 
 ## Start Here
diff --git a/doc/inference.md b/doc/inference.md
index 1ab86485..fce675c4 100644
--- a/doc/inference.md
+++ b/doc/inference.md
@@ -25,11 +25,9 @@ You can create inference endpoints using either JumpStart models or custom models
 ````{tab-item} CLI
 ```bash
 hyp create hyp-jumpstart-endpoint \
-    --version 1.0 \
-    --model-id jumpstart-model-id \
-    --instance-type ml.g5.8xlarge \
-    --endpoint-name endpoint-jumpstart \
-    --tls-output-s3-uri s3://sample-bucket
+    --model-id jumpstart-model-id \
+    --instance-type ml.g5.8xlarge \
+    --endpoint-name endpoint-jumpstart
 ```
 ````
 
@@ -69,19 +67,21 @@ js_endpoint.create()
 ````{tab-item} CLI
 ```bash
 hyp create hyp-custom-endpoint \
-    --version 1.0 \
-    --endpoint-name endpoint-custom \
-    --model-uri s3://my-bucket/model-artifacts \
-    --image 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-inference-image:latest \
-    --instance-type ml.g5.8xlarge \
-    --tls-output-s3-uri s3://sample-bucket
+    --version 1.0 \
+    --endpoint-name endpoint-s3 \
+    --model-name <model-name> \
+    --model-source-type s3 \
+    --instance-type <instance-type> \
+    --image-uri <image-uri> \
+    --container-port 8080 \
+    --model-volume-mount-name model-weights
 ```
 ````
 
 ````{tab-item} SDK
 ```python
 from sagemaker.hyperpod.inference.config.hp_custom_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
-from sagemaker.hyperpod.inference.hp_custom_endpoint import HPCustomEndpoint
+from sagemaker.hyperpod.inference.hp_custom_endpoint import HPEndpoint
 
 model = Model(
     model_source_type="s3",
@@ -115,7 +115,7 @@ endpoint_name = SageMakerEndpoint(name="endpoint-custom-pytorch")
 
 tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://sample-bucket")
 
-custom_endpoint = HPCustomEndpoint(
+custom_endpoint = HPEndpoint(
     model=model,
     server=server,
     resources=resources,
@@ -129,16 +129,18 @@ custom_endpoint.create()
 
 ````
 `````
 
-## Key Parameters
+### Key Parameters
 
 When creating an inference endpoint, you'll need to specify:
 
 - **endpoint-name**: Unique identifier for your endpoint
-- **model-id** (JumpStart): ID of the pre-trained JumpStart model
-- **model-uri** (Custom): S3 location of your model artifacts
-- **image** (Custom): Docker image containing your inference code
 - **instance-type**: The EC2 instance type to use
-- **tls-output-s3-uri**: S3 location to store TLS certificates
+- **model-id** (JumpStart): ID of the pre-trained JumpStart model
+- **image-uri** (Custom): Docker image containing your inference code
+- **model-name** (Custom): Name of the model to create on SageMaker
+- **model-source-type** (Custom): Source type: `fsx` or `s3`
+- **model-volume-mount-name** (Custom): Name of the model volume mount
+- **container-port** (Custom): Port on which the model server listens
 
 ## Managing Inference Endpoints
 
@@ -158,14 +160,14 @@ hyp list hyp-custom-endpoint
 ````{tab-item} SDK
 ```python
 from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
-from sagemaker.hyperpod.inference.hp_custom_endpoint import HPCustomEndpoint
+from sagemaker.hyperpod.inference.hp_custom_endpoint import HPEndpoint
 
 # List JumpStart endpoints
 jumpstart_endpoints = HPJumpStartEndpoint.list()
 print(jumpstart_endpoints)
 
 # List custom endpoints
-custom_endpoints = HPCustomEndpoint.list()
+custom_endpoints = HPEndpoint.list()
 print(custom_endpoints)
 ```
 ````
@@ -177,24 +179,24 @@ print(custom_endpoints)
 ````{tab-item} CLI
 ```bash
 # Describe JumpStart endpoint
-hyp describe hyp-jumpstart-endpoint --endpoint-name <endpoint-name>
+hyp describe hyp-jumpstart-endpoint --name <endpoint-name>
 
 # Describe custom endpoint
-hyp describe hyp-custom-endpoint --endpoint-name <endpoint-name>
+hyp describe hyp-custom-endpoint --name <endpoint-name>
 ```
 ````
 
 ````{tab-item} SDK
 ```python
-from sagemaker.hyperpod.inference import HyperPodJumpstartEndpoint, HyperPodCustomEndpoint
+from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
+from sagemaker.hyperpod.inference.hp_custom_endpoint import HPEndpoint
 
 # Get JumpStart endpoint details
-jumpstart_endpoint = HyperPodJumpstartEndpoint.load(endpoint_name="endpoint-jumpstart")
-jumpstart_details = jumpstart_endpoint.describe()
-print(jumpstart_details)
+jumpstart_endpoint = HPJumpStartEndpoint.get(name="js-endpoint-name", namespace="test")
+print(jumpstart_endpoint)
 
 # Get custom endpoint details
-custom_endpoint = HyperPodCustomEndpoint.load(endpoint_name="endpoint-custom")
+custom_endpoint = HPEndpoint.get(name="endpoint-custom")
 custom_details = custom_endpoint.describe()
 print(custom_details)
 ```
@@ -206,11 +208,15 @@ print(custom_details)
 
 ````{tab-item} CLI
 ```bash
+# Invoke JumpStart endpoint
+hyp invoke hyp-jumpstart-endpoint \
+    --endpoint-name <endpoint-name> \
+    --body '{"inputs":"What is the capital of USA?"}'
+
 # Invoke custom endpoint
 hyp invoke hyp-custom-endpoint \
     --endpoint-name <endpoint-name> \
-    --content-type "application/json" \
-    --payload '{"inputs": "What is machine learning?"}'
+    --body '{"inputs": "What is machine learning?"}'
 ```
 ````
 
@@ -223,29 +229,78 @@ print(response)
 ````
 `````
 
+### List Pods
+
+`````{tab-set}
+````{tab-item} CLI
+```bash
+# JumpStart endpoint
+hyp list-pods hyp-jumpstart-endpoint
+
+# Custom endpoint
+hyp list-pods hyp-custom-endpoint
+```
+````
+`````
+
+### Get Logs
+
+`````{tab-set}
+````{tab-item} CLI
+```bash
+# JumpStart endpoint
+hyp get-logs hyp-jumpstart-endpoint --pod-name <pod-name>
+
+# Custom endpoint
+hyp get-logs hyp-custom-endpoint --pod-name <pod-name>
+```
+````
+`````
+
+### Get Operator Logs
+
+`````{tab-set}
+````{tab-item} CLI
+```bash
+# JumpStart endpoint
+hyp get-operator-logs hyp-jumpstart-endpoint --since-hours 0.5
+
+# Custom endpoint
+hyp get-operator-logs hyp-custom-endpoint --since-hours 0.5
+```
+````
+
+````{tab-item} SDK
+```python
+print(endpoint.get_operator_logs(since_hours=0.1))
+```
+````
+`````
+
 ### Delete an Endpoint
 
 `````{tab-set}
 ````{tab-item} CLI
 ```bash
 # Delete JumpStart endpoint
-hyp delete hyp-jumpstart-endpoint --endpoint-name <endpoint-name>
+hyp delete hyp-jumpstart-endpoint --name <endpoint-name>
 
 # Delete custom endpoint
-hyp delete hyp-custom-endpoint --endpoint-name <endpoint-name>
+hyp delete hyp-custom-endpoint --name <endpoint-name>
 ```
 ````
 
 ````{tab-item} SDK
 ```python
-from sagemaker.hyperpod.inference import HyperPodJumpstartEndpoint, HyperPodCustomEndpoint
+from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
+from sagemaker.hyperpod.inference.hp_custom_endpoint import HPEndpoint
 
 # Delete JumpStart endpoint
-jumpstart_endpoint = HyperPodJumpstartEndpoint.load(endpoint_name="endpoint-jumpstart")
+jumpstart_endpoint = HPJumpStartEndpoint.get(name="endpoint-jumpstart")
 jumpstart_endpoint.delete()
 
 # Delete custom endpoint
-custom_endpoint = HyperPodCustomEndpoint.load(endpoint_name="endpoint-custom")
+custom_endpoint = HPEndpoint.get(name="endpoint-custom")
 custom_endpoint.delete()
 ```
 ````
diff --git a/doc/training.md b/doc/training.md
index 76bb4c79..33ca9e89 100644
--- a/doc/training.md
+++ b/doc/training.md
@@ -27,8 +27,8 @@ hyp create hyp-pytorch-job \
     --version 1.0 \
     --job-name test-pytorch-job \
     --image pytorch/pytorch:latest \
-    --command '["python", "train.py"]' \
-    --args '["--epochs", "10", "--batch-size", "32"]' \
+    --command '[python, train.py]' \
+    --args '[--epochs=10, --batch-size=32]' \
     --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
     --pull-policy "IfNotPresent" \
     --instance-type ml.p4d.24xlarge \
@@ -39,76 +39,68 @@ hyp create hyp-pytorch-job \
     --queue-name "training-queue" \
     --priority "high" \
     --max-retry 3 \
-    --volumes '["data-vol", "model-vol", "checkpoint-vol"]' \
-    --persistent-volume-claims '["shared-data-pvc", "model-registry-pvc"]' \
+    --volumes '[data-vol, model-vol, checkpoint-vol]' \
+    --persistent-volume-claims '[shared-data-pvc, model-registry-pvc]' \
     --output-s3-uri s3://my-bucket/model-artifacts
 ```
 ````
 
 ````{tab-item} SDK
 ```python
-from sagemaker.hyperpod import HyperPodPytorchJob
-from sagemaker.hyperpod.job import ReplicaSpec, Template, Spec, Container, Resources, RunPolicy, Metadata
+from sagemaker.hyperpod.training import (
+    HyperPodPytorchJob,
+    Containers,
+    ReplicaSpec,
+    Resources,
+    RunPolicy,
+    Spec,
+    Template,
+)
+from sagemaker.hyperpod.common.config import Metadata
+
 
-# Define job specifications
-nproc_per_node = "1"  # Number of processes per node
-replica_specs = [
+nproc_per_node = "1"
+replica_specs = [
     ReplicaSpec(
-        name = "pod",  # Replica name
-        template = Template(
-            spec = Spec(
-                containers = [
-                    Container(
-                        # Container name
-                        name="container-name",
-
-                        # Training image
-                        image="123456789012.dkr.ecr.us-west-2.amazonaws.com/my-training-image:latest",
-
-                        # Always pull image
-                        image_pull_policy="Always",
+        name="pod",
+        template=Template(
+            spec=Spec(
+                containers=[
+                    Containers(
+                        name="container-name",
+                        image="448049793756.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist",
+                        image_pull_policy="Always",
                         resources=Resources(
-                            # No GPUs requested
-                            requests={"nvidia.com/gpu": "0"},
-                            # No GPU limit
-                            limits={"nvidia.com/gpu": "0"},
+                            requests={"nvidia.com/gpu": "0"},
+                            limits={"nvidia.com/gpu": "0"},
                         ),
-                        # Command to run
-                        command=["python", "train.py"],
-                        # Script arguments
-                        args=["--epochs", "10", "--batch-size", "32"],
+                        # command=[]
                     )
                 ]
             )
         ),
     )
 ]
+run_policy = RunPolicy(clean_pod_policy="None")
 
-# Create the PyTorch job
 pytorch_job = HyperPodPytorchJob(
-    job_name="my-pytorch-job",
+    metadata=Metadata(name="demo"),
+    nproc_per_node="1",
     replica_specs=replica_specs,
-    run_policy=RunPolicy(
-        clean_pod_policy="Running"  # Keep pods after completion
-    )
+    run_policy=run_policy,
 )
 
-# Submit the job
 pytorch_job.create()
 ```
 ````
 `````
 
-## Key Parameters
+### Key Parameters
 
 When creating a training job, you'll need to specify:
 
 - **job-name**: Unique identifier for your training job
 - **image**: Docker image containing your training environment
-- **command**: Command to run inside the container
-- **args**: Arguments to pass to the command
-- **instance-type**: The EC2 instance type to use
-- **tasks-per-node**: Number of processes to run per node
-- **output-s3-uri**: S3 location to store model artifacts
+
 
 ## Managing Training Jobs
 
@@ -122,11 +114,12 @@ hyp list hyp-pytorch-job
 ````
 ````{tab-item} SDK
 ```python
-from sagemaker.hyperpod import HyperPodManager
+from sagemaker.hyperpod.training import HyperPodPytorchJob
+import yaml
 
 # List all PyTorch jobs
-jobs = HyperPodManager.list_jobs(job_type="hyp-pytorch-job")
-print(jobs)
+jobs = HyperPodPytorchJob.list()
+print(yaml.dump(jobs))
 ```
 ````
 `````
@@ -141,14 +134,44 @@ hyp describe hyp-pytorch-job --job-name <job-name>
 ````
 ````{tab-item} SDK
 ```python
-from sagemaker.hyperpod import HyperPodPytorchJob
+from sagemaker.hyperpod.training import HyperPodPytorchJob
 
 # Get an existing job
-job = HyperPodPytorchJob.load(job_name="my-pytorch-job")
+job = HyperPodPytorchJob.get(name="my-pytorch-job", namespace="my-namespace")
+
+print(job)
+```
+````
+`````
+
+### List Pods for a Training Job
+
+`````{tab-set}
+````{tab-item} CLI
+```bash
+hyp list-pods hyp-pytorch-job --job-name <job-name>
+```
+````
 
-# Get job details
-job_details = job.describe()
-print(job_details)
+````{tab-item} SDK
+```python
+print(pytorch_job.list_pods())
+```
+````
+`````
+
+### Get Logs from a Pod
+
+`````{tab-set}
+````{tab-item} CLI
+```bash
+hyp get-logs hyp-pytorch-job --pod-name test-pytorch-job-cli-pod-0 --job-name test-pytorch-job-cli
+```
+````
+
+````{tab-item} SDK
+```python
+print(pytorch_job.get_logs_from_pod("pod-name"))
 ```
 ````
 `````
@@ -163,10 +186,10 @@ hyp delete hyp-pytorch-job --job-name <job-name>
 ````
 ````{tab-item} SDK
 ```python
-from sagemaker.hyperpod import HyperPodPytorchJob
+from sagemaker.hyperpod.training import HyperPodPytorchJob
 
 # Get an existing job
-job = HyperPodPytorchJob.load(job_name="my-pytorch-job")
+job = HyperPodPytorchJob.get(name="my-pytorch-job", namespace="my-namespace")
 
 # Delete the job
 job.delete()
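
For quick manual verification of the SDK surface documented above, here is a minimal end-to-end sketch assembled only from the examples in this diff (set context, create a job, inspect it, delete it). The cluster name, region, container image, and pod name are placeholder values for illustration, not part of the change.

```python
from sagemaker.hyperpod import set_cluster_context
from sagemaker.hyperpod.common.config import Metadata
from sagemaker.hyperpod.training import (
    Containers,
    HyperPodPytorchJob,
    ReplicaSpec,
    Resources,
    RunPolicy,
    Spec,
    Template,
)

# Point the SDK at a HyperPod cluster (placeholder name and region).
set_cluster_context("my-cluster", region="us-east-2")

# Single-replica job definition mirroring the updated training.md example.
replica_specs = [
    ReplicaSpec(
        name="pod",
        template=Template(
            spec=Spec(
                containers=[
                    Containers(
                        name="container-name",
                        image="pytorch/pytorch:latest",  # placeholder image
                        image_pull_policy="Always",
                        resources=Resources(
                            requests={"nvidia.com/gpu": "0"},
                            limits={"nvidia.com/gpu": "0"},
                        ),
                    )
                ]
            )
        ),
    )
]

pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="demo"),
    nproc_per_node="1",
    replica_specs=replica_specs,
    run_policy=RunPolicy(clean_pod_policy="None"),
)

pytorch_job.create()

# Inspect the job the same way the docs do, then clean up.
print(pytorch_job.list_pods())
print(pytorch_job.get_logs_from_pod("demo-pod-0"))  # placeholder pod name
pytorch_job.delete()
```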