diff --git a/frameworks/pytorch/get_started_mnist_train.ipynb b/frameworks/pytorch/get_started_mnist_train.ipynb deleted file mode 100644 index 88ab2958d1..0000000000 --- a/frameworks/pytorch/get_started_mnist_train.ipynb +++ /dev/null @@ -1,458 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Train an MNIST model with PyTorch\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model on SageMaker using PyTorch. \n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 5 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [PyTorch Estimator](#PyTorch-Estimator)\n", - "1. [Implement the entry point for training](#Implement-the-entry-point-for-training)\n", - "1. [Set hyperparameters](#Set-hyperparameters)\n", - "1. [Set up channels for the training and testing data](#Set-up-channels-for-the-training-and-testing-data)\n", - "1. [Run the training script on SageMaker](#Run-the-training-script-on-SageMaker)\n", - "1. 
[Inspect and store model data](#Inspect-and-store-model-data)\n", - "1. [Test and debug the entry point before executing the training container](#Test-and-debug-the-entry-point-before-executing-the-training-container)\n", - "1. [Conclusion](#Conclusion)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import json\n", - "\n", - "import sagemaker\n", - "from sagemaker.pytorch import PyTorch\n", - "from sagemaker import get_execution_role\n", - "\n", - "\n", - "sess = sagemaker.Session()\n", - "region = sess.boto_region_name\n", - "\n", - "role = get_execution_role()\n", - "\n", - "output_path = \"s3://\" + sess.default_bucket() + \"/DEMO-mnist\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## PyTorch Estimator\n", - "\n", - "The `PyTorch` class allows you to run your training script on SageMaker\n", - "infrastructure in a containerized environment. In this notebook, we\n", - "refer to this container as the *training container*. \n", - "\n", - "You need to configure\n", - "it with the following parameters to set up the environment:\n", - "\n", - "- `entry_point`: A user-defined Python file used by the training container as the \n", - "instructions for training. We further discuss this file in the next subsection.\n", - "\n", - "- `role`: An IAM role to make AWS service requests\n", - "\n", - "- `instance_type`: The type of SageMaker instance to run your training script. \n", - "Set it to `local` if you want to run the training job on \n", - "the SageMaker instance you are using to run this notebook\n", - "\n", - "- `instance_count`: The number of instances to run your training job on. 
\n", - "Multiple instances are needed for distributed training.\n", - "\n", - "- `output_path`: \n", - "S3 bucket URI to save training output (model artifacts and output files)\n", - "\n", - "- `framework_version`: The version of PyTorch to use\n", - "\n", - "- `py_version`: The Python version to use\n", - "\n", - "For more information, see the [EstimatorBase API reference](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase)\n", - "\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Implement the entry point for training\n", - "\n", - "The entry point for training is a Python script that provides all \n", - "the code for training a PyTorch model. It is used by the SageMaker \n", - "PyTorch Estimator (`PyTorch` class above) as the entry point for running the training job.\n", - "\n", - "Under the hood, SageMaker PyTorch Estimator creates a docker image\n", - "with runtime environments \n", - "specified by the parameters you provide to instantiate the\n", - "estimator class, and it injects the training script into the \n", - "docker image as the entry point to run the container.\n", - "\n", - "In the rest of the notebook, we use *training image* to refer to the \n", - "docker image specified by the PyTorch Estimator and *training container*\n", - "to refer to the container that runs the training image. \n", - "\n", - "This means your training script is very similar to a training script\n", - "you might run outside Amazon SageMaker, but it can access the useful environment \n", - "variables provided by the training image. See [the complete list of environment variables](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md) for a complete \n", - "description of all environment variables your training script\n", - "can access. 
\n", - "\n", - "In this example, we use the training script `code/train.py`\n", - "as the entry point for our PyTorch Estimator.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize 'code/train.py'" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Set hyperparameters\n", - "\n", - "In addition, the PyTorch estimator allows you to pass command line arguments\n", - "to your training script via `hyperparameters`.\n", - "\n", - "Note: local mode is not supported in SageMaker Studio. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Set local_mode to True to run the training script on the machine that runs this notebook\n", - "\n", - "local_mode = False\n", - "\n", - "if local_mode:\n", - " instance_type = \"local\"\n", - "else:\n", - " instance_type = \"ml.c4.xlarge\"\n", - "\n", - "est = PyTorch(\n", - " entry_point=\"train.py\",\n", - " source_dir=\"code\", # directory of your training script\n", - " role=role,\n", - " framework_version=\"1.5.0\",\n", - " py_version=\"py3\",\n", - " instance_type=instance_type,\n", - " instance_count=1,\n", - " volume_size=250,\n", - " output_path=output_path,\n", - " hyperparameters={\"batch-size\": 128, \"epochs\": 1, \"learning-rate\": 1e-3, \"log-interval\": 100},\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The training container executes your training script like:\n", - "\n", - "```\n", - "python train.py --batch-size 128 --epochs 1 --learning-rate 1e-3 --log-interval 100\n", - "```" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Set up channels for the training and testing data\n", - "\n", - "Tell the `PyTorch` estimator where to find the training and \n", - "testing data. 
It can be a path to an S3 bucket, or a path\n", - "in your local file system if you use local mode. In this example,\n", - "we download the MNIST data from a public S3 bucket and upload it \n", - "to your default bucket. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "import boto3\n", - "from botocore.exceptions import ClientError\n", - "\n", - "# Download training and testing data from a public S3 bucket\n", - "\n", - "\n", - "def download_from_s3(data_dir=\"./data\", train=True):\n", - " \"\"\"Download MNIST dataset and convert it to numpy array\n", - "\n", - " Args:\n", - " data_dir (str): directory to save the data\n", - " train (bool): download training set\n", - "\n", - " Returns:\n", - " None\n", - " \"\"\"\n", - "\n", - " if not os.path.exists(data_dir):\n", - " os.makedirs(data_dir)\n", - "\n", - " if train:\n", - " images_file = \"train-images-idx3-ubyte.gz\"\n", - " labels_file = \"train-labels-idx1-ubyte.gz\"\n", - " else:\n", - " images_file = \"t10k-images-idx3-ubyte.gz\"\n", - " labels_file = \"t10k-labels-idx1-ubyte.gz\"\n", - "\n", - " # download objects\n", - " s3 = boto3.client(\"s3\")\n", - " bucket = f\"sagemaker-example-files-prod-{region}\"\n", - " for obj in [images_file, labels_file]:\n", - " key = os.path.join(\"datasets/image/MNIST\", obj)\n", - " dest = os.path.join(data_dir, obj)\n", - " if not os.path.exists(dest):\n", - " s3.download_file(bucket, key, dest)\n", - " return\n", - "\n", - "\n", - "download_from_s3(\"./data\", True)\n", - "download_from_s3(\"./data\", False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Upload to the default bucket\n", - "\n", - "prefix = \"DEMO-mnist\"\n", - "bucket = sess.default_bucket()\n", - "loc = sess.upload_data(path=\"./data\", bucket=bucket, key_prefix=prefix)\n", - "\n", - "channels = {\"training\": loc, \"testing\": loc}" - ] - }, - 
{ - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The keys of the `channels` dictionary are passed to the training image,\n", - "and it creates the environment variable `SM_CHANNEL_`. \n", - "\n", - "In this example, `SM_CHANNEL_TRAINING` and `SM_CHANNEL_TESTING` are created in the training image (see \n", - "how `code/train.py` accesses these variables). For more information,\n", - "see: [SM_CHANNEL_{channel_name}](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_channel_channel_name).\n", - "\n", - "If you want, you can create a channel for validation:\n", - "```\n", - "channels = {\n", - " 'training': train_data_loc,\n", - " 'validation': val_data_loc,\n", - " 'test': test_data_loc\n", - "}\n", - "```\n", - "You can then access this channel within your training script via\n", - "`SM_CHANNEL_VALIDATION`.\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Run the training script on SageMaker\n", - "Now, the training container has everything to execute your training\n", - "script. Start the container by calling the `fit()` method." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "est.fit(inputs=channels)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Inspect and store model data\n", - "\n", - "Now, the training is finished, and the model artifact has been saved in \n", - "the `output_path`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "pt_mnist_model_data = est.model_data\n", - "print(\"Model artifact saved at:\\n\", pt_mnist_model_data)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We store the variable `pt_mnist_model_data` in the current notebook kernel." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store pt_mnist_model_data" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Test and debug the entry point before executing the training container\n", - "\n", - "The entry point `code/train.py` can be executed in the training container. \n", - "When you develop your own training script, it is a good practice to simulate the container environment \n", - "in the local shell and test it before sending it to SageMaker, because debugging in a containerized environment\n", - "is rather cumbersome. The following script shows how you can test your training script:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize code/test_train.py" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "In this notebook, we trained a PyTorch model on the MNIST dataset by fitting a SageMaker estimator. For next steps on how to deploy the trained model and perform inference, see [Deploy a Trained PyTorch Model](https://sagemaker-examples.readthedocs.io/en/latest/frameworks/pytorch/get_started_mnist_deploy.html)." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/frameworks|pytorch|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/frameworks|pytorch|get_started_mnist_train.ipynb)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/pytorch-1.13-cpu-py39" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - }, - "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/frameworks/tensorflow/get_started_mnist_train.ipynb b/frameworks/tensorflow/get_started_mnist_train.ipynb deleted file mode 100644 index d5b5233846..0000000000 --- a/frameworks/tensorflow/get_started_mnist_train.ipynb +++ /dev/null @@ -1,460 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Train an MNIST model with TensorFlow\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 
\n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train a TensorFlow V2 model on MNIST on SageMaker.\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 5 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [TensorFlow Estimator](#TensorFlow-Estimator)\n", - "1. [Implement the training entry point](#Implement-the-training-entry-point)\n", - "1. [Set hyperparameters](#Set-hyperparameters)\n", - "1. [Set up channels for training and testing data](#Set-up-channels-for-training-and-testing-data)\n", - "1. [Run the training script on SageMaker](#Run-the-training-script-on-SageMaker)\n", - "1. [Inspect and store model data](#Inspect-and-store-model-data)\n", - "1. 
[Test and debug the entry point before running the training container](#Test-and-debug-the-entry-point-before-running-the-training-container)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import json\n", - "\n", - "import sagemaker\n", - "from sagemaker.tensorflow import TensorFlow\n", - "from sagemaker import get_execution_role\n", - "\n", - "sess = sagemaker.Session()\n", - "\n", - "role = get_execution_role()\n", - "\n", - "output_path = \"s3://\" + sess.default_bucket() + \"/DEMO-tensorflow/mnist\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TensorFlow Estimator\n", - "\n", - "The `TensorFlow` class allows you to run your training script on SageMaker\n", - "infrastructure in a containerized environment. In this notebook, we\n", - "refer to this container as the \"training container.\" \n", - "\n", - "Configure it with the following parameters to set up the environment:\n", - "\n", - "- `entry_point`: A user-defined Python file used by the training container as the instructions for training. We will further discuss this file in the next subsection.\n", - "\n", - "- `role`: An IAM role to make AWS service requests\n", - "\n", - "- `instance_type`: The type of SageMaker instance to run your training script. Set it to `local` if you want to run the training job on the SageMaker instance you are using to run this notebook.\n", - "\n", - "- `model_dir`: S3 bucket URI where the checkpoint data and models can be exported to during training (default: None). \n", - "To disable having model_dir passed to your training script, set `model_dir=False`\n", - "\n", - "- `instance_count`: The number of instances to run your training job on. 
Multiple instances are needed for distributed training.\n", - "\n", - "- `output_path`: the S3 bucket URI to save training output (model artifacts and output files).\n", - "\n", - "- `framework_version`: The TensorFlow version to use.\n", - "\n", - "- `py_version`: The Python version to use.\n", - "\n", - "For more information, see the [EstimatorBase API reference](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase).\n", - "\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Implement the training entry point\n", - "\n", - "The entry point for training is a Python script that provides all \n", - "the code for training a TensorFlow model. It is used by the SageMaker \n", - "TensorFlow Estimator (`TensorFlow` class above) as the entry point for running the training job.\n", - "\n", - "Under the hood, SageMaker TensorFlow Estimator downloads a docker image\n", - "with runtime environments \n", - "specified by the parameters to initiate the\n", - "estimator class and it injects the training script into the \n", - "docker image as the entry point to run the container.\n", - "\n", - "In the rest of the notebook, we use *training image* to refer to the \n", - "docker image specified by the TensorFlow Estimator and *training container*\n", - "to refer to the container that runs the training image. \n", - "\n", - "This means your training script is very similar to a training script\n", - "you might run outside Amazon SageMaker, but it can access the useful environment \n", - "variables provided by the training image. See [the complete list of environment variables](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md) for a complete \n", - "description of all environment variables your training script\n", - "can access. 
\n", - "\n", - "In this example, we use the training script `code/train.py`\n", - "as the entry point for our TensorFlow Estimator. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize 'code/train.py'" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Set hyperparameters\n", - "\n", - "In addition, the TensorFlow estimator allows you to pass command line arguments\n", - "to your training script via `hyperparameters`.\n", - "\n", - " Note: local mode is not supported in SageMaker Studio. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Set local_mode to True if you want to run the training script on the machine that runs this notebook\n", - "\n", - "local_mode = False\n", - "\n", - "if local_mode:\n", - " instance_type = \"local\"\n", - "else:\n", - " instance_type = \"ml.c4.xlarge\"\n", - "\n", - "est = TensorFlow(\n", - " entry_point=\"train.py\",\n", - " source_dir=\"code\", # directory of your training script\n", - " role=role,\n", - " framework_version=\"2.3.1\",\n", - " model_dir=False, # don't pass --model_dir to your training script\n", - " py_version=\"py37\",\n", - " instance_type=instance_type,\n", - " instance_count=1,\n", - " volume_size=250,\n", - " output_path=output_path,\n", - " hyperparameters={\n", - " \"batch-size\": 512,\n", - " \"epochs\": 1,\n", - " \"learning-rate\": 1e-3,\n", - " \"beta_1\": 0.9,\n", - " \"beta_2\": 0.999,\n", - " },\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The training container runs your training script like:\n", - "\n", - "```\n", - "python train.py --batch-size 512 --epochs 1 --learning-rate 0.001 --beta_1 0.9 --beta_2 0.999\n", - "```" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Set up channels for training 
and testing data\n", - "\n", - "Tell `TensorFlow` estimator where to find the training and \n", - "testing data. It can be a path to an S3 bucket, or a path\n", - "in your local file system if you use local mode. In this example,\n", - "we download the MNIST data from a public S3 bucket and upload it \n", - "to your default bucket. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "import boto3\n", - "from botocore.exceptions import ClientError\n", - "\n", - "# Download training and testing data from a public S3 bucket\n", - "\n", - "\n", - "def download_from_s3(data_dir=\"./data\", train=True):\n", - " \"\"\"Download MNIST dataset and convert it to numpy array\n", - "\n", - " Args:\n", - " data_dir (str): directory to save the data\n", - " train (bool): download training set\n", - "\n", - " Returns:\n", - " None\n", - " \"\"\"\n", - "\n", - " if not os.path.exists(data_dir):\n", - " os.makedirs(data_dir)\n", - "\n", - " if train:\n", - " images_file = \"train-images-idx3-ubyte.gz\"\n", - " labels_file = \"train-labels-idx1-ubyte.gz\"\n", - " else:\n", - " images_file = \"t10k-images-idx3-ubyte.gz\"\n", - " labels_file = \"t10k-labels-idx1-ubyte.gz\"\n", - "\n", - " # download objects\n", - " s3 = boto3.client(\"s3\")\n", - " bucket = f\"sagemaker-example-files-prod-{boto3.session.Session().region_name}\"\n", - " for obj in [images_file, labels_file]:\n", - " key = os.path.join(\"datasets/image/MNIST\", obj)\n", - " dest = os.path.join(data_dir, obj)\n", - " if not os.path.exists(dest):\n", - " s3.download_file(bucket, key, dest)\n", - " return\n", - "\n", - "\n", - "download_from_s3(\"./data\", True)\n", - "download_from_s3(\"./data\", False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Upload to the default bucket\n", - "\n", - "prefix = \"DEMO-mnist\"\n", - "bucket = sess.default_bucket()\n", - "loc = 
sess.upload_data(path=\"./data\", bucket=bucket, key_prefix=prefix)\n", - "\n", - "channels = {\"training\": loc, \"testing\": loc}" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The keys of the `channels` dictionary are passed to the training image,\n", - "and it creates the environment variable `SM_CHANNEL_`. \n", - "\n", - "In this example, `SM_CHANNEL_TRAINING` and `SM_CHANNEL_TESTING` are created in the training image (see \n", - "how `code/train.py` accesses these variables). For more information,\n", - "see: [SM_CHANNEL_{channel_name}](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_channel_channel_name).\n", - "\n", - "If you want, you can create a channel for validation:\n", - "```\n", - "channels = {\n", - " 'training': train_data_loc,\n", - " 'validation': val_data_loc,\n", - " 'test': test_data_loc\n", - "}\n", - "```\n", - "You can then access this channel within your training script via\n", - "`SM_CHANNEL_VALIDATION`." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Run the training script on SageMaker\n", - "Now, the training container has everything to run your training\n", - "script. Start the container by calling the `fit()` method." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "est.fit(inputs=channels)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Inspect and store model data\n", - "\n", - "Now, the training is finished, and the model artifact has been saved in \n", - "the `output_path`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tf_mnist_model_data = est.model_data\n", - "print(\"Model artifact saved at:\\n\", tf_mnist_model_data)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We store the variable `tf_mnist_model_data` in the current notebook kernel. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store tf_mnist_model_data" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Test and debug the entry point before running the training container\n", - "\n", - "The entry point `code/train.py` provided here has been tested and can be run in the training container. \n", - "When you develop your own training script, it is a good practice to simulate the container environment \n", - "in the local shell and test it before sending it to SageMaker, because debugging in a containerized environment\n", - "is rather cumbersome. The following script shows how you can test your training script:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize code/test_train.py" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "In this notebook, we trained a TensorFlow model on the MNIST dataset by fitting a SageMaker estimator. For next steps on how to deploy the trained model and perform inference, see [Deploy a Trained TensorFlow V2 Model](https://sagemaker-examples.readthedocs.io/en/latest/frameworks/tensorflow/get_started_mnist_deploy.html)." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. 
The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/frameworks|tensorflow|get_started_mnist_train.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/frameworks|tensorflow|get_started_mnist_train.ipynb)\n" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (TensorFlow 2.10.0 Python 3.9 CPU Optimized)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/tensorflow-2.10.1-cpu-py39-ubuntu20.04-sagemaker-v1.2" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - }, - "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." 
- }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/hyperparameter_tuning/tensorflow2_mnist/hpo_tensorflow2_mnist.ipynb b/hyperparameter_tuning/tensorflow2_mnist/hpo_tensorflow2_mnist.ipynb deleted file mode 100644 index 4a6c6a781c..0000000000 --- a/hyperparameter_tuning/tensorflow2_mnist/hpo_tensorflow2_mnist.ipynb +++ /dev/null @@ -1,454 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Hyperparameter Tuning with the SageMaker TensorFlow Container\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "This tutorial focuses on how to create a convolutional neural network model to train the [MNIST dataset](http://yann.lecun.com/exdb/mnist/) using the SageMaker TensorFlow container. It leverages hyperparameter tuning to run multiple training jobs with different hyperparameter combinations, to find the one with the best model training result.\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 10 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [Set Up the Environment](#Set-Up-the-Environment)\n", - "1. [Data](#Data)\n", - "1. [Run a TensorFlow Training Job](#Run-a-TensorFlow-Training-Job)\n", - "1. [Set Up Channels for Training and Testing Data](#Set-Up-Channels-for-Training-and-Testing-Data)\n", - "1. 
[Run a Hyperparameter Tuning Job](#Run-a-Hyperparameter-Tuning-Job)\n", - "1. [Deploy the Best Model](#Deploy-the-Best-Model)\n", - "1. [Evaluate](#Evaluate)\n", - "1. [Cleanup](#Cleanup)\n", - "\n", - "## Set Up the Environment \n", - "Set up a few things before starting the workflow:\n", - "\n", - "1. A boto3 session object to manage interactions with the Amazon SageMaker APIs. \n", - "2. An execution role which is passed to SageMaker to access your AWS resources." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import json\n", - "\n", - "import sagemaker\n", - "from sagemaker.tensorflow import TensorFlow\n", - "from sagemaker import get_execution_role\n", - "\n", - "sess = sagemaker.Session()\n", - "region = sess.boto_region_name\n", - "role = get_execution_role()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data\n", - "Download the MNIST data from a public S3 bucket and save it in a temporary directory." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "import boto3\n", - "from botocore.exceptions import ClientError\n", - "\n", - "public_bucket = f\"sagemaker-example-files-prod-{region}\"\n", - "local_data_dir = \"/tmp/data\"\n", - "\n", - "\n", - "# Download training and testing data from a public S3 bucket\n", - "def download_from_s3(data_dir=\"/tmp/data\", train=True):\n", - " \"\"\"Download MNIST dataset and convert it to numpy array\n", - "\n", - " Args:\n", - " data_dir (str): directory to save the data\n", - " train (bool): download training set\n", - "\n", - " Returns:\n", - " None\n", - " \"\"\"\n", - " # project root\n", - " if not os.path.exists(data_dir):\n", - " os.makedirs(data_dir)\n", - "\n", - " if train:\n", - " images_file = \"train-images-idx3-ubyte.gz\"\n", - " labels_file = \"train-labels-idx1-ubyte.gz\"\n", - " else:\n", - " images_file = \"t10k-images-idx3-ubyte.gz\"\n", - " labels_file = \"t10k-labels-idx1-ubyte.gz\"\n", - "\n", - " # download objects\n", - " s3 = boto3.client(\"s3\")\n", - " bucket = public_bucket\n", - " for obj in [images_file, labels_file]:\n", - " key = os.path.join(\"datasets/image/MNIST\", obj)\n", - " dest = os.path.join(data_dir, obj)\n", - " if not os.path.exists(dest):\n", - " s3.download_file(bucket, key, dest)\n", - " return\n", - "\n", - "\n", - "download_from_s3(local_data_dir, True)\n", - "download_from_s3(local_data_dir, False)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Run a TensorFlow Training Job\n", - "A TensorFlow training job is defined by using the `TensorFlow` estimator class. It lets you run your training script on SageMaker infrastructure in a containerized environment. 
For more information on how to instantiate it, see the example [Train an MNIST model with TensorFlow](https://sagemaker-examples.readthedocs.io/en/latest/frameworks/tensorflow/get_started_mnist_train.html#TensorFlow-Estimator)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "est = TensorFlow(\n", - " entry_point=\"train.py\",\n", - " source_dir=\"code\", # directory of your training script\n", - " role=role,\n", - " framework_version=\"2.3.1\",\n", - " model_dir=\"/opt/ml/model\",\n", - " py_version=\"py37\",\n", - " instance_type=\"ml.m5.4xlarge\",\n", - " instance_count=1,\n", - " volume_size=250,\n", - " hyperparameters={\n", - " \"batch-size\": 512,\n", - " \"epochs\": 4,\n", - " },\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Set Up Channels for Training and Testing Data\n", - "Upload the MNIST data to the default bucket of your AWS account and pass the S3 URI as the channels of training and testing data for the `TensorFlow` estimator class. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prefix = \"mnist\"\n", - "bucket = sess.default_bucket()\n", - "loc = sess.upload_data(path=local_data_dir, bucket=bucket, key_prefix=prefix)\n", - "\n", - "channels = {\"training\": loc, \"testing\": loc}" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Run a Hyperparameter Tuning Job\n", - "Now that you have set up the training job and the input data channels, you are ready to train the model with hyperparameter search.\n", - "\n", - "Set up the hyperparameter tuning job with the following steps:\n", - "* Define the ranges of hyperparameters we plan to tune. 
In this example, we tune the learning rate.\n", - "* Define the objective metric for the tuning job to optimize.\n", - "* Create a hyperparameter tuner with the above settings, as well as tuning resource configurations.\n", - "\n", - "\n", - "\n", - "\n", - "For a typical ML model, there are three kinds of hyperparameters:\n", - "\n", - "- Categorical parameters need to take one value from a discrete set. We define this by passing the list of possible values to `CategoricalParameter(list)`\n", - "- Continuous parameters can take any real number value between the minimum and maximum value, defined by `ContinuousParameter(min, max)`\n", - "- Integer parameters can take any integer value between the minimum and maximum value, defined by `IntegerParameter(min, max)`\n", - "\n", - "Learning rate is a continuous variable, so we define its range\n", - "by `ContinuousParameter`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.tuner import ContinuousParameter, HyperparameterTuner\n", - "\n", - "hyperparamter_range = {\"learning-rate\": ContinuousParameter(1e-4, 1e-3)}" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next we specify the objective metric that we'd like to tune and its definition, which includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. In this particular case, our script emits the average test loss, and we use it as the objective metric. We set the `objective_type` to `Minimize`, so that hyperparameter tuning seeks to minimize the objective metric when searching for the best hyperparameter value." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "objective_metric_name = \"average test loss\"\n", - "objective_type = \"Minimize\"\n", - "metric_definitions = [\n", - " {\n", - " \"Name\": \"average test loss\",\n", - " \"Regex\": \"Test Loss: ([0-9\\\\.]+)\",\n", - " }\n", - "]" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, you'll create a `HyperparameterTuner` object. It takes the following parameters:\n", - "- The `TensorFlow` estimator you previously created.\n", - "- Your hyperparameter ranges.\n", - "- Objective metric name and definition.\n", - "- Tuning resource configurations such as the number of training jobs to run in total, and how many training jobs to run in parallel." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "tuner = HyperparameterTuner(\n", - " est,\n", - " objective_metric_name,\n", - " hyperparamter_range,\n", - " metric_definitions,\n", - " max_jobs=3,\n", - " max_parallel_jobs=3,\n", - " objective_type=objective_type,\n", - ")\n", - "\n", - "tuner.fit(inputs=channels)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Deploy the Best Model\n", - "After training with hyperparameter optimization, you can deploy the best-performing model (by the objective metric you defined) to a SageMaker endpoint. For more information about deploying a model to a SageMaker endpoint, see the example [Deploy a Trained TensorFlow V2 Model](https://sagemaker-examples.readthedocs.io/en/latest/frameworks/tensorflow/get_started_mnist_deploy.html)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "predictor = tuner.deploy(initial_instance_count=1, instance_type=\"ml.m5.xlarge\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Evaluate\n", - "Now, you can evaluate the best-performing model by invoking the endpoint with the MNIST test set. The test data needs to be readily consumable by the model, so we arrange them into the correct shape that is accepted by a TensorFlow model. We also normalize them so that the pixel values have mean 0 and standard deviation 1, since this is the convention used to train the model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import random\n", - "import gzip\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "\n", - "%matplotlib inline\n", - "\n", - "\n", - "images_file = \"t10k-images-idx3-ubyte.gz\"\n", - "\n", - "\n", - "def read_mnist(data_dir, images_file):\n", - " \"\"\"Byte string to numpy arrays\"\"\"\n", - " with gzip.open(os.path.join(data_dir, images_file), \"rb\") as f:\n", - " images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28)\n", - " return images\n", - "\n", - "\n", - "X = read_mnist(local_data_dir, images_file)\n", - "\n", - "# randomly sample 16 images to inspect\n", - "mask = random.sample(range(X.shape[0]), 16)\n", - "samples = X[mask]\n", - "\n", - "# plot the images\n", - "fig, axs = plt.subplots(nrows=1, ncols=16, figsize=(16, 1))\n", - "\n", - "for i, splt in enumerate(axs):\n", - " splt.imshow(samples[i])\n", - "\n", - "# preprocess the data to be consumed by the model\n", - "\n", - "\n", - "def normalize(x, axis):\n", - " eps = np.finfo(float).eps\n", - "\n", - " mean = np.mean(x, axis=axis, keepdims=True)\n", - " # avoid division by zero\n", - " std = np.std(x, axis=axis, keepdims=True) + eps\n", - " return (x - mean) / 
std\n", - "\n", - "\n", - "samples = normalize(samples, axis=(1, 2))\n", - "samples = np.expand_dims(samples, axis=3)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "predictions = predictor.predict(samples)[\"predictions\"]\n", - "\n", - "# pick the most likely class for each image (argmax over the class scores)\n", - "predictions = np.array(predictions, dtype=np.float32)\n", - "predictions = np.argmax(predictions, axis=1)\n", - "\n", - "print(\"Predictions: \", *predictions)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleanup\n", - "If you do not plan to continue using the endpoint, delete it to free up resources." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "predictor.delete_endpoint()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/hyperparameter_tuning|tensorflow2_mnist|hpo_tensorflow2_mnist.ipynb)\n" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (TensorFlow 2.10.0 Python 3.9 CPU Optimized)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/tensorflow-2.10.1-cpu-py39-ubuntu20.04-sagemaker-v1.2" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/introduction_to_applying_machine_learning/huggingface_sentiment_classification/huggingface_sentiment.ipynb b/introduction_to_applying_machine_learning/huggingface_sentiment_classification/huggingface_sentiment.ipynb deleted file mode 100644 index c3e0729705..0000000000 --- a/introduction_to_applying_machine_learning/huggingface_sentiment_classification/huggingface_sentiment.ipynb +++ /dev/null @@ -1,1158 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Hugging Face Sentiment Classification\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "__Binary Classification with `Trainer` and `sst2` dataset__" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Runtime\n", - "\n", - "This notebook takes approximately 45 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [Introduction](#Introduction) \n", - "2. [Development environment and permissions](#Development-environment-and-permissions)\n", - " 1. [Installation](#Installation) \n", - " 2. [Development environment](#Development-environment) \n", - " 3. [Permissions](#Permissions)\n", - "3. [Pre-processing](#Pre-processing) \n", - " 1. [Tokenize sentences](#Tokenize-sentences) \n", - " 2. [Upload data to sagemaker_session_bucket](#Upload-data-to-sagemaker_session_bucket) \n", - "4. [Fine-tune the model and start a SageMaker training job](#Fine-tune-the-model-and-start-a-SageMaker-training-job) \n", - " 1. [Create an Estimator and start a training job](#Create-an-Estimator-and-start-a-training-job) \n", - " 2. [Estimator Parameters](#Estimator-Parameters) \n", - " 3. [Attach a previous training job to an estimator](#Attach-a-previous-training-job-to-an-estimator) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Introduction\n", - "\n", - "Welcome to our end-to-end binary text classification example. This notebook uses Hugging Face's `transformers` library with a custom Amazon sagemaker-sdk extension to fine-tune a pre-trained transformer on binary text classification. The pre-trained model is fine-tuned using the `sst2` dataset. 
To get started, we need to set up the environment with a few prerequisite steps for permissions, configurations, and so on. \n", - "\n", - "This notebook is adapted from Hugging Face's notebook [Huggingface Sagemaker-sdk - Getting Started Demo](https://github.com/huggingface/notebooks/blob/master/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb) and provided here courtesy of Hugging Face.\n", - "\n", - "\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 40 minutes to run.\n", - "\n", - "NOTE: You can run this notebook in SageMaker Studio, a SageMaker notebook instance, or your local machine. This notebook was tested in a notebook instance using the conda\\_pytorch\\_p36 kernel.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Development environment and permissions " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Installation\n", - "\n", - "_*Note:* We install the required libraries from Hugging Face and AWS. You also need PyTorch, if you haven't installed it already._" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "!pip install \"sagemaker\" \"transformers\" \"datasets[s3]\" \"s3fs\" --upgrade" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Development environment " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import sagemaker.huggingface" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Permissions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "_If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. 
You can read more at [SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html)._" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import sagemaker\n", - "\n", - "sess = sagemaker.Session()\n", - "# The SageMaker session bucket is used for uploading data, models and logs\n", - "# SageMaker will automatically create this bucket if it doesn't exist\n", - "sagemaker_session_bucket = None\n", - "if sagemaker_session_bucket is None and sess is not None:\n", - " # Set to default bucket if a bucket name is not given\n", - " sagemaker_session_bucket = sess.default_bucket()\n", - "\n", - "role = sagemaker.get_execution_role()\n", - "sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)\n", - "\n", - "print(f\"Role arn: {role}\")\n", - "print(f\"Bucket: {sess.default_bucket()}\")\n", - "print(f\"Region: {sess.boto_region_name}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pre-processing\n", - "\n", - "We use the `datasets` library to pre-process the `sst2` dataset (Stanford Sentiment Treebank). After pre-processing, the dataset is uploaded to the `sagemaker_session_bucket` for use within the training job. The [sst2](https://nlp.stanford.edu/sentiment/index.html) dataset consists of 67349 training samples and _ testing samples of highly polar movie reviews." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Download the dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from datasets import Dataset\n", - "from transformers import AutoTokenizer\n", - "import pandas as pd\n", - "import boto3\n", - "\n", - "# Tokenizer used in pre-processing\n", - "tokenizer_name = \"distilbert-base-uncased\"\n", - "\n", - "# S3 key prefix for the data\n", - "s3_prefix = \"DEMO-samples/datasets/sst\"\n", - "\n", - "# Download the SST2 data from s3\n", - "s3 = boto3.client(\"s3\")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{sess.boto_region_name}\",\n", - " \"datasets/text/SST2/sst2.test\",\n", - " \"sst2.test\",\n", - ")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{sess.boto_region_name}\",\n", - " \"datasets/text/SST2/sst2.train\",\n", - " \"sst2.train\",\n", - ")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{sess.boto_region_name}\",\n", - " \"datasets/text/SST2/sst2.val\",\n", - " \"sst2.val\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Tokenize sentences" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Download tokenizer\n", - "tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)\n", - "\n", - "\n", - "# Tokenizer helper function\n", - "def tokenize(batch):\n", - " return tokenizer(batch[\"text\"], padding=\"max_length\", truncation=True)\n", - "\n", - "\n", - "# Load dataset\n", - "test_df = pd.read_csv(\"sst2.test\", sep=\"delimiter\", header=None, engine=\"python\", names=[\"line\"])\n", - "train_df = pd.read_csv(\"sst2.train\", sep=\"delimiter\", header=None, engine=\"python\", names=[\"line\"])\n", - "\n", - "test_df[[\"label\", \"text\"]] = test_df[\"line\"].str.split(\" \", 1, expand=True)\n", - "train_df[[\"label\", 
\"text\"]] = train_df[\"line\"].str.split(\" \", 1, expand=True)\n", - "\n", - "test_df.drop(\"line\", axis=1, inplace=True)\n", - "train_df.drop(\"line\", axis=1, inplace=True)\n", - "\n", - "test_df[\"label\"] = pd.to_numeric(test_df[\"label\"], downcast=\"integer\")\n", - "train_df[\"label\"] = pd.to_numeric(train_df[\"label\"], downcast=\"integer\")\n", - "\n", - "train_dataset = Dataset.from_pandas(train_df)\n", - "test_dataset = Dataset.from_pandas(test_df)\n", - "\n", - "# Tokenize dataset\n", - "train_dataset = train_dataset.map(tokenize, batched=True)\n", - "test_dataset = test_dataset.map(tokenize, batched=True)\n", - "\n", - "# Set format for pytorch\n", - "train_dataset = train_dataset.rename_column(\"label\", \"labels\")\n", - "train_dataset.set_format(\"torch\", columns=[\"input_ids\", \"attention_mask\", \"labels\"])\n", - "\n", - "test_dataset = test_dataset.rename_column(\"label\", \"labels\")\n", - "test_dataset.set_format(\"torch\", columns=[\"input_ids\", \"attention_mask\", \"labels\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Upload data to `sagemaker_session_bucket`\n", - "\n", - "After processing the `datasets`, we use the `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload the dataset to S3." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import botocore\n", - "from datasets.filesystems import S3FileSystem\n", - "\n", - "s3 = S3FileSystem()\n", - "\n", - "# save train_dataset to s3\n", - "training_input_path = f\"s3://{sess.default_bucket()}/{s3_prefix}/train\"\n", - "train_dataset.save_to_disk(training_input_path, fs=s3)\n", - "\n", - "# save test_dataset to s3\n", - "test_input_path = f\"s3://{sess.default_bucket()}/{s3_prefix}/test\"\n", - "test_dataset.save_to_disk(test_input_path, fs=s3)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Fine-tune the model and start a SageMaker training job\n", - "\n", - "In order to create a SageMaker training job, we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In an Estimator, we define which fine-tuning script should be used as `entry_point`, which `instance_type` should be used, which `hyperparameters` are passed in, etc:\n", - "\n", - "\n", - "\n", - "```python\n", - "hf_estimator = HuggingFace(entry_point=\"train.py\",\n", - " source_dir=\"./scripts\",\n", - " base_job_name=\"huggingface-sdk-extension\",\n", - " instance_type=\"ml.p3.2xlarge\",\n", - " instance_count=1,\n", - " transformers_version=\"4.4\",\n", - " pytorch_version=\"1.6\",\n", - " py_version=\"py36\",\n", - " role=role,\n", - " hyperparameters = {\"epochs\": 1,\n", - " \"train_batch_size\": 32,\n", - " \"model_name\":\"distilbert-base-uncased\"\n", - " })\n", - "```\n", - "\n", - "When we create a SageMaker training job, SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from the `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. 
Then, it starts the training job by running:\n", - "\n", - "```bash\n", - "/opt/conda/bin/python train.py --epochs 1 --model_name distilbert-base-uncased --train_batch_size 32\n", - "```\n", - "\n", - "The `hyperparameters` defined in the `HuggingFace` estimator are passed in as named arguments. \n", - "\n", - "SageMaker provides useful properties about the training environment through various environment variables, including the following:\n", - "\n", - "* `SM_MODEL_DIR`: A string representing the path where the training job writes the model artifacts. After training, artifacts in this directory are uploaded to S3 for model hosting.\n", - "\n", - "* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.\n", - "\n", - "* `SM_CHANNEL_XXXX`: A string representing the path to the directory that contains the input data for the specified channel. For example, if you specify two input channels in the Hugging Face estimator's `fit()` call, named `train` and `test`, the environment variables `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` are set.\n", - "\n", - "\n", - "To run the training job locally, you can define `instance_type=\"local\"` or `instance_type=\"local_gpu\"` for GPU usage.\n", - "\n", - "_Note: local mode is not supported in SageMaker Studio._\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "!pygmentize ./scripts/train.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create an Estimator and start a training job" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from sagemaker.huggingface import HuggingFace\n", - "\n", - "# Hyperparameters which are passed into the training job\n", - "hyperparameters = {\"epochs\": 1, \"train_batch_size\": 32, \"model_name\": \"distilbert-base-uncased\"}" - ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "hf_estimator = HuggingFace(\n", - " entry_point=\"train.py\",\n", - " source_dir=\"./scripts\",\n", - " instance_type=\"ml.p3.2xlarge\",\n", - " instance_count=1,\n", - " role=role,\n", - " transformers_version=\"4.12\",\n", - " pytorch_version=\"1.9\",\n", - " py_version=\"py38\",\n", - " hyperparameters=hyperparameters,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Start the training job with the uploaded dataset as input\n", - "hf_estimator.fit({\"train\": training_input_path, \"test\": test_input_path})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Deploy the endpoint\n", - "\n", - "To deploy the endpoint, call `deploy()` on the HuggingFace estimator object, passing in the desired number of instances and instance type." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "predictor = hf_estimator.deploy(1, \"ml.p3.2xlarge\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Then use the returned predictor object to perform inference." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "sentiment_input = {\"inputs\": \"I love using the new Inference DLC.\"}\n", - "\n", - "predictor.predict(sentiment_input)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We see that the fine-tuned model classifies the test sentence \"I love using the new Inference DLC.\" as having positive sentiment with 98% probability!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, delete the endpoint." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "predictor.delete_endpoint()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Extras" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Estimator Parameters" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [], - "source": [ - "print(f\"Container image used for training job: \\n{hf_estimator.image_uri}\\n\")\n", - "print(f\"S3 URI where the trained model is located: \\n{hf_estimator.model_data}\\n\")\n", - "print(f\"Latest training job name for this estimator: \\n{hf_estimator.latest_training_job.name}\\n\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "hf_estimator.sagemaker_session.logs_for_job(hf_estimator.latest_training_job.name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Attach a previous training job to an estimator\n", - "\n", - "In SageMaker, you can attach a previous training job to an estimator to continue training, get results, etc." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from sagemaker.estimator import Estimator\n", - "\n", - "# Uncomment the following lines and supply your training job name\n", - "\n", - "# old_training_job_name = \"\"\n", - "# hf_estimator_loaded = Estimator.attach(old_training_job_name)\n", - "# hf_estimator_loaded.model_data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|huggingface_sentiment_classification|huggingface_sentiment.ipynb)\n" - ] - } - ], - "metadata": { - "availableInstances": [ - { - "_defaultOrder": 0, - "_isFastLaunch": true, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 4, - "name": "ml.t3.medium", - "vcpuNum": 2 - }, - { - "_defaultOrder": 1, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.t3.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 2, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.t3.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 3, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.t3.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 4, - "_isFastLaunch": true, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.m5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 5, - 
"_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.m5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 6, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.m5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 7, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.m5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 8, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.m5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 9, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.m5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 10, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.m5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 11, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.m5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 12, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.m5d.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 13, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.m5d.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 14, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.m5d.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 15, - "_isFastLaunch": false, 
- "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.m5d.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 16, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.m5d.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 17, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.m5d.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 18, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.m5d.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 19, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.m5d.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 20, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": true, - "memoryGiB": 0, - "name": "ml.geospatial.interactive", - "supportedImageNames": [ - "sagemaker-geospatial-v1-0" - ], - "vcpuNum": 0 - }, - { - "_defaultOrder": 21, - "_isFastLaunch": true, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 4, - "name": "ml.c5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 22, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.c5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 23, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.c5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 24, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.c5.4xlarge", - 
"vcpuNum": 16 - }, - { - "_defaultOrder": 25, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 72, - "name": "ml.c5.9xlarge", - "vcpuNum": 36 - }, - { - "_defaultOrder": 26, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 96, - "name": "ml.c5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 27, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 144, - "name": "ml.c5.18xlarge", - "vcpuNum": 72 - }, - { - "_defaultOrder": 28, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.c5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 29, - "_isFastLaunch": true, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.g4dn.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 30, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.g4dn.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 31, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.g4dn.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 32, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.g4dn.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 33, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.g4dn.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 34, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - 
"memoryGiB": 256, - "name": "ml.g4dn.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 35, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 61, - "name": "ml.p3.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 36, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 244, - "name": "ml.p3.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 37, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 488, - "name": "ml.p3.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 38, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.p3dn.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 39, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.r5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 40, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.r5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 41, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.r5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 42, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.r5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 43, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.r5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 44, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - 
"hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.r5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 45, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 512, - "name": "ml.r5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 46, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.r5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 47, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.g5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 48, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.g5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 49, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.g5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 50, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.g5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 51, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.g5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 52, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.g5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 53, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.g5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 54, - "_isFastLaunch": false, - "category": 
"Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.g5.48xlarge", - "vcpuNum": 192 - } - ], - "instance_type": "ml.t3.medium", - "interpreter": { - "hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9" - }, - "kernelspec": { - "display_name": "Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/pytorch-1.13-cpu-py39" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb b/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb deleted file mode 100644 index 841e87e101..0000000000 --- a/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb +++ /dev/null @@ -1,1844 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "9b08c378", - "metadata": { - "papermill": { - "duration": 0.018505, - "end_time": "2021-06-07T00:09:44.379517", - "exception": false, - "start_time": "2021-06-07T00:09:44.361012", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "# Customer Churn Prediction with XGBoost\n" - ] - }, - { - "cell_type": "markdown", - "id": "1b98b6df", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "id": "0bd14a6e", - "metadata": { - "papermill": { - "duration": 0.018505, - "end_time": "2021-06-07T00:09:44.379517", - "exception": false, - "start_time": "2021-06-07T00:09:44.361012", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "_**Using Gradient Boosted Trees to Predict Mobile Customer Departure**_\n", - "\n", - "---\n", - "\n", - "---\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 8 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [Background](#Background)\n", - "1. [Setup](#Setup)\n", - "1. [Data](#Data)\n", - "1. [Train](#Train)\n", - "1. [Host](#Host)\n", - " 1. [Evaluate](#Evaluate)\n", - " 1. [Relative cost of errors](#Relative-cost-of-errors)\n", - "1. [Extensions](#Extensions)\n", - "\n", - "---\n", - "\n", - "## Background\n", - "\n", - "_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_\n", - "\n", - "Losing customers is costly for any business. Identifying unhappy customers early on gives you a chance to offer them incentives to stay. This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.\n", - "\n", - "We use a familiar example of churn: leaving a mobile phone operator. Seems like one can always find fault with their provider du jour! 
And if the provider knows that a customer is thinking of leaving, it can offer timely incentives, such as a phone upgrade or a newly activated feature, and the customer may stick around. Incentives are often much more cost-effective than losing and reacquiring a customer.\n", - "\n", - "---\n", - "\n", - "## Setup\n", - "\n", - "_This notebook was created and tested on an `ml.m4.xlarge` notebook instance._\n", - "\n", - "Let's start by updating the required packages (the SageMaker Python SDK, `pandas`, and `numpy`) and specifying:\n", - "\n", - "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance or Studio, training, and hosting.\n", - "- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s)."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4f00baad", - "metadata": {}, - "outputs": [], - "source": [ - "import sys\n", - "\n", - "!{sys.executable} -m pip install sagemaker pandas numpy --upgrade\n", - "!pip3 install -U sagemaker" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e4c1b3c0", - "metadata": { - "isConfigCell": true, - "papermill": { - "duration": 1.209916, - "end_time": "2021-06-07T00:09:45.607159", - "exception": false, - "start_time": "2021-06-07T00:09:44.397243", - "status": "completed" - }, - "tags": [ - "parameters" - ] - }, - "outputs": [], - "source": [ - "import sagemaker\n", - "\n", - "sess = sagemaker.Session()\n", - "bucket = sess.default_bucket()\n", - "prefix = \"sagemaker/DEMO-xgboost-churn\"\n", - "\n", - "# Define IAM role\n", - "import boto3\n", - "import re\n", - "from sagemaker import get_execution_role\n", - "\n", - "role = get_execution_role()" - ] - }, - { - "cell_type": "markdown", - "id": "e02e6dbb", - "metadata": { - "papermill": { - "duration": 0.017739, - "end_time": "2021-06-07T00:09:45.683322", - "exception": false, - "start_time": "2021-06-07T00:09:45.665583", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Next, we'll import the Python libraries we'll need for the remainder of the example." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "08714702", - "metadata": { - "papermill": { - "duration": 0.666347, - "end_time": "2021-06-07T00:09:46.367361", - "exception": false, - "start_time": "2021-06-07T00:09:45.701014", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "import io\n", - "import os\n", - "import sys\n", - "import time\n", - "import json\n", - "from IPython.display import display\n", - "from time import strftime, gmtime\n", - "from sagemaker.inputs import TrainingInput\n", - "from sagemaker.serializers import CSVSerializer" - ] - }, - { - "cell_type": "markdown", - "id": "6c810d34", - "metadata": { - "papermill": { - "duration": 0.021555, - "end_time": "2021-06-07T00:09:46.406743", - "exception": false, - "start_time": "2021-06-07T00:09:46.385188", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "---\n", - "## Data\n", - "\n", - "Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes. After all, predicting the future is tricky business! But we'll learn how to deal with prediction errors.\n", - "\n", - "The dataset we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. 
Let's download and read that dataset in now:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2f01c890", - "metadata": { - "papermill": { - "duration": 1.671215, - "end_time": "2021-06-07T00:09:48.098151", - "exception": false, - "start_time": "2021-06-07T00:09:46.426936", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "s3 = boto3.client(\"s3\")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{sess.boto_region_name}\",\n", - " \"datasets/tabular/synthetic/churn.txt\",\n", - " \"churn.txt\",\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b89ecb3f", - "metadata": { - "papermill": { - "duration": 0.06925, - "end_time": "2021-06-07T00:09:48.185909", - "exception": false, - "start_time": "2021-06-07T00:09:48.116659", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "churn = pd.read_csv(\"./churn.txt\")\n", - "pd.set_option(\"display.max_columns\", 500)\n", - "churn" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2d3c3733", - "metadata": {}, - "outputs": [], - "source": [ - "len(churn.columns)" - ] - }, - { - "cell_type": "markdown", - "id": "a1380adb", - "metadata": { - "papermill": { - "duration": 0.019033, - "end_time": "2021-06-07T00:09:48.224277", - "exception": false, - "start_time": "2021-06-07T00:09:48.205244", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "By modern standards, it’s a relatively small dataset, with only 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. 
The attributes are:\n", - "\n", - "- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ\n", - "- `Account Length`: the number of days that this account has been active\n", - "- `Area Code`: the three-digit area code of the corresponding customer’s phone number\n", - "- `Phone`: the remaining seven-digit phone number\n", - "- `Int’l Plan`: whether the customer has an international calling plan: yes/no\n", - "- `VMail Plan`: whether the customer has a voice mail feature: yes/no\n", - "- `VMail Message`: the average number of voice mail messages per month\n", - "- `Day Mins`: the total number of calling minutes used during the day\n", - "- `Day Calls`: the total number of calls placed during the day\n", - "- `Day Charge`: the billed cost of daytime calls\n", - "- `Eve Mins`, `Eve Calls`, `Eve Charge`: the minutes used, number of calls, and billed cost for calls placed during the evening\n", - "- `Night Mins`, `Night Calls`, `Night Charge`: the minutes used, number of calls, and billed cost for calls placed during nighttime\n", - "- `Intl Mins`, `Intl Calls`, `Intl Charge`: the minutes used, number of calls, and billed cost for international calls\n", - "- `CustServ Calls`: the number of calls placed to Customer Service\n", - "- `Churn?`: whether the customer left the service: true/false\n", - "\n", - "The last attribute, `Churn?`, is known as the target attribute: the attribute that we want the ML model to predict. 
Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.\n", - "\n", - "Let's begin exploring the data:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a35b9f65", - "metadata": { - "papermill": { - "duration": 2.396119, - "end_time": "2021-06-07T00:09:50.639536", - "exception": false, - "start_time": "2021-06-07T00:09:48.243417", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Frequency tables for each categorical feature\n", - "for column in churn.select_dtypes(include=[\"object\"]).columns:\n", - " display(pd.crosstab(index=churn[column], columns=\"% observations\", normalize=\"columns\"))\n", - "\n", - "# Summary statistics and histograms for each numeric feature\n", - "display(churn.describe())\n", - "%matplotlib inline\n", - "hist = churn.hist(bins=30, sharey=True, figsize=(10, 10))" - ] - }, - { - "cell_type": "markdown", - "id": "2046fbb8", - "metadata": { - "papermill": { - "duration": 0.022357, - "end_time": "2021-06-07T00:09:50.685414", - "exception": false, - "start_time": "2021-06-07T00:09:50.663057", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We can see immediately that:\n", - "- `State` appears to be quite evenly distributed.\n", - "- `Phone` takes on too many unique values to be of any practical use. It's possible that parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it.\n", - "- Most of the numeric features are surprisingly nicely distributed, with many showing a bell-like, roughly Gaussian shape. `VMail Message` is a notable exception, and `Area Code` shows up as a numeric feature that we should convert to non-numeric." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "28552f05", - "metadata": { - "papermill": { - "duration": 0.030406, - "end_time": "2021-06-07T00:09:50.738287", - "exception": false, - "start_time": "2021-06-07T00:09:50.707881", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "churn = churn.drop(\"Phone\", axis=1)\n", - "churn[\"Area Code\"] = churn[\"Area Code\"].astype(object)" - ] - }, - { - "cell_type": "markdown", - "id": "197581c1", - "metadata": { - "papermill": { - "duration": 0.022422, - "end_time": "2021-06-07T00:09:50.783342", - "exception": false, - "start_time": "2021-06-07T00:09:50.760920", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Next let's look at the relationship between each of the features and our target variable." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5cee110f", - "metadata": { - "papermill": { - "duration": 4.645229, - "end_time": "2021-06-07T00:09:55.451149", - "exception": false, - "start_time": "2021-06-07T00:09:50.805920", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "for column in churn.select_dtypes(include=[\"object\"]).columns:\n", - " if column != \"Churn?\":\n", - " display(pd.crosstab(index=churn[column], columns=churn[\"Churn?\"], normalize=\"columns\"))\n", - "\n", - "for column in churn.select_dtypes(exclude=[\"object\"]).columns:\n", - " print(column)\n", - " hist = churn[[column, \"Churn?\"]].hist(by=\"Churn?\", bins=30)\n", - " plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f1e491a6", - "metadata": { - "papermill": { - "duration": 18.552066, - "end_time": "2021-06-07T00:10:14.041717", - "exception": false, - "start_time": "2021-06-07T00:09:55.489651", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "display(churn.corr(numeric_only=True))\n", - "pd.plotting.scatter_matrix(churn, figsize=(12, 12))\n", - "plt.show()" - 
] - }, - { - "cell_type": "markdown", - "id": "3217f3c5", - "metadata": { - "papermill": { - "duration": 0.050687, - "end_time": "2021-06-07T00:10:14.143830", - "exception": false, - "start_time": "2021-06-07T00:10:14.093143", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "We see several features that essentially have 100% correlation with one another. Including these feature pairs in some machine learning algorithms can create catastrophic problems, while in others it will only introduce minor redundancy and bias. Let's remove one feature from each of the highly correlated pairs: `Day Charge` from the pair with `Day Mins`, `Eve Charge` from the pair with `Eve Mins`, `Night Charge` from the pair with `Night Mins`, and `Intl Charge` from the pair with `Intl Mins`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c904a9d2", - "metadata": { - "papermill": { - "duration": 0.057009, - "end_time": "2021-06-07T00:10:14.251061", - "exception": false, - "start_time": "2021-06-07T00:10:14.194052", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "churn = churn.drop([\"Day Charge\", \"Eve Charge\", \"Night Charge\", \"Intl Charge\"], axis=1)" - ] - }, - { - "cell_type": "markdown", - "id": "a3ce9711", - "metadata": { - "papermill": { - "duration": 0.050512, - "end_time": "2021-06-07T00:10:14.352000", - "exception": false, - "start_time": "2021-06-07T00:10:14.301488", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Now that we've cleaned up our dataset, let's determine which algorithm to use. As mentioned above, there appear to be some variables where both high and low (but not intermediate) values are predictive of churn. In order to accommodate this in an algorithm like linear regression, we'd need to generate polynomial (or bucketed) terms. Instead, let's attempt to model this problem using gradient boosted trees. 
Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint. XGBoost uses gradient boosted trees which naturally account for non-linear relationships between features and the target variable, as well as accommodating complex interactions between features.\n", - "\n", - "Amazon SageMaker XGBoost can train on data in either a CSV or LibSVM format. For this example, we'll stick with CSV. It should:\n", - "- Have the target variable in the first column\n", - "- Not have a header row\n", - "\n", - "But first, let's convert our categorical features into numeric features." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8b3ea731", - "metadata": { - "papermill": { - "duration": 0.07096, - "end_time": "2021-06-07T00:10:14.473383", - "exception": false, - "start_time": "2021-06-07T00:10:14.402423", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "model_data = pd.get_dummies(churn)\n", - "model_data = pd.concat(\n", - " [model_data[\"Churn?_True.\"], model_data.drop([\"Churn?_False.\", \"Churn?_True.\"], axis=1)], axis=1\n", - ")\n", - "model_data = model_data.astype(float)" - ] - }, - { - "cell_type": "markdown", - "id": "664ad1dc", - "metadata": { - "papermill": { - "duration": 0.050777, - "end_time": "2021-06-07T00:10:14.574494", - "exception": false, - "start_time": "2021-06-07T00:10:14.523717", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "And now let's split the data into training, validation, and test sets. This will help prevent us from overfitting the model, and allow us to test the model's accuracy on data it hasn't already seen." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "298362cf", - "metadata": { - "papermill": { - "duration": 0.246303, - "end_time": "2021-06-07T00:10:14.871668", - "exception": false, - "start_time": "2021-06-07T00:10:14.625365", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "train_data, validation_data, test_data = np.split(\n", - " model_data.sample(frac=1, random_state=1729),\n", - " [int(0.7 * len(model_data)), int(0.9 * len(model_data))],\n", - ")\n", - "train_data.to_csv(\"train.csv\", header=False, index=False)\n", - "validation_data.to_csv(\"validation.csv\", header=False, index=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b6a5d489", - "metadata": {}, - "outputs": [], - "source": [ - "len(train_data.columns)" - ] - }, - { - "cell_type": "markdown", - "id": "31cd03d7", - "metadata": { - "papermill": { - "duration": 0.050591, - "end_time": "2021-06-07T00:10:14.972677", - "exception": false, - "start_time": "2021-06-07T00:10:14.922086", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Now we'll upload these files to S3." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5b8d288f", - "metadata": { - "papermill": { - "duration": 0.79455, - "end_time": "2021-06-07T00:10:15.817950", - "exception": false, - "start_time": "2021-06-07T00:10:15.023400", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n", - " os.path.join(prefix, \"train/train.csv\")\n", - ").upload_file(\"train.csv\")\n", - "boto3.Session().resource(\"s3\").Bucket(bucket).Object(\n", - " os.path.join(prefix, \"validation/validation.csv\")\n", - ").upload_file(\"validation.csv\")" - ] - }, - { - "cell_type": "markdown", - "id": "15beea62", - "metadata": { - "papermill": { - "duration": 0.050157, - "end_time": "2021-06-07T00:10:15.918579", - "exception": false, - "start_time": "2021-06-07T00:10:15.868422", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "---\n", - "## Train\n", - "\n", - "Moving onto training, first we'll need to specify the locations of the XGBoost algorithm containers." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "79682277", - "metadata": { - "papermill": { - "duration": 0.071985, - "end_time": "2021-06-07T00:10:16.040629", - "exception": false, - "start_time": "2021-06-07T00:10:15.968644", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "container = sagemaker.image_uris.retrieve(\"xgboost\", sess.boto_region_name, \"1.7-1\")\n", - "display(container)" - ] - }, - { - "cell_type": "markdown", - "id": "6be2c94d", - "metadata": { - "papermill": { - "duration": 0.050814, - "end_time": "2021-06-07T00:10:16.142405", - "exception": false, - "start_time": "2021-06-07T00:10:16.091591", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Then, because we're training with the CSV file format, we'll create `TrainingInput`s that our training function can use as a pointer to the files in S3." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fb3b53d1", - "metadata": { - "papermill": { - "duration": 0.05658, - "end_time": "2021-06-07T00:10:16.249848", - "exception": false, - "start_time": "2021-06-07T00:10:16.193268", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "s3_input_train = TrainingInput(\n", - " s3_data=\"s3://{}/{}/train\".format(bucket, prefix), content_type=\"csv\"\n", - ")\n", - "s3_input_validation = TrainingInput(\n", - " s3_data=\"s3://{}/{}/validation/\".format(bucket, prefix), content_type=\"csv\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "d0e18e91", - "metadata": { - "papermill": { - "duration": 0.050343, - "end_time": "2021-06-07T00:10:16.350919", - "exception": false, - "start_time": "2021-06-07T00:10:16.300576", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters. A few key hyperparameters are:\n", - "- `max_depth` controls how deep each tree within the algorithm can be built. Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting. There is typically some trade-off in model performance that needs to be explored between numerous shallow trees and a smaller number of deeper trees.\n", - "- `subsample` controls sampling of the training data. This technique can help reduce overfitting, but setting it too low can also starve the model of data.\n", - "- `num_round` controls the number of boosting rounds. This is essentially the subsequent models that are trained using the residuals of previous iterations. Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.\n", - "- `eta` controls how aggressive each round of boosting is. 
Smaller values lead to more conservative boosting.\n", - "- `gamma` controls how aggressively trees are grown. Larger values lead to more conservative models.\n", - "\n", - "More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3aea5a5c", - "metadata": { - "papermill": { - "duration": 252.035305, - "end_time": "2021-06-07T00:14:28.436818", - "exception": false, - "start_time": "2021-06-07T00:10:16.401513", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "sess = sagemaker.Session()\n", - "\n", - "xgb = sagemaker.estimator.Estimator(\n", - " container,\n", - " role,\n", - " instance_count=1,\n", - " instance_type=\"ml.m4.xlarge\",\n", - " output_path=\"s3://{}/{}/output\".format(bucket, prefix),\n", - " sagemaker_session=sess,\n", - ")\n", - "xgb.set_hyperparameters(\n", - " max_depth=5,\n", - " eta=0.2,\n", - " gamma=4,\n", - " min_child_weight=6,\n", - " subsample=0.8,\n", - " verbosity=0,\n", - " objective=\"binary:logistic\",\n", - " num_round=100,\n", - ")\n", - "\n", - "xgb.fit({\"train\": s3_input_train, \"validation\": s3_input_validation})" - ] - }, - { - "cell_type": "markdown", - "id": "171515b0", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "---\n", - "## Host\n", - "\n", - "Now that we've trained the algorithm, let's create a model and deploy it to a hosted endpoint." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8f0232f5", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "xgb_predictor = xgb.deploy(\n", - " initial_instance_count=1, instance_type=\"ml.m4.xlarge\", serializer=CSVSerializer()\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "29ab4cae", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "### Evaluate\n", - "\n", - "Now that we have a hosted endpoint running, we can make real-time predictions from our model very easily, simply by making an HTTP POST request. But first, we'll need to set up serializers and deserializers for passing our `test_data` NumPy arrays to the model behind the endpoint." - ] - }, - { - "cell_type": "markdown", - "id": "6f03c792", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "Now, we'll use a simple function to:\n", - "1. Loop over our test dataset\n", - "1. Split it into mini-batches of rows \n", - "1. Convert those mini-batches to CSV string payloads\n", - "1. Retrieve mini-batch predictions by invoking the XGBoost endpoint\n", - "1. 
Collect predictions and convert from the CSV output our model provides into a NumPy array" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "42d1317f", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "def predict(data, rows=500):\n", - " split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))\n", - " predictions = \"\"\n", - " for array in split_array:\n", - " predictions = \"\".join([predictions, xgb_predictor.predict(array).decode(\"utf-8\")])\n", - "\n", - " return predictions.split(\"\\n\")[:-1]\n", - "\n", - "\n", - "predictions = predict(test_data.to_numpy()[:, 1:])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "745e08d2", - "metadata": {}, - "outputs": [], - "source": [ - "predictions = np.array([float(num) for num in predictions])\n", - "print(predictions)" - ] - }, - { - "cell_type": "markdown", - "id": "b35e2bf7", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values. In this case, we're simply predicting whether the customer churned (`1`) or not (`0`), which produces a confusion matrix." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d69d58f4", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "pd.crosstab(\n", - " index=test_data.iloc[:, 0],\n", - " columns=np.round(predictions),\n", - " rownames=[\"actual\"],\n", - " colnames=[\"predictions\"],\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "58cc9077", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "_Note, due to randomized elements of the algorithm, your results may differ slightly._\n", - "\n", - "Of the 48 churners, we've correctly predicted 39 of them (true positives). We also incorrectly predicted 4 customers would churn who then ended up not doing so (false positives). There are also 9 customers who ended up churning, that we predicted would not (false negatives).\n", - "\n", - "An important point here is that because of the `np.round()` function above, we are using a simple threshold (or cutoff) of 0.5. Our predictions from `xgboost` yield continuous values between 0 and 1, and we force them into the binary classes that we began with. However, because a customer that churns is expected to cost the company more than proactively trying to retain a customer who we think might churn, we should consider lowering this cutoff. That will almost certainly increase the number of false positives, but it can also be expected to increase the number of true positives and reduce the number of false negatives.\n", - "\n", - "To get a rough intuition here, let's look at the continuous values of our predictions." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2cc8123e", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "plt.hist(predictions)\n", - "plt.xlabel(\"Predicted churn probability\")\n", - "plt.ylabel(\"Number of customers\")\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "55ce4027", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "The continuous valued predictions coming from our model tend to skew toward 0 or 1, but there is sufficient mass between 0.1 and 0.9 that adjusting the cutoff should indeed shift a number of customers' predictions. For example..." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dce5dca1", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(predictions > 0.3, 1, 0))" - ] - }, - { - "cell_type": "markdown", - "id": "18f2c2f1", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "We can see that lowering the cutoff from 0.5 to 0.3 results in 1 more true positive, 3 more false positives, and 1 fewer false negative. The numbers are small overall here, but that's 6-10% of customers overall that are shifting because of a change to the cutoff. Was this the right decision? We may end up retaining 3 extra customers, but we also unnecessarily incentivized 5 more customers who would have stayed anyway. 
Determining optimal cutoffs is a key step in properly applying machine learning in a real-world setting. Let's discuss this more broadly and then apply a specific, hypothetical solution for our current problem.\n", - "\n", - "### Relative cost of errors\n", - "\n", - "Any practical binary classification problem is likely to produce a similarly sensitive cutoff. That by itself isn’t a problem. After all, if the scores for two classes are really easy to separate, the problem probably isn’t very hard to begin with and might even be solvable with deterministic rules instead of ML.\n", - "\n", - "More important, if we put an ML model into production, there are costs associated with the model erroneously assigning false positives and false negatives. We also need to look at similar costs associated with correct predictions of true positives and true negatives. Because the choice of the cutoff affects all four of these statistics, we need to consider the relative costs to the business for each of these four outcomes for each prediction.\n", - "\n", - "#### Assigning costs\n", - "\n", - "What are the costs for our problem of mobile operator churn? The costs, of course, depend on the specific actions that the business takes. Let's make some assumptions here.\n", - "\n", - "First, assign the true negatives the cost of \\$0. Our model essentially correctly identified a happy customer in this case, and we don’t need to do anything.\n", - "\n", - "False negatives are the most problematic, because they incorrectly predict that a churning customer will stay. We lose the customer and will have to pay all the costs of acquiring a replacement customer, including foregone revenue, advertising costs, administrative costs, point of sale costs, and likely a phone hardware subsidy. A quick search on the Internet reveals that such costs typically run in the hundreds of dollars so, for the purposes of this example, let's assume \\$500. 
This is the cost of false negatives.\n", - "\n", - "Finally, for customers that our model identifies as churning, let's assume a retention incentive in the amount of \\\\$100. If a provider offered a customer such a concession, they may think twice before leaving. This is the cost of both true positive and false positive outcomes. In the case of false positives (the customer is happy, but the model mistakenly predicted churn), we will “waste” the \\\\$100 concession. We probably could have spent that \\$100 more effectively, but it's possible we increased the loyalty of an already loyal customer, so that’s not so bad." - ] - }, - { - "cell_type": "markdown", - "id": "a51ea034", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "#### Finding the optimal cutoff\n", - "\n", - "It’s clear that false negatives are substantially more costly than false positives. Instead of optimizing for error based on the number of customers, we should be minimizing a cost function that looks like this:\n", - "\n", - "```\n", - "$500 * FN(C) + $0 * TN(C) + $100 * FP(C) + $100 * TP(C)\n", - "```\n", - "\n", - "FN(C) means that the false negative percentage is a function of the cutoff, C, and similar for TN, FP, and TP. We need to find the cutoff, C, where the result of the expression is smallest.\n", - "\n", - "A straightforward way to do this is to simply run a simulation over numerous possible cutoffs. We test 100 possible values in the for-loop below." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "324c9f5c", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "cutoffs = np.arange(0.01, 1, 0.01)\n", - "costs = []\n", - "for c in cutoffs:\n", - " costs.append(\n", - " np.sum(\n", - " np.sum(\n", - " np.array([[0, 100], [500, 100]])\n", - " * pd.crosstab(index=test_data.iloc[:, 0], columns=np.where(predictions > c, 1, 0))\n", - " )\n", - " )\n", - " )\n", - "\n", - "costs = np.array(costs)\n", - "plt.plot(cutoffs, costs)\n", - "plt.xlabel(\"Cutoff\")\n", - "plt.ylabel(\"Cost\")\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ae213bd8", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "print(\n", - " \"Cost is minimized near a cutoff of:\",\n", - " cutoffs[np.argmin(costs)],\n", - " \"for a cost of:\",\n", - " np.min(costs),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "54e86315", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "The above chart shows how picking a threshold too low results in costs skyrocketing as all customers are given a retention incentive. Meanwhile, setting the threshold too high results in too many lost customers, which ultimately grows to be nearly as costly. The overall cost can be minimized at \\\\$8400 by setting the cutoff to 0.46, which is substantially better than the \\$20k+ we would expect to lose by not taking any action." 
- ] - }, - { - "cell_type": "markdown", - "id": "ce4a0e5b", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "---\n", - "## Extensions\n", - "\n", - "This notebook showcased how to build a model that predicts whether a customer is likely to churn, and then how to optimally set a threshold that accounts for the cost of true positives, false positives, and false negatives. There are several ways to extend it, including:\n", - "- Some customers who receive retention incentives will still churn. Including a probability of churning despite receiving an incentive in our cost function would provide a better ROI on our retention programs.\n", - "- Customers who switch to a lower-priced plan or who deactivate a paid feature represent different kinds of churn that could be modeled separately.\n", - "- Modeling the evolution of customer behavior. If usage is dropping and the number of calls placed to Customer Service is increasing, you are more likely to experience churn than if the trend is the opposite. A customer profile should incorporate behavior trends.\n", - "- Actual training data and monetary cost assignments could be more complex.\n", - "- Multiple models for each type of churn could be needed.\n", - "\n", - "Regardless of additional complexity, principles similar to those described in this notebook are likely to apply." - ] - }, - { - "cell_type": "markdown", - "id": "ced6f363", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "### (Optional) Clean-up\n", - "\n", - "If you're ready to be done with this notebook, please run the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "16febdfe", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "xgb_predictor.delete_endpoint()" - ] - }, - { - "cell_type": "markdown", - "id": "f32cb035", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|xgboost_customer_churn|xgboost_customer_churn.ipynb)\n" - ] - } - ], - "metadata": { - "availableInstances": [ - { - "_defaultOrder": 0, - "_isFastLaunch": true, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 4, - "name": "ml.t3.medium", - "vcpuNum": 2 - }, - { - "_defaultOrder": 1, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.t3.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 2, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.t3.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 3, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.t3.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 4, - "_isFastLaunch": true, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.m5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 5, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.m5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 6, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.m5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 7, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.m5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 8, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": 
false, - "memoryGiB": 128, - "name": "ml.m5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 9, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.m5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 10, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.m5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 11, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.m5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 12, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.m5d.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 13, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.m5d.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 14, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.m5d.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 15, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.m5d.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 16, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.m5d.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 17, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.m5d.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 18, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - 
"memoryGiB": 256, - "name": "ml.m5d.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 19, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.m5d.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 20, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": true, - "memoryGiB": 0, - "name": "ml.geospatial.interactive", - "supportedImageNames": [ - "sagemaker-geospatial-v1-0" - ], - "vcpuNum": 0 - }, - { - "_defaultOrder": 21, - "_isFastLaunch": true, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 4, - "name": "ml.c5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 22, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.c5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 23, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.c5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 24, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.c5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 25, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 72, - "name": "ml.c5.9xlarge", - "vcpuNum": 36 - }, - { - "_defaultOrder": 26, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 96, - "name": "ml.c5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 27, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 144, - "name": "ml.c5.18xlarge", - "vcpuNum": 72 - }, - { - "_defaultOrder": 28, - "_isFastLaunch": false, - "category": 
"Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.c5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 29, - "_isFastLaunch": true, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.g4dn.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 30, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.g4dn.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 31, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.g4dn.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 32, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.g4dn.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 33, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.g4dn.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 34, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.g4dn.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 35, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 61, - "name": "ml.p3.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 36, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 244, - "name": "ml.p3.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 37, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 488, - "name": "ml.p3.16xlarge", - "vcpuNum": 64 - }, - { - 
"_defaultOrder": 38, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.p3dn.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 39, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.r5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 40, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.r5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 41, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.r5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 42, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.r5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 43, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.r5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 44, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.r5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 45, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 512, - "name": "ml.r5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 46, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.r5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 47, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.g5.xlarge", - "vcpuNum": 4 - }, - 
{ - "_defaultOrder": 48, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.g5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 49, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.g5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 50, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.g5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 51, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.g5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 52, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.g5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 53, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.g5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 54, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.g5.48xlarge", - "vcpuNum": 192 - }, - { - "_defaultOrder": 55, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 1152, - "name": "ml.p4d.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 56, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 1152, - "name": "ml.p4de.24xlarge", - "vcpuNum": 96 - } - ], - "celltoolbar": "Tags", - "kernelspec": { - "display_name": "Python 3 (Data Science 3.0)", - "language": "python", - "name": 
"python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-310-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - }, - "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.", - "papermill": { - "default_parameters": {}, - "duration": 311.728138, - "end_time": "2021-06-07T00:14:55.273560", - "environment_variables": {}, - "exception": true, - "input_path": "xgboost_customer_churn.ipynb", - "output_path": "/opt/ml/processing/output/xgboost_customer_churn-2021-06-07-00-06-03.ipynb", - "parameters": { - "kms_key": "arn:aws:kms:us-west-2:521695447989:key/6e9984db-50cf-4c7e-926c-877ec47a8b25" - }, - "start_time": "2021-06-07T00:09:43.545422", - "version": "2.3.3" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.ipynb b/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.ipynb deleted file mode 100644 index 3b2255959c..0000000000 --- a/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability.ipynb +++ /dev/null @@ -1,1361 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Fairness and Explainability with SageMaker Clarify" - ] - }, - { - 
"attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Runtime\n", - "\n", - "This notebook takes approximately 30 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [Overview](#Overview)\n", - "1. [Prerequisites and Data](#Prerequisites-and-Data)\n", - " 1. [Import Libraries](#Import-Libraries)\n", - " 1. [Set Configurations](#Set-Configurations)\n", - " 1. [Download data](#Download-data)\n", - " 1. [Loading the data: Adult Dataset](#Loading-the-data:-Adult-Dataset) \n", - " 1. [Data inspection](#Data-inspection) \n", - " 1. [Encode and Upload the Dataset](#Encode-and-Upload-the-Dataset) \n", - "1. [Train and Deploy XGBoost Model](#Train-XGBoost-Model)\n", - " 1. [Train Model](#Train-Model)\n", - " 1. [Create Model](#Create-Model)\n", - "1. [Amazon SageMaker Clarify](#Amazon-SageMaker-Clarify)\n", - " 1. [Detecting Bias](#Detecting-Bias)\n", - " 1. [Writing DataConfig](#Writing-DataConfig)\n", - " 1. [Writing ModelConfig](#Writing-ModelConfig)\n", - " 1. [Writing ModelPredictedLabelConfig](#Writing-ModelPredictedLabelConfig)\n", - " 1. [Writing BiasConfig](#Writing-BiasConfig)\n", - " 1. [Pre-training Bias](#Pre-training-Bias)\n", - " 1. [Post-training Bias](#Post-training-Bias)\n", - " 1. [Viewing the Bias Report](#Viewing-the-Bias-Report)\n", - " 1. [Explaining Predictions](#Explaining-Predictions)\n", - " 1. 
[Viewing the Explainability Report](#Viewing-the-Explainability-Report)\n", - " 1. [Analysis of local explanations](#Analysis-of-local-explanations)\n", - "1. [Clean Up](#Clean-Up)\n", - "\n", - "## Overview\n", - "Amazon SageMaker Clarify helps improve your machine learning models by detecting potential bias and helping explain how these models make predictions. The fairness and explainability functionality provided by SageMaker Clarify takes a step towards enabling AWS customers to build trustworthy and understandable machine learning models. The product comes with the tools to help you with the following tasks.\n", - "\n", - "* Measure biases that can occur during each stage of the ML lifecycle (data collection, model training and tuning, and monitoring of ML models deployed for inference).\n", - "* Generate model governance reports targeting risk and compliance teams and external regulators.\n", - "* Provide explanations of the data, models, and monitoring used to assess predictions.\n", - "\n", - "This sample notebook walks you through: \n", - "1. Key terms and concepts needed to understand SageMaker Clarify\n", - "1. Measuring the pre-training bias of a dataset and post-training bias of a model\n", - "1. Explaining the importance of the various input features on the model's decision\n", - "1. Accessing the reports through SageMaker Studio if you have an instance set up.\n", - "\n", - "In doing so, the notebook first trains a [SageMaker XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) model using the training dataset, then uses the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) to launch SageMaker Clarify jobs to analyze an example dataset in CSV format. 
\n", - "\n", - "SageMaker Clarify also supports analyzing datasets in [SageMaker JSON Lines dense format](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#common-in-formats), which is illustrated in [another notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/fairness_and_explainability/fairness_and_explainability_jsonlines_format.ipynb). Additionally, there is a [peer example available](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/fairness_and_explainability/fairness_and_explainability_boto3.ipynb) that uses the [AWS SDK for Python](https://aws.amazon.com/sdk-for-python/) to launch SageMaker Clarify jobs to analyze data in CSV format." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites and Data\n", - "### Import Libraries" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "import os\n", - "import boto3\n", - "from datetime import datetime\n", - "from sagemaker import get_execution_role, session" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Set Configurations" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Region: ap-south-1\n", - "Role: arn:aws:iam::000000000000:role/service-role/SMClarifySageMaker-ExecutionRole\n" - ] - } - ], - "source": [ - "# Initialize the SageMaker session\n", - "sagemaker_session = session.Session()\n", - "\n", - "region = sagemaker_session.boto_region_name\n", - "print(f\"Region: {region}\")\n", - "\n", - "role = get_execution_role()\n", - "print(f\"Role: {role}\")\n", - "\n", - "bucket = sagemaker_session.default_bucket()\n", - "\n", - "prefix = \"sagemaker/DEMO-sagemaker-clarify\"" - ] - }, - { - "attachments": 
{}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Download data\n", - "Data Source: [https://archive.ics.uci.edu/ml/machine-learning-databases/adult/](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/)\n", - "\n", - "Let's __download__ the data from the UCI repository$^{[2]}$ and save it in the local folder as adult.data and adult.test.\n", - "\n", - "$^{[2]}$ Dua Dheeru and Efi Karra Taniskidou. \"[UCI Machine Learning Repository](http://archive.ics.uci.edu/ml)\". Irvine, CA: University of California, School of Information and Computer Science (2017)." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "adult.data already on disk.\n", - "adult.test already on disk.\n" - ] - } - ], - "source": [ - "from sagemaker.s3 import S3Downloader\n", - "\n", - "adult_columns = [\n", - "    \"Age\",\n", - "    \"Workclass\",\n", - "    \"fnlwgt\",\n", - "    \"Education\",\n", - "    \"Education-Num\",\n", - "    \"Marital Status\",\n", - "    \"Occupation\",\n", - "    \"Relationship\",\n", - "    \"Ethnic group\",\n", - "    \"Sex\",\n", - "    \"Capital Gain\",\n", - "    \"Capital Loss\",\n", - "    \"Hours per week\",\n", - "    \"Country\",\n", - "    \"Target\",\n", - "]\n", - "if not os.path.isfile(\"adult.data\"):\n", - "    S3Downloader.download(\n", - "        s3_uri=\"s3://{}/{}\".format(\n", - "            f\"sagemaker-example-files-prod-{region}\", \"datasets/tabular/uci_adult/adult.data\"\n", - "        ),\n", - "        local_path=\"./\",\n", - "        sagemaker_session=sagemaker_session,\n", - "    )\n", - "    print(\"adult.data saved!\")\n", - "else:\n", - "    print(\"adult.data already on disk.\")\n", - "\n", - "if not os.path.isfile(\"adult.test\"):\n", - "    S3Downloader.download(\n", - "        s3_uri=\"s3://{}/{}\".format(\n", - "            f\"sagemaker-example-files-prod-{region}\", \"datasets/tabular/uci_adult/adult.test\"\n", - "        ),\n", - "        local_path=\"./\",\n", - "        sagemaker_session=sagemaker_session,\n", - "    
)\n", - "    print(\"adult.test saved!\")\n", - "else:\n", - "    print(\"adult.test already on disk.\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Loading the data: Adult Dataset\n", - "This dataset, taken from the UCI repository of machine learning datasets, contains 14 demographic features for 45,222 rows (32,561 for training and 12,661 for testing). The task is to predict whether a person has a yearly income that is more or less than $50,000.\n", - "\n", - "Here are the features and their possible values:\n", - "\n", - "1. **Age**: continuous.\n", - "1. **Workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n", - "1. **Fnlwgt**: continuous (the number of people the census takers believe that observation represents).\n", - "1. **Education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.\n", - "1. **Education-num**: continuous.\n", - "1. **Marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.\n", - "1. **Occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.\n", - "1. **Relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.\n", - "1. **Ethnic group**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.\n", - "1. **Sex**: Female, Male.\n", - "    * **Note**: this data is extracted from the 1994 Census and enforces a binary option on Sex.\n", - "1. **Capital-gain**: continuous.\n", - "1. **Capital-loss**: continuous.\n", - "1. **Hours-per-week**: continuous.\n", - "1. 
**Native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.\n", - "\n", - "Next, we specify our binary prediction task: \n", - "\n", - "15. **Target**: <=50,000, >$50,000." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "scrolled": true - }, - "outputs": [ - { - "data": { - "text/html": [ - "
<div>\n", - "<table>\n", - "  <thead>\n",
- "    <tr><th></th><th>Age</th><th>Workclass</th><th>fnlwgt</th><th>Education</th><th>Education-Num</th><th>Marital Status</th><th>Occupation</th><th>Relationship</th><th>Ethnic group</th><th>Sex</th><th>Capital Gain</th><th>Capital Loss</th><th>Hours per week</th><th>Country</th><th>Target</th></tr>\n",
- "  </thead>\n", - "  <tbody>\n",
- "    <tr><th>0</th><td>39</td><td>State-gov</td><td>77516</td><td>Bachelors</td><td>13</td><td>Never-married</td><td>Adm-clerical</td><td>Not-in-family</td><td>White</td><td>Male</td><td>2174</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
- "    <tr><th>1</th><td>50</td><td>Self-emp-not-inc</td><td>83311</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>13</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
- "    <tr><th>2</th><td>38</td><td>Private</td><td>215646</td><td>HS-grad</td><td>9</td><td>Divorced</td><td>Handlers-cleaners</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
- "    <tr><th>3</th><td>53</td><td>Private</td><td>234721</td><td>11th</td><td>7</td><td>Married-civ-spouse</td><td>Handlers-cleaners</td><td>Husband</td><td>Black</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>&lt;=50K</td></tr>\n",
- "    <tr><th>4</th><td>28</td><td>Private</td><td>338409</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>Cuba</td><td>&lt;=50K</td></tr>\n",
- "  </tbody>\n", - "</table>\n", - "</div>
" - ], - "text/plain": [ - " Age Workclass fnlwgt Education Education-Num \\\n", - "0 39 State-gov 77516 Bachelors 13 \n", - "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", - "2 38 Private 215646 HS-grad 9 \n", - "3 53 Private 234721 11th 7 \n", - "4 28 Private 338409 Bachelors 13 \n", - "\n", - " Marital Status Occupation Relationship Ethnic group Sex \\\n", - "0 Never-married Adm-clerical Not-in-family White Male \n", - "1 Married-civ-spouse Exec-managerial Husband White Male \n", - "2 Divorced Handlers-cleaners Not-in-family White Male \n", - "3 Married-civ-spouse Handlers-cleaners Husband Black Male \n", - "4 Married-civ-spouse Prof-specialty Wife Black Female \n", - "\n", - " Capital Gain Capital Loss Hours per week Country Target \n", - "0 2174 0 40 United-States <=50K \n", - "1 0 0 13 United-States <=50K \n", - "2 0 0 40 United-States <=50K \n", - "3 0 0 40 United-States <=50K \n", - "4 0 0 40 Cuba <=50K " - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "training_data = pd.read_csv(\n", - " \"adult.data\", names=adult_columns, sep=r\"\\s*,\\s*\", engine=\"python\", na_values=\"?\"\n", - ").dropna()\n", - "\n", - "testing_data = pd.read_csv(\n", - " \"adult.test\", names=adult_columns, sep=r\"\\s*,\\s*\", engine=\"python\", na_values=\"?\", skiprows=1\n", - ").dropna()\n", - "\n", - "training_data.head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Data inspection\n", - "Plotting histograms for the distribution of the different features is a good way to visualize the data. Let's plot a few of the features that can be considered _sensitive_. \n", - "Let's take a look specifically at the Sex feature of a census respondent. In the first plot we see that there are fewer Female respondents as a whole but especially in the positive outcomes, where they form ~$\\frac{1}{7}$th of respondents." 
- ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "scrolled": true - }, - "outputs": [ - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAEICAYAAABfz4NwAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8/fFQqAAAACXBIWXMAAAsTAAALEwEAmpwYAAAY2UlEQVR4nO3de7RedX3n8fenhCJqoVyONE2iQYm2kKlhJWZQq4uu2BIvFZwBDdMKtKwVZeFYl3ZmQG11tJkRFWmZJdg4MARGuYyIsCpUEarUEcGDRsK1hItyTAYOiBiqRBO/88fzO/pw8uTcc04u79dae539fPfvt/fv4XI+z/7t/ZydqkKSpF+b6QFIknYOBoIkCTAQJEmNgSBJAgwESVJjIEiSAANB2ikkOSTJTUk2JTl7psejPZOBoF1Skv+QpD/JU0k2Jrkuye9Pw3EryWE7YNcrgceA/arqPT2OOzfJlUkeS/JkknVJTtkB49AebNZMD0AaryTvBs4A3g58CfgZsBw4Fvj6DA5tMl4A3FXb/6boJcB3W7vNwL8BfmuaxqY9RVW5uOwyC7A/8BRwwght9gH+FtjQlr8F9mnbTgG+Pqx9AYe19YuATwJfBDYBtwAvattuam3/tY3hLcDBwD8APwJ+CPwz8GvbGdcrgG8BT7afr+g65s/pBNtTwGt69H0KWDTCez4K+EYbx3eBo7uO+Rgwr71+aWvzOzP979Jl51ucMtKu5uXAs4CrRmjzPjq/IBfR+QW4FHj/OI5xIvBfgQOA9cAqgKp6ddv+0qp6blVdDrwHGAD6gEOA99IJjWdIciCdkDkXOAj4BPDFJAdV1SnAZ4CPtv1+pceYvgl8MsmKJM8ftu85bd9/AxwI/CVwZZK+qvoG8PfAmiT70jnTeH9V3TOOfx7aQxgI2tUcBDxWVVtGaPMnwIeq6tGqGqTzy/2t4zjG56vq1naMz9AJlu35OTAbeEFV/byq/rmqek37vB64r6ouqaotVXUpcA/wx2Mc0wl0zj7+CngwydokL2vb/hS4tqqurapfVNX1QD/wurb9g3TOrG6lc8b0yTEeU3sYA0G7mseBg5OMdP3rt4Hvdb3+XquN1f/rWv8J8NwR2n6MzlnEl5M8kOSMMY5paFxzxjKgqnqiqs6oqiPonImsBb6QJHSuK5yQ5EdDC/D7dIKKqvo5nWmphcDZ2wksyUDQLudm4GnguBHabKDzS3LI81sNOvP/zx7akGRSF2aralNVvaeqXkjn0/67kywbw5iGxvWDCRzzMeDjdELmQOBh4JKq+s2u5TlV9RH45ZTSB4D/BZydZJ/xHlN7BgNBu5SqehL4azrz6ccleXaSvZO8NslHW7NLgfcn6UtycGv/v9u27wJHJFmU5Fl0plPG4xHghUMvkrwhyWHtk/qPga1tGe5a4MXtdtlZSd4CHE7ngvSokpyVZGHr+xvAacD6qnq8vbc/TnJMkr2SPCvJ0e1W1dA5O7gAOBXYCHx4nO9ZewgDQbucqvoE8G46F4oH6XxCfgfwhdbkb+jMod8OrAO+3WpU1b8AHwK+AtzH+G9T/SCdC7Q/SvJmYEHb11N0zl7Oq6qv9hjz48Ab6FyEfhz4z8Ab2qf9sXg2nQvpPwIeoHO28ca274fp3HL7Xn71z+M/0fn/+510ppj+qk0V/RnwZ0leNc73rT1AnE6UJIFnCJKkxkCQJAEGgiSpMRAkScAu/MftDj744Jo/
f/5MD0OSdim33XbbY1XV12vbLhsI8+fPp7+/f6aHIUm7lCTDvzH/S04ZSZIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSYyBIkoAxfFM5yTzgYuC3gF8Aq6vq75IcCFwOzAceAt5cVU+0PmfSeTrTVuCdVfWlVl9M5+lN+9J5gtRfVFW1R/pdDCym8/CQt1TVQ1P2LiVtY/4ZX5zpIexWHvrI62d6CJM2ljOELcB7qup3gaOA05McDpwB3FBVC4Ab2mvathXAEcBy4Lwke7V9nQ+spPOUqQVtO3TC44mqOgw4BzhrCt6bJGkcRg2EqtpYVd9u65uAu4E5dB7Zt6Y1W8OvHnp+LHBZVW2uqgeB9cDSJLOB/arq5vYov4uH9Rna1+eAZe1ZsJKkaTKuawhJ5gNHArcAh1TVRuiEBvC81mwOnWe6DhlotTltfXj9GX2qagvwJHDQeMYmSZqcMQdCkucCVwLvqqofj9S0R61GqI/UZ/gYVibpT9I/ODg42pAlSeMwpkBIsjedMPhMVX2+lR9p00C0n4+2+gAwr6v7XGBDq8/tUX9GnySzgP2BHw4fR1WtrqolVbWkr6/nn/OWJE3QqIHQ5vIvAO6uqk90bboGOLmtnwxc3VVfkWSfJIfSuXh8a5tW2pTkqLbPk4b1GdrX8cCN7TqDJGmajOUBOa8E3gqsS7K21d4LfAS4IsmpwPeBEwCq6s4kVwB30blD6fSq2tr6ncavbju9ri3QCZxLkqync2awYnJvS5I0XqMGQlV9nd5z/ADLttNnFbCqR70fWNij/jQtUCRJM8NvKkuSAANBktQYCJIkwECQJDUGgiQJMBAkSY2BIEkCDARJUmMgSJIAA0GS1BgIkiTAQJAkNQaCJAkwECRJjYEgSQIMBElSYyBIkoCxPVP5wiSPJrmjq3Z5krVteWjo0ZpJ5if5ade2T3X1WZxkXZL1Sc5tz1WmPXv58la/Jcn8qX+bkqTRjOUM4SJgeXehqt5SVYuqahFwJfD5rs33D22rqrd31c8HVgIL2jK0z1OBJ6rqMOAc4KyJvBFJ0uSMGghVdROdB99vo33KfzNw6Uj7SDIb2K+qbq6qAi4GjmubjwXWtPXPAcuGzh4kSdNnstcQXgU8UlX3ddUOTfKdJF9L8qpWmwMMdLUZaLWhbQ8DVNUW4EngoF4HS7IySX+S/sHBwUkOXZLUbbKBcCLPPDvYCDy/qo4E3g18Nsl+QK9P/NV+jrTtmcWq1VW1pKqW9PX1TWLYkqThZk20Y5JZwL8DFg/VqmozsLmt35bkfuDFdM4I5nZ1nwtsaOsDwDxgoO1zf7YzRSVJ2nEmc4bwGuCeqvrlVFCSviR7tfUX0rl4/EBVbQQ2JTmqXR84Cbi6dbsGOLmtHw/c2K4zSJKm0VhuO70UuBl4SZKBJKe2TSvY9mLyq4Hbk3yXzgXit1fV0Kf904D/CawH7geua/ULgIOSrKczzXTGJN6PJGmCRp0yqqoTt1M/pUftSjq3ofZq3w8s7FF/GjhhtHFIknYsv6ksSQIMBElSYyBIkgADQZLUGAiSJMBAkCQ1BoIkCTAQJEmNgSBJAgwESVJjIEiSAANBktQYCJIkwECQJDUGgiQJMBAkSc1Ynph2YZJHk9zRVftgkh8kWduW13VtOzPJ+iT3Jjmmq744ybq27dz2KE2S7JPk8la/Jcn8KX6PkqQxGMsZwkXA8h71c6pqUVuuBUhyOJ1Hax7R+pw39Ixl4HxgJZ3nLC/o2uepwBNVdRhwDnDWBN+LJGkSRg2EqroJ+OFo7ZpjgcuqanNVPUjn+clLk8wG9quqm6uqgIuB47r6rGnrnwOWDZ09SJKmz2SuIbwjye1tSumAVpsDPNzVZqDV5rT14fVn9KmqLcCTwEG9DphkZZL+JP2Dg4OTGLokabiJBsL5wIuARcBG4OxW7/XJvkaoj9Rn22LV6qpaUlVL+vr6xjVgSdLIJhQIVfVIVW2tql8AnwaWtk0D
[remaining base64-encoded PNG data omitted: bar chart titled 'Counts of Sex']", - "text/plain": [ - "" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "%matplotlib inline\n", - "training_data[\"Sex\"].value_counts().sort_values().plot(kind=\"bar\", title=\"Counts of Sex\", rot=0)" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "scrolled": true - }, - "outputs": [ - { - "data": { - "text/plain": [ - "<AxesSubplot:title={'center':'Counts of Sex earning >$50K'}>" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "[base64-encoded PNG data omitted: bar chart titled 'Counts of Sex earning >$50K']", - "text/plain": [ - "" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "training_data[\"Sex\"].where(training_data[\"Target\"] == \">50K\").value_counts().sort_values().plot(\n", - " kind=\"bar\", title=\"Counts of Sex earning >$50K\", rot=0\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Encode and Upload the Dataset\n", - "Here we encode the training and test data. Encoding input data is not necessary for SageMaker Clarify, but is necessary for the model." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn import preprocessing\n", - "\n", - "\n", - "def number_encode_features(df):\n", - " result = df.copy()\n", - " encoders = {}\n", - " for column in result.columns:\n", - " if result.dtypes[column] == object:\n", - " encoders[column] = preprocessing.LabelEncoder()\n", - " result[column] = encoders[column].fit_transform(result[column].fillna(\"None\"))\n", - " return result, encoders\n", - "\n", - "\n", - "training_data = pd.concat([training_data[\"Target\"], training_data.drop([\"Target\"], axis=1)], axis=1)\n", - "training_data, _ = number_encode_features(training_data)\n", - "training_data.to_csv(\"train_data.csv\", index=False, header=False)\n", - "\n", - "testing_data, _ = number_encode_features(testing_data)\n", - "test_features = testing_data.drop([\"Target\"], axis=1)\n", - "test_target = testing_data[\"Target\"]\n", - "test_features.to_csv(\"test_features.csv\", index=False, header=False)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A quick note about our encoding: the \"Female\" Sex value has been encoded as 0 and \"Male\" as 1." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " Target Age Workclass fnlwgt Education Education-Num Marital Status \\\n", - "0 0 39 5 77516 9 13 4 \n", - "1 0 50 4 83311 9 13 2 \n", - "2 0 38 2 215646 11 9 0 \n", - "3 0 53 2 234721 1 7 2 \n", - "4 0 28 2 338409 9 13 2 \n", - "\n", - " Occupation Relationship Ethnic group Sex Capital Gain Capital Loss \\\n", - "0 0 1 4 1 2174 0 \n", - "1 3 0 4 1 0 0 \n", - "2 5 1 4 1 0 0 \n", - "3 5 0 2 1 0 0 \n", - "4 9 5 2 0 0 0 \n", - "\n", - " Hours per week Country \n", - "0 40 38 \n", - "1 13 38 \n", - "2 40 38 \n", - "3 40 38 \n", - "4 40 4 " - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "training_data.head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Lastly, let's upload the data to S3." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.s3 import S3Uploader\n", - "from sagemaker.inputs import TrainingInput\n", - "\n", - "train_uri = S3Uploader.upload(\"train_data.csv\", \"s3://{}/{}\".format(bucket, prefix))\n", - "train_input = TrainingInput(train_uri, content_type=\"csv\")\n", - "test_uri = S3Uploader.upload(\"test_features.csv\", \"s3://{}/{}\".format(bucket, prefix))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Train XGBoost Model\n", - "#### Train Model\n", - "Since our focus is on understanding how to use SageMaker Clarify, we keep it simple by using a standard XGBoost model.\n", - "\n", - "It takes about 5 minutes for the model to be trained." 
- ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-02-07-05-54-36-442\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "2023-02-07 05:54:36 Starting - Starting the training job..\n", - "2023-02-07 05:54:50 Starting - Preparing the instances for training........\n", - "2023-02-07 05:55:32 Downloading - Downloading input data....\n", - "2023-02-07 05:55:57 Training - Downloading the training image...\n", - "2023-02-07 05:56:18 Training - Training image download completed. Training in progress.......\n", - "2023-02-07 05:56:53 Uploading - Uploading generated training model.\n", - "2023-02-07 05:57:04 Completed - Training job completed\n" - ] - } - ], - "source": [ - "from sagemaker.image_uris import retrieve\n", - "from sagemaker.estimator import Estimator\n", - "\n", - "# This references the AWS managed XGBoost container\n", - "xgboost_image_uri = retrieve(\"xgboost\", region, version=\"1.5-1\")\n", - "\n", - "xgb = Estimator(\n", - " xgboost_image_uri,\n", - " role,\n", - " instance_count=1,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " disable_profiler=True,\n", - " sagemaker_session=sagemaker_session,\n", - ")\n", - "\n", - "xgb.set_hyperparameters(\n", - " max_depth=5,\n", - " eta=0.2,\n", - " gamma=4,\n", - " min_child_weight=6,\n", - " subsample=0.8,\n", - " objective=\"binary:logistic\",\n", - " num_round=800,\n", - ")\n", - "\n", - "xgb.fit({\"train\": train_input}, logs=False)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Create Model\n", - "Here we create the SageMaker model." 
- ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:sagemaker:Creating model with name: DEMO-clarify-model-07-02-2023-05-57-08\n" - ] - }, - { - "data": { - "text/plain": [ - "'DEMO-clarify-model-07-02-2023-05-57-08'" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "model_name = \"DEMO-clarify-model-{}\".format(datetime.now().strftime(\"%d-%m-%Y-%H-%M-%S\"))\n", - "model = xgb.create_model(name=model_name)\n", - "container_def = model.prepare_container_def()\n", - "sagemaker_session.create_model(model_name, role, container_def)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Amazon SageMaker Clarify\n", - "With your model set up, it's time to explore SageMaker Clarify. For a general overview of how SageMaker Clarify processing jobs work, refer to [the SageMaker Clarify documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-configure-how-it-works.html)." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: 1.0.\n", - "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n" - ] - } - ], - "source": [ - "from sagemaker import clarify\n", - "\n", - "# Initialize a SageMakerClarifyProcessor to compute bias metrics and model explanations.\n", - "clarify_processor = clarify.SageMakerClarifyProcessor(\n", - " role=role, instance_count=1, instance_type=\"ml.m5.xlarge\", sagemaker_session=sagemaker_session\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Detecting Bias\n", - "SageMaker Clarify helps you detect possible [pre-training](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-data-bias.html) and [post-training](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-post-training-bias.html) biases using a variety of metrics.\n", - "\n", - "#### Writing DataConfig\n", - "A [DataConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.DataConfig) object communicates some basic information about data I/O to SageMaker Clarify. In this example we provide the following:\n", - "\n", - "* `s3_data_input_path`: S3 URI of the training dataset we uploaded above\n", - "* `s3_output_path`: S3 URI to which the output report will be uploaded\n", - "* `label`: Specifies the ground truth label, also known as the observed label or target attribute. It is used for many bias metrics. 
In this example, the `Target` column holds the ground truth label.\n", - "* `headers`: The list of column names in the dataset\n", - "* `dataset_type`: The format of your dataset. This example uses a CSV dataset, so the value is `text/csv`." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "bias_report_output_path = \"s3://{}/{}/clarify-bias\".format(bucket, prefix)\n", - "bias_data_config = clarify.DataConfig(\n", - " s3_data_input_path=train_uri,\n", - " s3_output_path=bias_report_output_path,\n", - " label=\"Target\",\n", - " headers=training_data.columns.to_list(),\n", - " dataset_type=\"text/csv\",\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Writing ModelConfig\n", - "\n", - "A [ModelConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.ModelConfig) object communicates information about your trained model. To avoid additional traffic to the production models, SageMaker Clarify sets up and tears down a dedicated endpoint when processing. In this example we provide the following:\n", - "\n", - "* `model_name`: Name of the model to analyze; here, the XGBoost model created earlier.\n", - "* `instance_type` and `instance_count`: The instance type and instance count used to host your model during SageMaker Clarify's processing. The example dataset is small, so a single standard instance is enough.\n", - "* `accept_type` denotes the endpoint response payload format, and `content_type` denotes the payload format of requests to the endpoint. For the example model created above, both are `text/csv`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "model_config = clarify.ModelConfig(\n", - " model_name=model_name,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " instance_count=1,\n", - " accept_type=\"text/csv\",\n", - " content_type=\"text/csv\",\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Writing ModelPredictedLabelConfig\n", - "\n", - "A [ModelPredictedLabelConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.ModelPredictedLabelConfig) provides information on the format of your predictions. The XGBoost model outputs a probability for each sample, so SageMaker Clarify invokes the endpoint and then uses `probability_threshold` to convert each probability to a binary label for bias analysis. A prediction above the threshold is interpreted as label value `1`; a prediction at or below the threshold is interpreted as label value `0`." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "predictions_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.8)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Writing BiasConfig\n", - "[BiasConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.BiasConfig) contains configuration values for detecting bias using a Clarify container." 
- ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "bias_config = clarify.BiasConfig(\n", - " label_values_or_threshold=[1], facet_name=\"Sex\", facet_values_or_threshold=[0], group_name=\"Age\"\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For our demo we provide the following information in `BiasConfig`:\n", - "\n", - "* `label_values_or_threshold`: List of label values, or a threshold, that indicates the positive outcome used for bias metrics. Here the positive outcome is earning >$50,000, which is encoded as `1`.\n", - "* `facet_name`: The sensitive column of the dataset; here, \"Sex\".\n", - "* `facet_values_or_threshold`: Values of the sensitive group. Here the value is `0`, the encoding of \"Female\", so female respondents are the sensitive group.\n", - "* `group_name`: The column used to form subgroups when measuring the bias metric [Conditional Demographic Disparity (CDD)](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-data-bias-metric-cddl.html) or [Conditional Demographic Disparity in Predicted Labels (CDDPL)](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-cddpl.html). This example uses the \"Age\" column.\n", - "\n", - "SageMaker Clarify can handle both categorical and continuous data for `facet_values_or_threshold` and for `label_values_or_threshold`. In this case we are using categorical data. The results will show whether the model has a preference for records of one sex over the other." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Pre-training Bias\n", - "Bias can be present in your data before any model training occurs. 
Inspecting your data for bias before training begins can help detect any data collection gaps, inform your feature engineering, and help you understand what societal biases the data may reflect.\n", - "\n", - "Computing pre-training bias metrics does not require a trained model.\n", - "\n", - "#### Post-training Bias\n", - "Computing post-training bias metrics does require a trained model.\n", - "\n", - "Unbiased training data (as determined by the fairness concepts that the bias metrics measure) may still result in biased model predictions after training. Whether this occurs depends on several factors, including hyperparameter choices.\n", - "\n", - "\n", - "You can run these options separately with [run_pre_training_bias()](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SageMakerClarifyProcessor.run_pre_training_bias) and [run_post_training_bias()](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SageMakerClarifyProcessor.run_post_training_bias) or at the same time with `run_bias()` as shown below. We use the following additional parameters for the API call:\n", - "\n", - "* `pre_training_methods`: Pre-training bias metrics to be computed. A detailed description of the metrics can be found in [Measure Pre-training Bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-measure-data-bias.html). This example sets methods to \"all\" to compute all the pre-training bias metrics.\n", - "* `post_training_methods`: Post-training bias metrics to be computed. A detailed description of the metrics can be found in [Measure Post-training Bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-post-training-bias.html). This example sets methods to \"all\" to compute all the post-training bias metrics." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The job takes about 10 minutes to run\n", - "clarify_processor.run_bias(\n", - " data_config=bias_data_config,\n", - " bias_config=bias_config,\n", - " model_config=model_config,\n", - " model_predicted_label_config=predictions_config,\n", - " pre_training_methods=\"all\",\n", - " post_training_methods=\"all\",\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Viewing the Bias Report\n", - "In Studio, you can view the results under the experiments tab.\n", - "\n", - "\n", - "\n", - "Each bias metric has detailed explanations with examples that you can explore.\n", - "\n", - "\n", - "\n", - "You could also summarize the results in a handy table!\n", - "\n", - "\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you're not a Studio user yet, you can access the bias report in PDF, HTML and ipynb formats at the following S3 location:" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'s3://sagemaker-ap-south-1-000000000000/sagemaker/DEMO-sagemaker-clarify/clarify-bias'" - ] - }, - "execution_count": 18, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "bias_report_output_path" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Explaining Predictions\n", - "There are expanding business needs and legislative regulations that require explanations of _why_ a model made the decision it did. SageMaker Clarify uses Kernel SHAP to explain the contribution that each input feature makes to the final decision.\n", - "\n", - "The `run_explainability` API call needs `DataConfig` and `ModelConfig` objects similar to those defined above. 
[SHAPConfig](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SHAPConfig) is the config class for the Kernel SHAP algorithm.\n", - "\n", - "For our demo we pass the following information in `SHAPConfig`:\n", - "\n", - "* `baseline`: The Kernel SHAP algorithm requires a baseline (also known as a background dataset). If not provided, a baseline is calculated automatically by SageMaker Clarify using K-means or K-prototypes on the input dataset. The baseline dataset type must be the same as `dataset_type`, and baseline samples must include only features. The baseline is either an S3 URI of the baseline dataset file or an in-place list of samples. In this case we chose the latter and put a single sample, the mean of each feature column of the training dataset, in the list. For more details on baseline selection, refer to [this documentation](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html).\n", - "* `num_samples`: Number of samples to be used in the Kernel SHAP algorithm. This number determines the size of the synthetic dataset generated to compute the SHAP values.\n", - "* `agg_method`: Aggregation method for global SHAP values. This example uses `mean_abs`, i.e. the mean of absolute SHAP values over all instances.\n", - "* `save_local_shap_values`: Indicates whether to save the local SHAP values in the output location. Default is True." 
- ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [], - "source": [ - "explainability_output_path = \"s3://{}/{}/clarify-explainability\".format(bucket, prefix)\n", - "explainability_data_config = clarify.DataConfig(\n", - " s3_data_input_path=train_uri,\n", - " s3_output_path=explainability_output_path,\n", - " label=\"Target\",\n", - " headers=training_data.columns.to_list(),\n", - " dataset_type=\"text/csv\",\n", - ")\n", - "\n", - "baseline = [training_data.mean().iloc[1:].values.tolist()]\n", - "shap_config = clarify.SHAPConfig(\n", - " baseline=baseline,\n", - " num_samples=15,\n", - " agg_method=\"mean_abs\",\n", - " save_local_shap_values=True,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# The job takes about 10 minutes to run\n", - "clarify_processor.run_explainability(\n", - " data_config=explainability_data_config,\n", - " model_config=model_config,\n", - " explainability_config=shap_config,\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Viewing the Explainability Report\n", - "As with the bias report, you can view the explainability report in Studio under the experiments tab.\n", - "\n", - "\n", - "\n", - "\n", - "The Model Insights tab contains direct links to the report and model insights.\n", - "\n", - "If you're not a Studio user yet, as with the Bias Report, you can access this report at the following S3 bucket." 
- ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'s3://sagemaker-ap-south-1-000000000000/sagemaker/DEMO-sagemaker-clarify/clarify-explainability'" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "explainability_output_path" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Analysis of local explanations\n", - "You can visualize the local explanations for single examples in your dataset by reusing the results obtained from running the Kernel SHAP algorithm for global explanations.\n", - "\n", - "Load the local explanations stored in your output path and visualize the explanation (i.e., the impact that each feature has on your model's prediction) for any single example." - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Example number: 111 \n", - "with model prediction: False\n", - "\n", - "Feature values -- Label Target 0\n", - "Age 21\n", - "Workclass 2\n", - "fnlwgt 199915\n", - "Education 15\n", - "Education-Num 10\n", - "Marital Status 4\n", - "Occupation 7\n", - "Relationship 3\n", - "Ethnic group 4\n", - "Sex 0\n", - "Capital Gain 0\n", - "Capital Loss 0\n", - "Hours per week 40\n", - "Country 38\n", - "Name: 120, dtype: int64\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 22, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": 
"(base64-encoded PNG omitted: bar chart titled Local explanation for the example number 111)", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "local_explanations_out = pd.read_csv(explainability_output_path + \"/explanations_shap/out.csv\")\n", - "feature_names = [c.replace(\"_label0\", \"\") for c in local_explanations_out.columns]\n", - "local_explanations_out.columns = feature_names\n", - "\n", - "selected_example = 111\n", - "print(\n", - " \"Example number:\",\n", - " selected_example,\n", - " \"\\nwith model prediction:\",\n", - " sum(local_explanations_out.iloc[selected_example]) > 0,\n", - ")\n", - "print(\"\\nFeature values -- Label\", training_data.iloc[selected_example])\n", - "local_explanations_out.iloc[selected_example].plot(\n", - " kind=\"bar\", title=\"Local explanation for the example number \" + str(selected_example), rot=90\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Note:** You can run both bias and explainability jobs at the same time with `run_bias_and_explainability()`. See the [API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.clarify.SageMakerClarifyProcessor.run_bias_and_explainability) for more details." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Clean Up\n", - "Finally, don't forget to clean up the resources we set up and used for this demo!" - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "INFO:sagemaker:Deleting model with name: DEMO-clarify-model-07-02-2023-05-57-08\n" - ] - } - ], - "source": [ - "sagemaker_session.delete_model(model_name)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. 
The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-clarify|fairness_and_explainability|fairness_and_explainability.ipynb)\n" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (Data Science 3.0)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/sagemaker-lineage/sagemaker-lineage-multihop-queries.ipynb b/sagemaker-lineage/sagemaker-lineage-multihop-queries.ipynb deleted file mode 100644 index 9941703ee2..0000000000 --- a/sagemaker-lineage/sagemaker-lineage-multihop-queries.ipynb +++ /dev/null @@ -1,1094 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "cb187715", - "metadata": {}, - "source": [ - "# Amazon SageMaker Multi-hop Lineage Queries\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "66fa3294", - "metadata": {}, - 
"source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "367041e5", - "metadata": {}, - "source": [ - "\n", - "Amazon SageMaker Lineage tracks events that happen within SageMaker, allowing the relationships between them to be traced via a graph structure. SageMaker Lineage introduces a new API called `LineageQuery` that allows customers to query the lineage graph structure to discover relationships across their machine learning entities. \n", - "\n", - "Your machine learning workflows can generate deeply nested relationships, and the lineage APIs allow you to answer questions about these relationships. For example, you can find all Data Sets that trained the model deployed to a given Endpoint, or find all Models trained by a Data Set.\n", - "\n", - "The lineage graph is created automatically by SageMaker, and you can also directly create or modify your own lineage.\n", - "\n", - "In addition to the `LineageQuery` API, the SageMaker SDK provides wrapper functions that make it easy to run queries that span multiple hops of the entity relationship graph. These APIs and helper functions are described in this notebook.\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 15 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [Key Concepts](#Key-Concepts)\n", - "1. [Prerequisites](#Prerequisites)\n", - "1. [Notebook Overview](#Notebook-Overview)\n", - "1. [Create an Experiment and Trial for a training job](#Create-an-Experiment-and-Trial-for-a-training-job)\n", - "1. 
[Training Data](#Training-Data)\n", - "1. [Create a training job](#Create-a-training-job)\n", - "1. [Create a Model Package Group for the trained model to be registered](#Create-a-Model-Package-Group-for-the-trained-model-to-be-registered)\n", - "1. [Register the model in the Model Registry](#Register-the-model-in-the-Model-Registry)\n", - "1. [Deploy the model to a SageMaker Endpoint](#Deploy-the-model-to-a-SageMaker-Endpoint)\n", - "1. [SageMaker Lineage Queries](#SageMaker-Lineage-Queries)\n", - " 1. [Using the LineageQuery API to find entity associations](#Using-the-LineageQuery-API-to-find-entity-associations)\n", - " 1. [Find all datasets associated with an Endpoint](#Find-all-datasets-associated-with-an-Endpoint)\n", - " 1. [Find the models associated with an Endpoint](#Find-the-models-associated-with-an-Endpoint)\n", - " 1. [Find the trial components associated with an Endpoint](#Find-the-trial-components-associated-with-an-Endpoint)\n", - " 1. [Change the focal point of lineage](#Change-the-focal-point-of-lineage)\n", - " 1. [Use LineageQueryDirectionEnum.BOTH](#Use-LineageQueryDirectionEnum.BOTH)\n", - " 1. [Directions in LineageQuery: Ascendants vs. Descendants](#Directions-in-LineageQuery:-Ascendants-vs.-Descendants)\n", - " 1. [SDK helper functions](#SDK-helper-functions)\n", - " 1. [Lineage Graph Visualization](#Lineage-Graph-Visualization)\n", - "1. [Conclusion](#Conclusion)\n", - "1. [Cleanup](#Cleanup)\n", - "\n", - "\n", - "## Key Concepts\n", - "\n", - "* **Lineage Graph** - A connected graph tracing your machine learning workflow end to end. \n", - "* **Artifacts** - Represents a URI addressable object or data. Artifacts are typically inputs or outputs to Actions. \n", - "* **Actions** - Represents an action taken such as a computation, transformation, or job. 
\n", - "* **Contexts** - Provides a method to logically group other entities.\n", - "* **Associations** - A directed edge in the lineage graph that links two entities.\n", - "* **Lineage Traversal** - Starting from an arbitrary point, trace the lineage graph to discover and analyze relationships between steps in your workflow.\n", - "* **Experiments** - Experiment entities (Experiments, Trials, and Trial Components) are also part of the lineage graph and can be associated with Artifacts, Actions, or Contexts." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "25d4a00f", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "\n", - "[`sagemaker-experiments`](https://github.com/aws/sagemaker-experiments) and [`pyvis`](https://pyvis.readthedocs.io/en/latest/) are two Python libraries that need to be installed as part of this notebook execution. `pyvis` is a library designed for interactive network visualization and `sagemaker-experiments` gives users the ability to use SageMaker's Experiment Tracking capabilities. \n", - "\n", - "This notebook should be run with `Python 3.9` using the SageMaker Studio `Python3 (Data Science)` kernel. The `sagemaker` SDK version required for this notebook is `>2.70.0`.\n", - "\n", - "If running in SageMaker Classic Notebooks, use the `conda_python3` kernel. \n", - "\n", - "The AWS account running this notebook should have access to provision two instances of type `ml.m5.xlarge`. These instances are used for training and deploying a model." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "0fee7359", - "metadata": {}, - "source": [ - "Let's start by installing the Python SDK, boto and AWS CLI."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "93adbfe7", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install sagemaker botocore boto3 awscli --upgrade" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "69886125", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install sagemaker-experiments pyvis" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "c6cf2db5", - "metadata": {}, - "source": [ - "## Notebook Overview\n", - "\n", - "This notebook demonstrates how to use SageMaker Lineage APIs to query multi-hop relationships across the lineage graph. Multi-hop relationships are those that span beyond single entity relationships, e.g. Model -> Endpoint, Training Job -> Model. Multi-hop queries allow users to search for distant relationships across the Lineage Graph such as Endpoint -> Data Set.\n", - "\n", - "To demonstrate these capabilities, in this notebook we create a training job, register a model to the Model Registry, and deploy the model to an Endpoint. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "26efdda2", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "import boto3\n", - "import sagemaker\n", - "import pprint\n", - "from botocore.config import Config\n", - "\n", - "config = Config(retries={\"max_attempts\": 50, \"mode\": \"adaptive\"})\n", - "\n", - "sagemaker_session = sagemaker.Session()\n", - "sm_client = sagemaker_session.sagemaker_client\n", - "\n", - "region = sagemaker_session.boto_region_name\n", - "\n", - "default_bucket = sagemaker_session.default_bucket()\n", - "role = sagemaker.get_execution_role()\n", - "\n", - "# Helper function to print query outputs\n", - "pp = pprint.PrettyPrinter()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9c40701a", - "metadata": {}, - "outputs": [], - "source": [ - "from datetime import datetime\n", - "\n", - "training_instance_type = \"ml.m5.xlarge\"\n", - "inference_instance_type = \"ml.m5.xlarge\"\n", - "s3_prefix = \"multihop-example\"\n", - "\n", - "unique_id = str(datetime.now().timestamp()).split(\".\")[0]" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "6c51f513", - "metadata": {}, - "source": [ - "## Create an Experiment and Trial for a training job" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8718c000", - "metadata": {}, - "outputs": [], - "source": [ - "from smexperiments.experiment import Experiment\n", - "from smexperiments.trial import Trial\n", - "from smexperiments.trial_component import TrialComponent\n", - "\n", - "experiment_name = f\"MultihopQueryExperiment-{unique_id}\"\n", - "exp = Experiment.create(experiment_name=experiment_name, sagemaker_boto_client=sm_client)\n", - "\n", - "trial = Trial.create(\n", - " experiment_name=exp.experiment_name,\n", - " trial_name=f\"MultihopQueryTrial-{unique_id}\",\n", - " sagemaker_boto_client=sm_client,\n", - ")\n", - "\n", - "print(exp.experiment_name)\n", - "print(trial.trial_name)" - ] 
- }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f63f088c", - "metadata": {}, - "source": [ - "## Training Data\n", - "\n", - "We create a `data/` directory to store the preprocessed [UCI Abalone](https://archive.ics.uci.edu/ml/datasets/abalone) dataset. The preprocessing is done using the script defined in the [Orchestrating Jobs with Amazon SageMaker Model Building Pipelines](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb) notebook. The training and validation data are then uploaded to S3 so that they can be used in the training and inference jobs." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4d020ac3", - "metadata": {}, - "outputs": [], - "source": [ - "default_bucket" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "c54bdc99", - "metadata": {}, - "outputs": [], - "source": [ - "if not os.path.exists(\"./data/\"):\n", - " os.makedirs(\"./data/\")\n", - " print(\"Directory Created \")\n", - "else:\n", - " print(\"Directory already exists\")\n", - "\n", - "# Download the processed abalone dataset files\n", - "s3 = boto3.client(\"s3\")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{region}\",\n", - " \"datasets/tabular/uci_abalone/preprocessed/test.csv\",\n", - " \"./data/test.csv\",\n", - ")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{region}\",\n", - " \"datasets/tabular/uci_abalone/preprocessed/train.csv\",\n", - " \"./data/train.csv\",\n", - ")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{region}\",\n", - " \"datasets/tabular/uci_abalone/preprocessed/validation.csv\",\n", - " \"./data/validation.csv\",\n", - ")\n", - "\n", - "# Upload the datasets to the SageMaker session default bucket\n", - "boto3.Session().resource(\"s3\").Bucket(default_bucket).Object(\n", - " 
\"experiments-demo/train.csv\"\n", - ").upload_file(\"data/train.csv\")\n", - "boto3.Session().resource(\"s3\").Bucket(default_bucket).Object(\n", - " \"experiments-demo/validation.csv\"\n", - ").upload_file(\"data/validation.csv\")\n", - "\n", - "training_data = f\"s3://{default_bucket}/experiments-demo/train.csv\"\n", - "validation_data = f\"s3://{default_bucket}/experiments-demo/validation.csv\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "660c9e25", - "metadata": {}, - "source": [ - "## Create a training job\n", - "\n", - "We train a simple XGBoost model on the Abalone dataset. \n", - "`sagemaker.image_uris.retrieve()` is used to get the sagemaker container for XGBoost so that it can be used in the Estimator. \n", - "\n", - "In the `.fit()` function, we pass in a training and validation dataset along with an `experiment_config`. The `experiment_config` ensures that the metrics, parameters, and artifats associated with this training job are logged to the experiment and trial created above. 
\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8fed64de", - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.estimator import Estimator\n", - "\n", - "model_path = f\"s3://{default_bucket}/{s3_prefix}/xgb_model\"\n", - "training_instance_type = \"ml.m5.large\"\n", - "\n", - "image_uri = sagemaker.image_uris.retrieve(\n", - " framework=\"xgboost\",\n", - " region=region,\n", - " version=\"1.5-1\",\n", - " py_version=\"py3\",\n", - " instance_type=training_instance_type,\n", - ")\n", - "\n", - "xgb_train = Estimator(\n", - " image_uri=image_uri,\n", - " instance_type=training_instance_type,\n", - " instance_count=1,\n", - " output_path=model_path,\n", - " sagemaker_session=sagemaker_session,\n", - " role=role,\n", - ")\n", - "\n", - "xgb_train.set_hyperparameters(\n", - " objective=\"reg:squarederror\",\n", - " num_round=50,\n", - " max_depth=5,\n", - " eta=0.2,\n", - " gamma=4,\n", - " min_child_weight=6,\n", - " subsample=0.7,\n", - " verbosity=0,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5285ba3d", - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.inputs import TrainingInput\n", - "\n", - "xgb_train.fit(\n", - " inputs={\n", - " \"train\": TrainingInput(\n", - " s3_data=training_data,\n", - " content_type=\"text/csv\",\n", - " ),\n", - " \"validation\": TrainingInput(\n", - " s3_data=validation_data,\n", - " content_type=\"text/csv\",\n", - " ),\n", - " },\n", - " experiment_config={\n", - " \"ExperimentName\": experiment_name,\n", - " \"TrialName\": trial.trial_name,\n", - " \"TrialComponentDisplayName\": \"MultiHopQueryTrialComponent\",\n", - " },\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "ce43b815", - "metadata": {}, - "source": [ - "## Create a Model Package Group for the trained model to be registered\n", - "\n", - "Create a new Model Package Group or use an existing one to register the model." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "17e9f1e0", - "metadata": {}, - "outputs": [], - "source": [ - "model_package_group_name = \"lineage-test-\" + unique_id\n", - "mpg = sm_client.create_model_package_group(ModelPackageGroupName=model_package_group_name)\n", - "mpg_arn = mpg[\"ModelPackageGroupArn\"]" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "d17d04c0", - "metadata": {}, - "source": [ - "## Register the model in the Model Registry\n", - "Once the model is registered, it appears in the Model Registry tab of the SageMaker Studio UI. In this notebook, the model is registered with the `approval_status` set to \"Approved\". By default, a model is registered with the `approval_status` set to \"PendingManualApproval\", in which case users can navigate to the Model Registry to manually approve the model based on any criteria set for model evaluation, or approve it via the API. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "38ab67a1", - "metadata": {}, - "outputs": [], - "source": [ - "inference_instance_type = \"ml.m5.xlarge\"\n", - "model_package = xgb_train.register(\n", - " model_package_group_name=mpg_arn,\n", - " inference_instances=[inference_instance_type],\n", - " transform_instances=[inference_instance_type],\n", - " content_types=[\"text/csv\"],\n", - " response_types=[\"text/csv\"],\n", - " approval_status=\"Approved\",\n", - ")\n", - "\n", - "model_package_arn = model_package.model_package_arn\n", - "print(\"Model Package ARN : \", model_package_arn)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "570f9d6c", - "metadata": {}, - "source": [ - "## Deploy the model to a SageMaker Endpoint\n", - "\n", - "A SageMaker Endpoint is used to host a model that can be used for inference. The type of endpoint deployed in this notebook is a real-time inference endpoint. This is ideal for inference workloads where you have real-time, interactive, low-latency requirements."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8433e1e9", - "metadata": {}, - "outputs": [], - "source": [ - "endpoint_name = \"lineage-test-endpoint-\" + unique_id\n", - "model_package.deploy(\n", - " endpoint_name=endpoint_name,\n", - " initial_instance_count=1,\n", - " instance_type=inference_instance_type,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "17178ffe", - "metadata": {}, - "outputs": [], - "source": [ - "# Get the endpoint ARN\n", - "endpoint_arn = sm_client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointArn\"]\n", - "print(endpoint_arn)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "1b73bd20", - "metadata": {}, - "source": [ - "## SageMaker Lineage Queries\n", - "\n", - "We explore SageMaker's lineage capabilities to traverse the relationships between the entities created in this notebook - datasets, model, endpoint, and training job. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fc2b4ef0", - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.lineage.context import Context, EndpointContext\n", - "from sagemaker.lineage.action import Action\n", - "from sagemaker.lineage.association import Association\n", - "from sagemaker.lineage.artifact import Artifact, ModelArtifact, DatasetArtifact\n", - "\n", - "from sagemaker.lineage.query import (\n", - " LineageQuery,\n", - " LineageFilter,\n", - " LineageSourceEnum,\n", - " LineageEntityEnum,\n", - " LineageQueryDirectionEnum,\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "093e985e", - "metadata": {}, - "source": [ - "### Using the LineageQuery API to find entity associations\n", - "\n", - "In this section we use two APIs, `LineageQuery` and `LineageFilter` to construct queries to answer questions about the Lineage Graph and extract entity relationships. 
\n", - "\n", - "LineageQuery parameters:\n", - "* `start_arns`: A list of ARNs that is used as the starting point for the query.\n", - "* `direction`: The direction of the query.\n", - "* `include_edges`: If true, return edges in addition to vertices.\n", - "* `query_filter`: The query filter.\n", - "\n", - "LineageFilter parameters:\n", - "* `entities`: A list of entity types (Artifact, Association, Action) to filter for when returning the results of `LineageQuery`\n", - "* `sources`: A list of source types (Endpoint, Model, Dataset) to filter for when returning the results of `LineageQuery`\n", - "\n", - "A `Context` is automatically created when a SageMaker Endpoint is created, and an `Artifact` is automatically created when a Model is created in SageMaker. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a30c455b", - "metadata": {}, - "outputs": [], - "source": [ - "# Find the endpoint context and model artifact that should be used for the lineage queries.\n", - "\n", - "contexts = Context.list(source_uri=endpoint_arn)\n", - "context_name = list(contexts)[0].context_name\n", - "endpoint_context = EndpointContext.load(context_name=context_name)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "9963e76e", - "metadata": {}, - "source": [ - "#### Find all datasets associated with an Endpoint" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dfde258b", - "metadata": {}, - "outputs": [], - "source": [ - "# Define the LineageFilter to look for entities of type `ARTIFACT` and the source of type `DATASET`.\n", - "\n", - "query_filter = LineageFilter(\n", - " entities=[LineageEntityEnum.ARTIFACT], sources=[LineageSourceEnum.DATASET]\n", - ")\n", - "\n", - "# Providing this `LineageFilter` to the `LineageQuery` constructs a query that traverses through the given context `endpoint_context`\n", - "# and finds all datasets.\n", - "\n", - "query_result = LineageQuery(sagemaker_session).query(\n", - " 
start_arns=[endpoint_context.context_arn],\n", - " query_filter=query_filter,\n", - " direction=LineageQueryDirectionEnum.ASCENDANTS,\n", - " include_edges=False,\n", - ")\n", - "\n", - "# Parse through the query results to get the lineage objects corresponding to the datasets\n", - "dataset_artifacts = []\n", - "for vertex in query_result.vertices:\n", - " dataset_artifacts.append(vertex.to_lineage_object().source.source_uri)\n", - "\n", - "pp.pprint(dataset_artifacts)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "7dab1c4a", - "metadata": {}, - "source": [ - "#### Find the models associated with an Endpoint" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6294fc97", - "metadata": {}, - "outputs": [], - "source": [ - "# Define the LineageFilter to look for entities of type `ARTIFACT` and the source of type `MODEL`.\n", - "\n", - "query_filter = LineageFilter(\n", - " entities=[LineageEntityEnum.ARTIFACT], sources=[LineageSourceEnum.MODEL]\n", - ")\n", - "\n", - "# Providing this `LineageFilter` to the `LineageQuery` constructs a query that traverses through the given context `endpoint_context`\n", - "# and finds all models.\n", - "\n", - "query_result = LineageQuery(sagemaker_session).query(\n", - " start_arns=[endpoint_context.context_arn],\n", - " query_filter=query_filter,\n", - " direction=LineageQueryDirectionEnum.ASCENDANTS,\n", - " include_edges=False,\n", - ")\n", - "\n", - "# Parse through the query results to get the lineage objects corresponding to the model\n", - "model_artifacts = []\n", - "for vertex in query_result.vertices:\n", - " model_artifacts.append(vertex.to_lineage_object().source.source_uri)\n", - "\n", - "# The results of the `LineageQuery` API call return the ARN of the model deployed to the endpoint along with\n", - "# the S3 URI to the model.tar.gz file associated with the model\n", - "pp.pprint(model_artifacts)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": 
"4fa79344", - "metadata": {}, - "source": [ - "#### Find the trial components associated with an Endpoint" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d417bf3a", - "metadata": {}, - "outputs": [], - "source": [ - "# Define the LineageFilter to look for entities of type `TRIAL_COMPONENT` and the source of type `TRAINING_JOB`.\n", - "\n", - "query_filter = LineageFilter(\n", - " entities=[LineageEntityEnum.TRIAL_COMPONENT],\n", - " sources=[LineageSourceEnum.TRAINING_JOB],\n", - ")\n", - "\n", - "# Providing this `LineageFilter` to the `LineageQuery` constructs a query that traverses through the given context `endpoint_context`\n", - "# and find all datasets.\n", - "\n", - "query_result = LineageQuery(sagemaker_session).query(\n", - " start_arns=[endpoint_context.context_arn],\n", - " query_filter=query_filter,\n", - " direction=LineageQueryDirectionEnum.ASCENDANTS,\n", - " include_edges=False,\n", - ")\n", - "\n", - "# Parse through the query results to get the ARNs of the training jobs associated with this Endpoint\n", - "trial_components = []\n", - "for vertex in query_result.vertices:\n", - " trial_components.append(vertex.arn)\n", - "\n", - "pp.pprint(trial_components)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "9954748f", - "metadata": {}, - "source": [ - "#### Change the focal point of lineage\n", - "\n", - "The `LineageQuery` can be modified to have different `start_arns` which changes the focal point of lineage. In addition, the `LineageFilter` can take multiple sources and entities to expand the scope of the query. 
\n", - "\n", - "**Here we use the model as the lineage focal point and find the Endpoints and Datasets associated with it.**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c28d8ea", - "metadata": {}, - "outputs": [], - "source": [ - "# Get the ModelArtifact\n", - "\n", - "model_artifact_summary = list(Artifact.list(source_uri=model_package_arn))[0]\n", - "model_artifact = ModelArtifact.load(artifact_arn=model_artifact_summary.artifact_arn)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ca86919e", - "metadata": {}, - "outputs": [], - "source": [ - "query_filter = LineageFilter(\n", - " entities=[LineageEntityEnum.ARTIFACT],\n", - " sources=[LineageSourceEnum.ENDPOINT, LineageSourceEnum.DATASET],\n", - ")\n", - "\n", - "query_result = LineageQuery(sagemaker_session).query(\n", - " start_arns=[model_artifact.artifact_arn], # Model is the starting artifact\n", - " query_filter=query_filter,\n", - " # Find all the entities that descend from the model, i.e. the endpoint\n", - " direction=LineageQueryDirectionEnum.DESCENDANTS,\n", - " include_edges=False,\n", - ")\n", - "\n", - "associations = []\n", - "for vertex in query_result.vertices:\n", - " associations.append(vertex.to_lineage_object().source.source_uri)\n", - "\n", - "query_result = LineageQuery(sagemaker_session).query(\n", - " start_arns=[model_artifact.artifact_arn], # Model is the starting artifact\n", - " query_filter=query_filter,\n", - " # Find all the entities that ascend from the model, i.e. 
the datasets\n", - " direction=LineageQueryDirectionEnum.ASCENDANTS,\n", - " include_edges=False,\n", - ")\n", - "\n", - "for vertex in query_result.vertices:\n", - " associations.append(vertex.to_lineage_object().source.source_uri)\n", - "\n", - "pp.pprint(associations)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "eaa41ff9", - "metadata": {}, - "source": [ - "#### Use LineageQueryDirectionEnum.BOTH\n", - "\n", - "When the direction is set to `BOTH`, the query traverses the graph to find both ascendant and descendant relationships, and the traversal takes place not only from the starting node but from each node that is visited. \n", - "\n", - "For example, if the training job is run twice and both models generated by the training job are deployed to endpoints, the result of the query with direction set to `BOTH` shows both endpoints. This is because the same image is used for training and deploying the model. Since the image is common to the model (`start_arn`) and both the endpoints, it appears in the query result. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f4bee658", - "metadata": {}, - "outputs": [], - "source": [ - "query_filter = LineageFilter(\n", - " entities=[LineageEntityEnum.ARTIFACT],\n", - " sources=[LineageSourceEnum.ENDPOINT, LineageSourceEnum.DATASET],\n", - ")\n", - "\n", - "query_result = LineageQuery(sagemaker_session).query(\n", - " start_arns=[model_artifact.artifact_arn], # Model is the starting artifact\n", - " query_filter=query_filter,\n", - " # This specifies that the query should look for associations both ascending and descending for the start\n", - " direction=LineageQueryDirectionEnum.BOTH,\n", - " include_edges=False,\n", - ")\n", - "\n", - "associations = []\n", - "for vertex in query_result.vertices:\n", - " associations.append(vertex.to_lineage_object().source.source_uri)\n", - "\n", - "pp.pprint(associations)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "a69aff24", - "metadata": {}, - "source": [ - "### Directions in LineageQuery: Ascendants vs. Descendants\n", - "\n", - "To understand the direction in the Lineage Graph, take the following entity relationship graph - \n", - "Dataset -> Training Job -> Model -> Endpoint\n", - "\n", - "The endpoint is a **descendant** of the model, and the model is a **descendant** of the dataset. Similarly, the model is an **ascendant** of the endpoint The `direction` parameter can be used to specify whether the query should return entities that are descendants or ascendants of the entity in start_arns. If `start_arns` contains a model and the direction is `DESCENDANTS`, the query returns the endpoint. 
If the direction is `ASCENDANTS`, the query returns the dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a273b49f", - "metadata": {}, - "outputs": [], - "source": [ - "# In this example, we'll look at the impact of specifying the direction as ASCENDANTS or DESCENDANTS in a `LineageQuery`.\n", - "\n", - "query_filter = LineageFilter(\n", - " entities=[LineageEntityEnum.ARTIFACT],\n", - " sources=[\n", - " LineageSourceEnum.ENDPOINT,\n", - " LineageSourceEnum.MODEL,\n", - " LineageSourceEnum.DATASET,\n", - " LineageSourceEnum.TRAINING_JOB,\n", - " ],\n", - ")\n", - "\n", - "query_result = LineageQuery(sagemaker_session).query(\n", - " start_arns=[model_artifact.artifact_arn],\n", - " query_filter=query_filter,\n", - " direction=LineageQueryDirectionEnum.ASCENDANTS,\n", - " include_edges=False,\n", - ")\n", - "\n", - "ascendant_artifacts = []\n", - "\n", - "# The lineage entity returned for the Training Job is a TrialComponent which can't be converted to a\n", - "# lineage object using the method `to_lineage_object()` so we extract the TrialComponent ARN.\n", - "for vertex in query_result.vertices:\n", - " try:\n", - " ascendant_artifacts.append(vertex.to_lineage_object().source.source_uri)\n", - " except Exception:\n", - " ascendant_artifacts.append(vertex.arn)\n", - "\n", - "print(\"Ascendant artifacts:\")\n", - "pp.pprint(ascendant_artifacts)\n", - "\n", - "query_result = LineageQuery(sagemaker_session).query(\n", - " start_arns=[model_artifact.artifact_arn],\n", - " query_filter=query_filter,\n", - " direction=LineageQueryDirectionEnum.DESCENDANTS,\n", - " include_edges=False,\n", - ")\n", - "\n", - "descendant_artifacts = []\n", - "for vertex in query_result.vertices:\n", - " try:\n", - " descendant_artifacts.append(vertex.to_lineage_object().source.source_uri)\n", - " except Exception:\n", - " # Handling TrialComponents.\n", - " descendant_artifacts.append(vertex.arn)\n", - "\n", - "print(\"Descendant artifacts:\")\n", -
"pp.pprint(descendant_artifacts)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f7ec9d14", - "metadata": {}, - "source": [ - "### SDK helper functions\n", - "\n", - "The classes `EndpointContext`, `ModelArtifact`, and `DatasetArtifact`have helper functions that are wrappers over the `LineageQuery` API to make \n", - "certain lineage queries easier to leverage. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b5df166d", - "metadata": {}, - "outputs": [], - "source": [ - "# Find all the datasets associated with the endpoint\n", - "\n", - "datasets = []\n", - "dataset_artifacts = endpoint_context.dataset_artifacts()\n", - "for dataset in dataset_artifacts:\n", - " datasets.append(dataset.source.source_uri)\n", - "print(\"Datasets : \", datasets)\n", - "\n", - "# Find the training jobs associated with the endpoint\n", - "training_job_artifacts = endpoint_context.training_job_arns()\n", - "training_jobs = []\n", - "for training_job in training_job_artifacts:\n", - " training_jobs.append(training_job)\n", - "print(\"Training Jobs : \", training_jobs)\n", - "\n", - "# Get the ARN for the pipeline execution associated with this endpoint (if any)\n", - "pipeline_executions = endpoint_context.pipeline_execution_arn()\n", - "if pipeline_executions:\n", - " for pipeline in pipelines_executions:\n", - " print(pipeline)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dfc055f5", - "metadata": {}, - "outputs": [], - "source": [ - "# Here we use the `ModelArtifact` class to find all the datasets and endpoints associated with the model\n", - "\n", - "dataset_artifacts = model_artifact.dataset_artifacts()\n", - "endpoint_contexts = model_artifact.endpoint_contexts()\n", - "\n", - "datasets = [dataset.source.source_uri for dataset in dataset_artifacts]\n", - "endpoints = [endpoint.source.source_uri for endpoint in endpoint_contexts]\n", - "\n", - "print(\"Datasets associated with this model : \")\n", - 
"pp.pprint(datasets)\n", - "\n", - "print(\"Endpoints associated with this model : \")\n", - "pp.pprint(endpoints)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1fd69a51", - "metadata": {}, - "outputs": [], - "source": [ - "# Here we use the `DatasetArtifact` class to find all the endpoints hosting models that were trained with a particular dataset\n", - "# Find the artifact associated with the dataset\n", - "\n", - "dataset_artifact_arn = list(Artifact.list(source_uri=training_data))[0].artifact_arn\n", - "dataset_artifact = DatasetArtifact.load(artifact_arn=dataset_artifact_arn)\n", - "\n", - "# Find the endpoints that used this training dataset\n", - "endpoint_contexts = dataset_artifact.endpoint_contexts()\n", - "endpoints = [endpoint.source.source_uri for endpoint in endpoint_contexts]\n", - "\n", - "print(\"Endpoints associated with the training dataset {}\".format(training_data))\n", - "pp.pprint(endpoints)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "2f9fdd40", - "metadata": {}, - "source": [ - "### Lineage Graph Visualization\n", - "\n", - "A helper class `Visualizer()` is provided in `visualizer.py` to help plot the lineage graph. When the query response is rendered, a graph with the lineage relationships from the `StartArns` is displayed. From the `StartArns` the visualization shows the relationships with the other lineage entities returned in the `query_lineage` API call. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "106d8d5a", - "metadata": {}, - "outputs": [], - "source": [ - "# Graph APIs\n", - "# Here we use the boto3 `query_lineage` API to generate the query response to plot.\n", - "\n", - "from visualizer import Visualizer\n", - "\n", - "query_response = sm_client.query_lineage(\n", - " StartArns=[endpoint_context.context_arn], Direction=\"Ascendants\", IncludeEdges=True\n", - ")\n", - "\n", - "viz = Visualizer()\n", - "viz.render(query_response, \"Endpoint\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "22436292", - "metadata": {}, - "outputs": [], - "source": [ - "query_response = sm_client.query_lineage(\n", - " StartArns=[model_artifact.artifact_arn], Direction=\"Ascendants\", IncludeEdges=True\n", - ")\n", - "viz.render(query_response, \"Model\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "b393afa3", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "This notebook demostrated the capabilities of SageMaker Lineage that make it easy for users to keep track of their complex ML workflows. Users can construct their own lineage queries using the `LineageQuery` API and `LineageFilter` or they can use the functions provided on the `EndpointContext`, `ModelArtifact`, and `DatasetArtifact` classes. \n", - "\n", - "In addition, the responses from lineage queries can be plotting using the helper class `Visualizer()` to better understand the relationship between the lineage entities. \n", - "\n", - "When using SageMaker Pipelines as part of their ML workflows, users can find Pipeline execution ARNs using the lineage APIs described in this notebook.\n", - "\n", - "## Cleanup\n", - "In this section we clean up the resources created in this notebook." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8f43ef02", - "metadata": {}, - "outputs": [], - "source": [ - "# Delete endpoint\n", - "\n", - "sm_client.delete_endpoint(EndpointName=endpoint_name)\n", - "\n", - "# # Delete the model package\n", - "sm_client.delete_model_package(ModelPackageName=model_package.model_package_arn)\n", - "\n", - "# Delete the model package group\n", - "sm_client.delete_model_package_group(ModelPackageGroupName=model_package_group_name)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2e19fe85", - "metadata": {}, - "outputs": [], - "source": [ - "# Delete the experiment and trial within it\n", - "\n", - "import time\n", - "\n", - "\n", - "def delete_experiment(experiment):\n", - " for trial_summary in experiment.list_trials():\n", - " trial = Trial.load(trial_name=trial_summary.trial_name)\n", - " for trial_component_summary in trial.list_trial_components():\n", - " tc = TrialComponent.load(\n", - " trial_component_name=trial_component_summary.trial_component_name\n", - " )\n", - " trial.remove_trial_component(tc)\n", - " try:\n", - " # comment out to keep trial components\n", - " tc.delete()\n", - " except:\n", - " # tc is associated with another trial\n", - " continue\n", - " # to prevent throttling\n", - " time.sleep(0.5)\n", - " trial.delete()\n", - " experiment_name = experiment.experiment_name\n", - " experiment.delete()\n", - " print(f\"\\nExperiment {experiment_name} deleted\")\n", - "\n", - "\n", - "# Delete the Experiment and Trials within it\n", - "experiment = Experiment.load(experiment_name=exp.experiment_name)\n", - "delete_experiment(experiment)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "7a9fa294", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. 
The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-lineage|sagemaker-lineage-multihop-queries.ipynb)\n" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "interpreter": { - "hash": "ac2eaa0ea0ebeafcc7822e65e46aa9d4f966f30b695406963e145ea4a91cd4fc" - }, - "kernelspec": { - "display_name": "Python 3 (TensorFlow 2.6 Python 3.8 CPU Optimized)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/tensorflow-2.6-cpu-py38-ubuntu20.04-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb b/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb deleted file mode 100644 index ba00f7ec9a..0000000000 --- a/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb +++ /dev/null @@ -1,1697 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "# Orchestrate Jobs to Train and Evaluate Models with Amazon SageMaker Pipelines\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": 
{}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "\n", - "Amazon SageMaker Pipelines offers machine learning (ML) application developers and operations engineers the ability to orchestrate SageMaker jobs and author reproducible ML pipelines. It also enables them to deploy custom-built models for inference in real-time with low latency, run offline inferences with Batch Transform, and track lineage of artifacts. They can institute sound operational practices in deploying and monitoring production workflows, deploying model artifacts, and tracking artifact lineage through a simple interface, adhering to safety and best practice paradigms for ML application development.\n", - "\n", - "The SageMaker Pipelines service supports a SageMaker Pipeline domain specific language (DSL), which is a declarative JSON specification. This DSL defines a directed acyclic graph (DAG) of pipeline parameters and SageMaker job steps. The SageMaker Python Software Developer Kit (SDK) streamlines the generation of the pipeline DSL using constructs that engineers and scientists are already familiar with.\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately an hour to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [SageMaker Pipelines](#SageMaker-Pipelines)\n", - "1. [Notebook Overview](#Notebook-Overview)\n", - "1. 
[A SageMaker Pipeline](#A-SageMaker-Pipeline)\n", - "1. [Dataset](#Dataset)\n", - "1. [Define Parameters to Parametrize Pipeline Execution](#Define-Parameters-to-Parametrize-Pipeline-Execution)\n", - "1. [Define a Processing Step for Feature Engineering](#Define-a-Processing-Step-for-Feature-Engineering)\n", - "1. [Define a Training Step to Train a Model](#Define-a-Training-Step-to-Train-a-Model)\n", - "1. [Define a Model Evaluation Step to Evaluate the Trained Model](#Define-a-Model-Evaluation-Step-to-Evaluate-the-Trained-Model)\n", - "1. [Define a Create Model Step to Create a Model](#Define-a-Create-Model-Step-to-Create-a-Model)\n", - "1. [Define a Transform Step to Perform Batch Transformation](#Define-a-Transform-Step-to-Perform-Batch-Transformation)\n", - "1. [Define a Register Model Step to Create a Model Package](#Define-a-Register-Model-Step-to-Create-a-Model-Package)\n", - "1. [Define a Fail Step to Terminate the Pipeline Execution and Mark it as Failed](#Define-a-Fail-Step-to-Terminate-the-Pipeline-Execution-and-Mark-it-as-Failed)\n", - "1. [Define a Condition Step to Check Accuracy and Conditionally Create a Model and Run a Batch Transformation and Register a Model in the Model Registry, Or Terminate the Execution in Failed State](#Define-a-Condition-Step-to-Check-Accuracy-and-Conditionally-Create-a-Model-and-Run-a-Batch-Transformation-and-Register-a-Model-in-the-Model-Registry,-Or-Terminate-the-Execution-in-Failed-State)\n", - "1. [Define a Pipeline of Parameters, Steps, and Conditions](#Define-a-Pipeline-of-Parameters,-Steps,-and-Conditions)\n", - "1. [Submit the pipeline to SageMaker and start execution](#Submit-the-pipeline-to-SageMaker-and-start-execution)\n", - "1. [Pipeline Operations: Examining and Waiting for Pipeline Execution](#Pipeline-Operations:-Examining-and-Waiting-for-Pipeline-Execution)\n", - " 1. [Examining the Evaluation](#Examining-the-Evaluation)\n", - " 1. [Lineage](#Lineage)\n", - " 1. 
[Parametrized Executions](#Parametrized-Executions)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## SageMaker Pipelines\n", - "\n", - "SageMaker Pipelines supports the following activities, which are demonstrated in this notebook:\n", - "\n", - "* Pipelines - A DAG of steps and conditions to orchestrate SageMaker jobs and resource creation.\n", - "* Processing job steps - A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation.\n", - "* Training job steps - An iterative process that teaches a model to make predictions by presenting examples from a training dataset.\n", - "* Conditional execution steps - A step that provides conditional execution of branches in a pipeline.\n", - "* Register model steps - A step that creates a model package resource in the Model Registry that can be used to create deployable models in Amazon SageMaker.\n", - "* Create model steps - A step that creates a model for use in transform steps or later publication as an endpoint.\n", - "* Transform job steps - A batch transform to preprocess datasets to remove noise or bias that interferes with training or inference from a dataset, get inferences from large datasets, and run inference when a persistent endpoint is not needed.\n", - "* Fail steps - A step that stops a pipeline execution and marks the pipeline execution as failed.\n", - "* Parametrized Pipeline executions - Enables variation in pipeline executions according to specified parameters." 
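The idea of a pipeline as a DAG of steps can be sketched without any AWS dependency. The snippet below is only an illustration, assuming the step dependencies described in this notebook; `AbaloneProcess` and `AbaloneTrain` are step names used later in the notebook, while the remaining names are hypothetical labels chosen for this sketch. It uses the standard-library `graphlib` module (Python 3.9+) to derive one valid execution order.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps it depends on.
# "AbaloneProcess" and "AbaloneTrain" come from this notebook;
# every other name is a hypothetical label used only for illustration.
dependencies = {
    "AbaloneProcess": set(),
    "AbaloneTrain": {"AbaloneProcess"},
    "Evaluate": {"AbaloneProcess", "AbaloneTrain"},
    "CheckMSE": {"Evaluate"},
    "CreateModel": {"CheckMSE"},
    "Transform": {"CreateModel"},
    "RegisterModel": {"CheckMSE"},
    "Fail": {"CheckMSE"},
}

# static_order() yields the steps so that every step appears after its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

In a real execution the condition step runs only one of its branches, so the fail step and the model steps would not both run; the ordering above shows only which steps must wait for which.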
- ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Notebook Overview\n", - "\n", - "This notebook shows how to:\n", - "\n", - "* Define a set of Pipeline parameters that can be used to parametrize a SageMaker Pipeline.\n", - "* Define a Processing step that performs cleaning, feature engineering, and splitting the input data into train and test data sets.\n", - "* Define a Training step that trains a model on the preprocessed train data set.\n", - "* Define a Processing step that evaluates the trained model's performance on the test dataset.\n", - "* Define a Create Model step that creates a model from the model artifacts used in training.\n", - "* Define a Transform step that performs batch transformation based on the model that was created.\n", - "* Define a Register Model step that creates a model package from the estimator and model artifacts used to train the model.\n", - "* Define a Conditional step that measures a condition based on output from prior steps and conditionally executes other steps.\n", - "* Define a Fail step with a customized error message indicating the cause of the execution failure.\n", - "* Define and create a Pipeline definition in a DAG, with the defined parameters and steps.\n", - "* Start a Pipeline execution and wait for execution to complete.\n", - "* Download the model evaluation report from the S3 bucket for examination.\n", - "* Start a second Pipeline execution." 
- ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## A SageMaker Pipeline\n", - "\n", - "The pipeline that you create follows a typical machine learning (ML) application pattern of preprocessing, training, evaluation, model creation, batch transformation, and model registration:\n", - "\n", - "![A typical ML Application pipeline](img/pipeline-full.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Dataset\n", - "\n", - "The dataset you use is the [UCI Machine Learning Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/abalone) [1]. The aim of this task is to determine the age of an abalone snail from its physical measurements. At its core, this is a regression problem.\n", - "\n", - "The dataset contains several features: length (the longest shell measurement), diameter (the diameter perpendicular to length), height (the height with meat in the shell), whole_weight (the weight of whole abalone), shucked_weight (the weight of meat), viscera_weight (the gut weight after bleeding), shell_weight (the weight after being dried), sex ('M', 'F', 'I' where 'I' is Infant), and rings (integer).\n", - "\n", - "The number of rings turns out to be a good approximation for age (age is rings + 1.5). However, obtaining this number requires cutting the shell through the cone, staining the section, and counting the number of rings through a microscope, which is a time-consuming task. The other physical measurements are easier to determine. You use the dataset to build a model that predicts the variable rings from these other physical measurements.\n", - "\n", - "Before you upload the data to an S3 bucket, install the SageMaker Python SDK and gather some constants you can use later in this notebook.\n", - "\n", - "> [1] Dua, D. and Graff, C. (2019). 
[UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "!pip install -U sagemaker" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import sys\n", - "\n", - "import boto3\n", - "import sagemaker\n", - "from sagemaker.workflow.pipeline_context import PipelineSession\n", - "\n", - "sagemaker_session = sagemaker.session.Session()\n", - "region = sagemaker_session.boto_region_name\n", - "role = sagemaker.get_execution_role()\n", - "pipeline_session = PipelineSession()\n", - "default_bucket = sagemaker_session.default_bucket()\n", - "model_package_group_name = \"AbaloneModelPackageGroupName\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Now, upload the data into the default bucket. You can select your own dataset for the `input_data_uri` as appropriate."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "!mkdir -p data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "local_path = \"data/abalone-dataset.csv\"\n", - "\n", - "s3 = boto3.resource(\"s3\")\n", - "s3.Bucket(f\"sagemaker-example-files-prod-{region}\").download_file(\n", - " \"datasets/tabular/uci_abalone/abalone.csv\", local_path\n", - ")\n", - "\n", - "base_uri = f\"s3://{default_bucket}/abalone\"\n", - "input_data_uri = sagemaker.s3.S3Uploader.upload(\n", - " local_path=local_path,\n", - " desired_s3_uri=base_uri,\n", - ")\n", - "print(input_data_uri)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Download a second dataset for batch transformation after model creation. You can select your own dataset for the `batch_data_uri` as appropriate." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "local_path = \"data/abalone-dataset-batch\"\n", - "\n", - "s3 = boto3.resource(\"s3\")\n", - "s3.Bucket(f\"sagemaker-servicecatalog-seedcode-{region}\").download_file(\n", - " \"dataset/abalone-dataset-batch\", local_path\n", - ")\n", - "\n", - "base_uri = f\"s3://{default_bucket}/abalone\"\n", - "batch_data_uri = sagemaker.s3.S3Uploader.upload(\n", - " local_path=local_path,\n", - " desired_s3_uri=base_uri,\n", - ")\n", - "print(batch_data_uri)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define Parameters to Parametrize Pipeline Execution\n", - "\n", - "Define Pipeline parameters that you can use to parametrize the pipeline. 
Parameters enable custom pipeline executions and schedules without having to modify the Pipeline definition.\n", - "\n", - "The supported parameter types include:\n", - "\n", - "* `ParameterString` - represents a `str` Python type\n", - "* `ParameterInteger` - represents an `int` Python type\n", - "* `ParameterFloat` - represents a `float` Python type\n", - "\n", - "These parameters support providing a default value, which can be overridden on pipeline execution. The default value specified should be an instance of the type of the parameter.\n", - "\n", - "The parameters defined in this workflow include:\n", - "\n", - "* `processing_instance_count` - The instance count of the processing job.\n", - "* `instance_type` - The `ml.*` instance type of the training job.\n", - "* `model_approval_status` - The approval status to register with the trained model for CI/CD purposes (\"PendingManualApproval\" is the default).\n", - "* `input_data` - The S3 bucket URI location of the input data.\n", - "* `batch_data` - The S3 bucket URI location of the batch data.\n", - "* `mse_threshold` - The Mean Squared Error (MSE) threshold used to verify the accuracy of a model." 
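The default-and-override behavior described above can be pictured with plain Python. This is a minimal sketch, not the SageMaker SDK: `resolve_parameters` is a hypothetical helper, and the default values are illustrative. It shows that an execution sees the defaults unless an override is supplied, and that an override should have the same type as the default.

```python
# Illustrative defaults keyed by the parameter names listed above.
defaults = {
    "processing_instance_count": 1,
    "instance_type": "ml.m5.xlarge",
    "model_approval_status": "PendingManualApproval",
    "mse_threshold": 6.0,
}


def resolve_parameters(defaults, overrides):
    """Return the values one execution would see (hypothetical helper, not an SDK call)."""
    for name, value in overrides.items():
        if name not in defaults:
            raise KeyError(f"unknown parameter: {name}")
        # An override must be an instance of the same type as the default value.
        if not isinstance(value, type(defaults[name])):
            raise TypeError(f"{name} expects {type(defaults[name]).__name__}")
    # Later keys win, so overrides replace defaults.
    return {**defaults, **overrides}


resolved = resolve_parameters(defaults, {"mse_threshold": 3.0})
print(resolved)
```

Because the overrides are merged at execution time, the same pipeline definition can serve many runs with different settings.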
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.workflow.parameters import (\n", - " ParameterInteger,\n", - " ParameterString,\n", - " ParameterFloat,\n", - ")\n", - "\n", - "processing_instance_count = ParameterInteger(name=\"ProcessingInstanceCount\", default_value=1)\n", - "instance_type = ParameterString(name=\"TrainingInstanceType\", default_value=\"ml.m5.xlarge\")\n", - "model_approval_status = ParameterString(\n", - " name=\"ModelApprovalStatus\", default_value=\"PendingManualApproval\"\n", - ")\n", - "input_data = ParameterString(\n", - " name=\"InputData\",\n", - " default_value=input_data_uri,\n", - ")\n", - "batch_data = ParameterString(\n", - " name=\"BatchData\",\n", - " default_value=batch_data_uri,\n", - ")\n", - "mse_threshold = ParameterFloat(name=\"MseThreshold\", default_value=6.0)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "![Define Parameters](img/pipeline-1.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Processing Step for Feature Engineering\n", - "\n", - "First, develop a preprocessing script that is specified in the Processing step.\n", - "\n", - "This notebook cell writes a file `code/preprocessing.py`, which contains the preprocessing script. You can update the script and rerun this cell to overwrite it. The preprocessing script uses `scikit-learn` to do the following:\n", - "\n", - "* Fill in missing sex category data and encode it so that it is suitable for training.\n", - "* Scale and normalize all numerical feature fields (every column except sex and the rings label).\n", - "* Split the data into training, validation, and test datasets.\n", - "\n", - "The Processing step executes the script on the input data. 
The Training step uses the preprocessed training features and labels to train a model. The Evaluation step uses the trained model and preprocessed test features and labels to evaluate the model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "!mkdir -p code" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "%%writefile code/preprocessing.py\n", - "import argparse\n", - "import os\n", - "import requests\n", - "import tempfile\n", - "\n", - "import numpy as np\n", - "import pandas as pd\n", - "\n", - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.impute import SimpleImputer\n", - "from sklearn.pipeline import Pipeline\n", - "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", - "\n", - "\n", - "# Since we get a headerless CSV file, we specify the column names here.\n", - "feature_columns_names = [\n", - " \"sex\",\n", - " \"length\",\n", - " \"diameter\",\n", - " \"height\",\n", - " \"whole_weight\",\n", - " \"shucked_weight\",\n", - " \"viscera_weight\",\n", - " \"shell_weight\",\n", - "]\n", - "label_column = \"rings\"\n", - "\n", - "feature_columns_dtype = {\n", - " \"sex\": str,\n", - " \"length\": np.float64,\n", - " \"diameter\": np.float64,\n", - " \"height\": np.float64,\n", - " \"whole_weight\": np.float64,\n", - " \"shucked_weight\": np.float64,\n", - " \"viscera_weight\": np.float64,\n", - " \"shell_weight\": np.float64,\n", - "}\n", - "label_column_dtype = {\"rings\": np.float64}\n", - "\n", - "\n", - "def merge_two_dicts(x, y):\n", - " z = x.copy()\n", - " z.update(y)\n", - " return z\n", - "\n", - "\n", - "if __name__ == \"__main__\":\n", - " base_dir = \"/opt/ml/processing\"\n", - "\n", - " df = pd.read_csv(\n", - " f\"{base_dir}/input/abalone-dataset.csv\",\n", - " header=None,\n", - " 
names=feature_columns_names + [label_column],\n", - " dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype),\n", - " )\n", - " numeric_features = list(feature_columns_names)\n", - " numeric_features.remove(\"sex\")\n", - " numeric_transformer = Pipeline(\n", - " steps=[(\"imputer\", SimpleImputer(strategy=\"median\")), (\"scaler\", StandardScaler())]\n", - " )\n", - "\n", - " categorical_features = [\"sex\"]\n", - " categorical_transformer = Pipeline(\n", - " steps=[\n", - " (\"imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n", - " (\"onehot\", OneHotEncoder(handle_unknown=\"ignore\")),\n", - " ]\n", - " )\n", - "\n", - " preprocess = ColumnTransformer(\n", - " transformers=[\n", - " (\"num\", numeric_transformer, numeric_features),\n", - " (\"cat\", categorical_transformer, categorical_features),\n", - " ]\n", - " )\n", - "\n", - " y = df.pop(\"rings\")\n", - " X_pre = preprocess.fit_transform(df)\n", - " y_pre = y.to_numpy().reshape(len(y), 1)\n", - "\n", - " X = np.concatenate((y_pre, X_pre), axis=1)\n", - "\n", - " np.random.shuffle(X)\n", - " train, validation, test = np.split(X, [int(0.7 * len(X)), int(0.85 * len(X))])\n", - "\n", - " pd.DataFrame(train).to_csv(f\"{base_dir}/train/train.csv\", header=False, index=False)\n", - " pd.DataFrame(validation).to_csv(\n", - " f\"{base_dir}/validation/validation.csv\", header=False, index=False\n", - " )\n", - " pd.DataFrame(test).to_csv(f\"{base_dir}/test/test.csv\", header=False, index=False)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Next, create an instance of a `SKLearnProcessor` processor and use that in our `ProcessingStep`.\n", - "\n", - "You also specify the `framework_version` to use throughout this notebook.\n", - "\n", - "Note the `processing_instance_count` parameter used by the processor instance." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.sklearn.processing import SKLearnProcessor\n", - "\n", - "\n", - "framework_version = \"1.2-1\"\n", - "\n", - "sklearn_processor = SKLearnProcessor(\n", - " framework_version=framework_version,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " instance_count=processing_instance_count,\n", - " base_job_name=\"sklearn-abalone-process\",\n", - " role=role,\n", - " sagemaker_session=pipeline_session,\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Finally, we take the output of the processor's `run` method and pass that as arguments to the `ProcessingStep`. Because the `pipeline_session` is passed as the `sagemaker_session`, calling `.run()` does not launch the processing job; instead, it returns the arguments needed to run the job as a step in the pipeline.\n", - "\n", - "Note the `\"train\"`, `\"validation\"`, and `\"test\"` named channels specified in the output configuration for the processing job. Step `Properties` can be used in subsequent steps and resolve to their runtime values at execution. This usage is shown when you define the training step."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", - "from sagemaker.workflow.steps import ProcessingStep\n", - "\n", - "processor_args = sklearn_processor.run(\n", - " inputs=[\n", - " ProcessingInput(source=input_data, destination=\"/opt/ml/processing/input\"),\n", - " ],\n", - " outputs=[\n", - " ProcessingOutput(output_name=\"train\", source=\"/opt/ml/processing/train\"),\n", - " ProcessingOutput(output_name=\"validation\", source=\"/opt/ml/processing/validation\"),\n", - " ProcessingOutput(output_name=\"test\", source=\"/opt/ml/processing/test\"),\n", - " ],\n", - " code=\"code/preprocessing.py\",\n", - ")\n", - "\n", - "step_process = ProcessingStep(name=\"AbaloneProcess\", step_args=processor_args)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "![Define a Processing Step for Feature Engineering](img/pipeline-2.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Training Step to Train a Model\n", - "\n", - "In this section, use Amazon SageMaker's [XGBoost Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to train on this dataset. Configure an Estimator for the XGBoost algorithm and the input dataset. A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later.\n", - "\n", - "The model path where the models from training are saved is also specified.\n", - "\n", - "Note the `instance_type` parameter may be used in multiple places in the pipeline. In this case, the `instance_type` is passed into the estimator." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.estimator import Estimator\n", - "from sagemaker.inputs import TrainingInput\n", - "\n", - "model_path = f\"s3://{default_bucket}/AbaloneTrain\"\n", - "image_uri = sagemaker.image_uris.retrieve(\n", - " framework=\"xgboost\",\n", - " region=region,\n", - " version=\"1.0-1\",\n", - " py_version=\"py3\",\n", - " instance_type=\"ml.m5.xlarge\",\n", - ")\n", - "xgb_train = Estimator(\n", - " image_uri=image_uri,\n", - " instance_type=instance_type,\n", - " instance_count=1,\n", - " output_path=model_path,\n", - " role=role,\n", - " sagemaker_session=pipeline_session,\n", - ")\n", - "xgb_train.set_hyperparameters(\n", - " objective=\"reg:linear\",\n", - " num_round=50,\n", - " max_depth=5,\n", - " eta=0.2,\n", - " gamma=4,\n", - " min_child_weight=6,\n", - " subsample=0.7,\n", - ")\n", - "\n", - "train_args = xgb_train.fit(\n", - " inputs={\n", - " \"train\": TrainingInput(\n", - " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\"train\"].S3Output.S3Uri,\n", - " content_type=\"text/csv\",\n", - " ),\n", - " \"validation\": TrainingInput(\n", - " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\n", - " \"validation\"\n", - " ].S3Output.S3Uri,\n", - " content_type=\"text/csv\",\n", - " ),\n", - " }\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Finally, we use the output of the estimator's `.fit()` method as arguments to the `TrainingStep`. Because the `pipeline_session` is passed as the `sagemaker_session`, calling `.fit()` does not launch the training job; instead, it returns the arguments needed to run the job as a step in the pipeline.\n", - "\n", - "The `S3Uri` values of the `\"train\"` and `\"validation\"` output channels are passed to the `.fit()` method. 
Also, use the `\"test\"` output channel for model evaluation in the pipeline. The `properties` attribute of a Pipeline step matches the object model of the corresponding response of a describe call. These properties can be referenced as placeholder values and are resolved at runtime. For example, the `ProcessingStep` `properties` attribute matches the object model of the [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response object." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.inputs import TrainingInput\n", - "from sagemaker.workflow.steps import TrainingStep\n", - "\n", - "\n", - "step_train = TrainingStep(\n", - " name=\"AbaloneTrain\",\n", - " step_args=train_args,\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "![Define a Training Step to Train a Model](img/pipeline-3.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Model Evaluation Step to Evaluate the Trained Model\n", - "\n", - "First, develop an evaluation script that is specified in a Processing step that performs the model evaluation.\n", - "\n", - "After pipeline execution, you can examine the resulting `evaluation.json` for analysis.\n", - "\n", - "The evaluation script uses `xgboost` to do the following:\n", - "\n", - "* Load the model.\n", - "* Read the test data.\n", - "* Issue predictions against the test data.\n", - "* Build a regression report containing the mean squared error (MSE) and the standard deviation of the prediction errors.\n", - "* Save the evaluation report to the evaluation directory."
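The evaluation script that follows reports a mean squared error and the standard deviation of the prediction errors. As a dependency-free sketch of those two numbers and of the JSON layout the script writes, the snippet below uses made-up values; `y_test` and `predictions` here are illustrative lists, not output of the pipeline.

```python
import json
import statistics

# Made-up ring counts and predictions, for illustration only.
y_test = [10.0, 8.0, 12.0, 9.0]
predictions = [9.0, 8.5, 11.0, 10.0]

# Per-sample prediction errors.
errors = [actual - predicted for actual, predicted in zip(y_test, predictions)]

# Mean squared error: the average of the squared errors.
mse = sum(e * e for e in errors) / len(errors)

# Population standard deviation of the errors (the same convention as NumPy's default np.std).
std = statistics.pstdev(errors)

# Same nesting as the report written by the evaluation script.
report_dict = {
    "regression_metrics": {
        "mse": {"value": mse, "standard_deviation": std},
    },
}
print(json.dumps(report_dict))
```

A later step can read `regression_metrics.mse.value` from this JSON and compare it with the MSE threshold parameter.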
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "%%writefile code/evaluation.py\n", - "import json\n", - "import pathlib\n", - "import pickle\n", - "import tarfile\n", - "\n", - "import joblib\n", - "import numpy as np\n", - "import pandas as pd\n", - "import xgboost\n", - "\n", - "from sklearn.metrics import mean_squared_error\n", - "\n", - "\n", - "if __name__ == \"__main__\":\n", - " model_path = f\"/opt/ml/processing/model/model.tar.gz\"\n", - " with tarfile.open(model_path) as tar:\n", - " tar.extractall(path=\".\")\n", - "\n", - " model = pickle.load(open(\"xgboost-model\", \"rb\"))\n", - "\n", - " test_path = \"/opt/ml/processing/test/test.csv\"\n", - " df = pd.read_csv(test_path, header=None)\n", - "\n", - " y_test = df.iloc[:, 0].to_numpy()\n", - " df.drop(df.columns[0], axis=1, inplace=True)\n", - "\n", - " X_test = xgboost.DMatrix(df.values)\n", - "\n", - " predictions = model.predict(X_test)\n", - "\n", - " mse = mean_squared_error(y_test, predictions)\n", - " std = np.std(y_test - predictions)\n", - " report_dict = {\n", - " \"regression_metrics\": {\n", - " \"mse\": {\"value\": mse, \"standard_deviation\": std},\n", - " },\n", - " }\n", - "\n", - " output_dir = \"/opt/ml/processing/evaluation\"\n", - " pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)\n", - "\n", - " evaluation_path = f\"{output_dir}/evaluation.json\"\n", - " with open(evaluation_path, \"w\") as f:\n", - " f.write(json.dumps(report_dict))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Next, create an instance of a `ScriptProcessor` processor and use it in the `ProcessingStep`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.processing import ScriptProcessor\n", - "\n", - "\n", - "script_eval = ScriptProcessor(\n", - " image_uri=image_uri,\n", - " command=[\"python3\"],\n", - " instance_type=\"ml.m5.xlarge\",\n", - " instance_count=1,\n", - " base_job_name=\"script-abalone-eval\",\n", - " role=role,\n", - " sagemaker_session=pipeline_session,\n", - ")\n", - "\n", - "eval_args = script_eval.run(\n", - " inputs=[\n", - " ProcessingInput(\n", - " source=step_train.properties.ModelArtifacts.S3ModelArtifacts,\n", - " destination=\"/opt/ml/processing/model\",\n", - " ),\n", - " ProcessingInput(\n", - " source=step_process.properties.ProcessingOutputConfig.Outputs[\"test\"].S3Output.S3Uri,\n", - " destination=\"/opt/ml/processing/test\",\n", - " ),\n", - " ],\n", - " outputs=[\n", - " ProcessingOutput(output_name=\"evaluation\", source=\"/opt/ml/processing/evaluation\"),\n", - " ],\n", - " code=\"code/evaluation.py\",\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Use the arguments returned by the processor's `.run()` call to construct a `ProcessingStep`. These arguments carry the input and output channels and the code that runs when the pipeline executes the step.\n", - "\n", - "Specifically, the `S3ModelArtifacts` from the `step_train` `properties` and the `S3Uri` of the `\"test_data\"` output channel of the `step_process` `properties` are passed as inputs. The `TrainingStep` and `ProcessingStep` `properties` attributes match the object models of the [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) and [DescribeProcessingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeProcessingJob.html) response objects, respectively."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.workflow.properties import PropertyFile\n", - "\n", - "\n", - "evaluation_report = PropertyFile(\n", - " name=\"EvaluationReport\", output_name=\"evaluation\", path=\"evaluation.json\"\n", - ")\n", - "step_eval = ProcessingStep(\n", - " name=\"AbaloneEval\",\n", - " step_args=eval_args,\n", - " property_files=[evaluation_report],\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "![Define a Model Evaluation Step to Evaluate the Trained Model](img/pipeline-4.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Create Model Step to Create a Model\n", - "\n", - "In order to perform batch transformation using the example model, create a SageMaker model.\n", - "\n", - "Specifically, pass in the `S3ModelArtifacts` from the `TrainingStep`, `step_train` properties. The `TrainingStep` `properties` attribute matches the object model of the [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) response object." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.model import Model\n", - "\n", - "model = Model(\n", - " image_uri=image_uri,\n", - " model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,\n", - " sagemaker_session=pipeline_session,\n", - " role=role,\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Define the `ModelStep` by providing the return values from `model.create()` as the step arguments." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.inputs import CreateModelInput\n", - "from sagemaker.workflow.model_step import ModelStep\n", - "\n", - "step_create_model = ModelStep(\n", - " name=\"AbaloneCreateModel\",\n", - " step_args=model.create(instance_type=\"ml.m5.large\", accelerator_type=\"ml.eia1.medium\"),\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Transform Step to Perform Batch Transformation\n", - "\n", - "Now that a model instance is defined, create a `Transformer` instance with the appropriate model type, compute instance type, and desired output S3 URI.\n", - "\n", - "Specifically, pass in the `ModelName` from the `CreateModelStep`, `step_create_model` properties. The `CreateModelStep` `properties` attribute matches the object model of the [DescribeModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeModel.html) response object." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.transformer import Transformer\n", - "\n", - "\n", - "transformer = Transformer(\n", - " model_name=step_create_model.properties.ModelName,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " instance_count=1,\n", - " output_path=f\"s3://{default_bucket}/AbaloneTransform\",\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Pass in the transformer instance and the `TransformInput` with the `batch_data` pipeline parameter defined earlier." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.inputs import TransformInput\n", - "from sagemaker.workflow.steps import TransformStep\n", - "\n", - "\n", - "step_transform = TransformStep(\n", - " name=\"AbaloneTransform\", transformer=transformer, inputs=TransformInput(data=batch_data)\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Register Model Step to Create a Model Package\n", - "\n", - "A model package is an abstraction of reusable model artifacts that packages all ingredients required for inference. Primarily, it consists of an inference specification that defines the inference image to use along with an optional model weights location.\n", - "\n", - "A model package group is a collection of model packages. A model package group can be created for a specific ML business problem, and new versions of the model packages can be added to it. Typically, customers are expected to create a ModelPackageGroup for a SageMaker pipeline so that model package versions can be added to the group for every SageMaker Pipeline run.\n", - "\n", - "To register a model in the Model Registry, we take the model created in the previous steps\n", - "```\n", - "model = Model(\n", - " image_uri=image_uri,\n", - " model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,\n", - " sagemaker_session=pipeline_session,\n", - " role=role,\n", - ")\n", - "```\n", - "and call the `.register()` function on it while passing all the parameters needed for registering the model.\n", - "\n", - "We take the outputs of the `.register()` call and pass that to the `ModelStep` as step arguments." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.model_metrics import MetricsSource, ModelMetrics\n", - "\n", - "model_metrics = ModelMetrics(\n", - " model_statistics=MetricsSource(\n", - " s3_uri=\"{}/evaluation.json\".format(\n", - " step_eval.arguments[\"ProcessingOutputConfig\"][\"Outputs\"][0][\"S3Output\"][\"S3Uri\"]\n", - " ),\n", - " content_type=\"application/json\",\n", - " )\n", - ")\n", - "\n", - "register_args = model.register(\n", - " content_types=[\"text/csv\"],\n", - " response_types=[\"text/csv\"],\n", - " inference_instances=[\"ml.t2.medium\", \"ml.m5.xlarge\"],\n", - " transform_instances=[\"ml.m5.xlarge\"],\n", - " model_package_group_name=model_package_group_name,\n", - " approval_status=model_approval_status,\n", - " model_metrics=model_metrics,\n", - ")\n", - "step_register = ModelStep(name=\"AbaloneRegisterModel\", step_args=register_args)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "![Define a Create Model Step and Batch Transform to Process Data in Batch at Scale](img/pipeline-5.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Fail Step to Terminate the Pipeline Execution and Mark it as Failed\n", - "\n", - "This section walks you through the following steps:\n", - "\n", - "* Define a `FailStep` with a customized error message that indicates the cause of the execution failure.\n", - "* Build the `FailStep` error message with a `Join` function, which concatenates a static text string with the dynamic `mse_threshold` parameter to produce a more informative error message."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.workflow.fail_step import FailStep\n", - "from sagemaker.workflow.functions import Join\n", - "\n", - "step_fail = FailStep(\n", - " name=\"AbaloneMSEFail\",\n", - " error_message=Join(on=\" \", values=[\"Execution failed due to MSE >\", mse_threshold]),\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "![Define a Fail Step to Terminate the Execution in Failed State](img/pipeline-8.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Condition Step to Check Accuracy and Conditionally Create a Model and Run a Batch Transformation and Register a Model in the Model Registry, Or Terminate the Execution in Failed State\n", - "\n", - "In this step, the model is registered only if its mean squared error (MSE), as computed by the evaluation step `step_eval`, is less than or equal to the `mse_threshold` parameter. Otherwise, the pipeline execution fails and terminates. 
A `ConditionStep` enables pipelines to support conditional execution in the pipeline DAG based on the conditions of the step properties.\n", - "\n", - "In the following section, you:\n", - "\n", - "* Define a `ConditionLessThanOrEqualTo` on the MSE value found in the output of the evaluation step, `step_eval`.\n", - "* Use the condition in the list of conditions in a `ConditionStep`.\n", - "* Pass the `CreateModelStep` and `TransformStep` steps, and the `RegisterModel` step collection into the `if_steps` of the `ConditionStep`, which are only executed if the condition evaluates to `True`.\n", - "* Pass the `FailStep` step into the `else_steps` of the `ConditionStep`, which is only executed if the condition evaluates to `False`."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo\n", - "from sagemaker.workflow.condition_step import ConditionStep\n", - "from sagemaker.workflow.functions import JsonGet\n", - "\n", - "\n", - "cond_lte = ConditionLessThanOrEqualTo(\n", - " left=JsonGet(\n", - " step_name=step_eval.name,\n", - " property_file=evaluation_report,\n", - " json_path=\"regression_metrics.mse.value\",\n", - " ),\n", - " right=mse_threshold,\n", - ")\n", - "\n", - "step_cond = ConditionStep(\n", - " name=\"AbaloneMSECond\",\n", - " conditions=[cond_lte],\n", - " if_steps=[step_register, step_create_model, step_transform],\n", - " else_steps=[step_fail],\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "![Define a Condition Step to Check Accuracy and Conditionally Execute Steps](img/pipeline-6.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Define a Pipeline of Parameters, Steps, and Conditions\n", - "\n", - "In this section, combine the steps into a Pipeline so it can be executed.\n", - "\n", - "A pipeline requires a `name`, `parameters`, and `steps`. Names must be unique within an `(account, region)` pair.\n", - "\n", - "Note:\n", - "\n", - "* All the parameters used in the definitions must be present.\n", - "* Steps passed into the pipeline do not have to be listed in the order of execution. The SageMaker Pipelines service determines the execution order from the data dependency DAG.\n", - "* Steps must be unique across the pipeline step list and all condition step if/else lists."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.workflow.pipeline import Pipeline\n", - "\n", - "\n", - "pipeline_name = f\"AbalonePipeline\"\n", - "pipeline = Pipeline(\n", - " name=pipeline_name,\n", - " parameters=[\n", - " processing_instance_count,\n", - " instance_type,\n", - " model_approval_status,\n", - " input_data,\n", - " batch_data,\n", - " mse_threshold,\n", - " ],\n", - " steps=[step_process, step_train, step_eval, step_cond],\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "![Define a Pipeline of Parameters, Steps, and Conditions](img/pipeline-7.png)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### (Optional) Examining the pipeline definition\n", - "\n", - "The JSON of the pipeline definition can be examined to confirm the pipeline is well-defined and the parameters and step properties resolve correctly." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import json\n", - "\n", - "\n", - "definition = json.loads(pipeline.definition())\n", - "definition" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Submit the pipeline to SageMaker and start execution\n", - "\n", - "Submit the pipeline definition to the Pipeline service. The Pipeline service uses the role that is passed in to create all the jobs defined in the steps." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "pipeline.upsert(role_arn=role)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Start the pipeline and accept all the default parameters." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution = pipeline.start()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Pipeline Operations: Examining and Waiting for Pipeline Execution\n", - "\n", - "Describe the pipeline execution." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution.describe()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Wait for the execution to complete." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution.wait()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "List the steps in the execution. These are the steps in the pipeline that have been resolved by the step executor service." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution.list_steps()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### Examining the Evaluation\n", - "\n", - "Examine the resulting model evaluation after the pipeline completes. Download the resulting `evaluation.json` file from S3 and print the report." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from pprint import pprint\n", - "\n", - "\n", - "evaluation_json = sagemaker.s3.S3Downloader.read_file(\n", - " \"{}/evaluation.json\".format(\n", - " step_eval.arguments[\"ProcessingOutputConfig\"][\"Outputs\"][0][\"S3Output\"][\"S3Uri\"]\n", - " )\n", - ")\n", - "pprint(json.loads(evaluation_json))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### Lineage\n", - "\n", - "Review the lineage of the artifacts generated by the pipeline." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import time\n", - "from sagemaker.lineage.visualizer import LineageTableVisualizer\n", - "\n", - "\n", - "viz = LineageTableVisualizer(sagemaker.session.Session())\n", - "for execution_step in reversed(execution.list_steps()):\n", - " print(execution_step)\n", - " display(viz.show(pipeline_execution_step=execution_step))\n", - " time.sleep(5)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### Parametrized Executions\n", - "\n", - "You can run additional executions of the pipeline and specify different pipeline parameters. 
The `parameters` argument is a dictionary that maps parameter names to the values that override their defaults.\n", - "\n", - "Based on the performance of the model, you might want to kick off another pipeline execution on a compute-optimized instance type and set the model approval status to \"Approved\" automatically. This means that the model package version generated by the `RegisterModel` step is automatically ready for deployment through CI/CD pipelines, such as with SageMaker Projects." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution = pipeline.start(\n", - " parameters=dict(\n", - " ModelApprovalStatus=\"Approved\",\n", - " )\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution.wait()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution.list_steps()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "You might also want to lower the MSE threshold to raise the bar for the accuracy of the registered model. In that case, override the MSE threshold as follows:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution = pipeline.start(parameters=dict(MseThreshold=3.0))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "If the MSE threshold is not satisfied, the pipeline execution enters the `FailStep` and is marked as failed."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "try:\n", - " execution.wait()\n", - "except Exception as error:\n", - " print(error)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "execution.list_steps()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-pipelines|tabular|abalone_build_train_deploy|sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb)\n" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (Data Science 3.0)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/sagemaker-pipelines/tabular/lambda-step/sagemaker-pipelines-lambda-step.ipynb b/sagemaker-pipelines/tabular/lambda-step/sagemaker-pipelines-lambda-step.ipynb deleted file mode 100644 index e1861407da..0000000000 --- a/sagemaker-pipelines/tabular/lambda-step/sagemaker-pipelines-lambda-step.ipynb +++ /dev/null @@ -1,1709 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# SageMaker Pipelines Lambda Step\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "This notebook illustrates how a Lambda function can be run as a step in a SageMaker Pipeline.\n", - "\n", - "The steps in this pipeline include:\n", - "* Preprocess the Abalone dataset\n", - "* Train an XGBoost Model\n", - "* Evaluate the model performance\n", - "* Create a model\n", - "* Deploy the model to a SageMaker Hosted Endpoint using a Lambda Function, through SageMaker Pipelines\n", - "\n", - "A step to register the model into a Model Registry can be added to the pipeline using the `RegisterModel` step." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Runtime\n", - "\n", - "This notebook takes approximately 15 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [Prerequisites](#Prerequisites)\n", - "1. [Configuration Setup](#Configuration-Setup)\n", - "1. [Data Preparation](#Data-Preparation)\n", - "1. [Model Training and Evaluation](#Model-Training-and-Evaluation)\n", - "1. [Setting up Lambda](#Setting-up-Lambda)\n", - "1. [Execute the Pipeline](#Execute-the-Pipeline)\n", - "1. [Clean up resources](#Clean-up-resources)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Prerequisites\n", - "\n", - "The notebook execution role should have policies which enable the notebook to create a Lambda function. 
The Amazon managed policy `AmazonSageMakerPipelinesIntegrations` can be added to the notebook execution role to achieve the same effect.\n", - "\n", - "The policy description is as follows:\n", - "\n", - "```\n", - "\n", - "{\n", - " \"Version\": \"2012-10-17\",\n", - " \"Statement\": [\n", - " {\n", - " \"Effect\": \"Allow\",\n", - " \"Action\": [\n", - " \"lambda:CreateFunction\",\n", - " \"lambda:DeleteFunction\",\n", - " \"lambda:InvokeFunction\",\n", - " \"lambda:UpdateFunctionCode\"\n", - " ],\n", - " \"Resource\": [\n", - " \"arn:aws:lambda:*:*:function:*sagemaker*\",\n", - " \"arn:aws:lambda:*:*:function:*sageMaker*\",\n", - " \"arn:aws:lambda:*:*:function:*SageMaker*\"\n", - " ]\n", - " },\n", - " {\n", - " \"Effect\": \"Allow\",\n", - " \"Action\": [\n", - " \"sqs:CreateQueue\",\n", - " \"sqs:SendMessage\"\n", - " ],\n", - " \"Resource\": [\n", - " \"arn:aws:sqs:*:*:*sagemaker*\",\n", - " \"arn:aws:sqs:*:*:*sageMaker*\",\n", - " \"arn:aws:sqs:*:*:*SageMaker*\"\n", - " ]\n", - " },\n", - " {\n", - " \"Effect\": \"Allow\",\n", - " \"Action\": [\n", - " \"iam:PassRole\"\n", - " ],\n", - " \"Resource\": \"arn:aws:iam::*:role/*\",\n", - " \"Condition\": {\n", - " \"StringEquals\": {\n", - " \"iam:PassedToService\": [\n", - " \"lambda.amazonaws.com\"\n", - " ]\n", - " }\n", - " }\n", - " }\n", - " ]\n", - "}\n", - "\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Let's start by importing necessary packages and installing the SageMaker Python SDK." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "import os\n", - "import time\n", - "import boto3\n", - "import sagemaker\n", - "\n", - "from sagemaker.estimator import Estimator\n", - "from sagemaker.inputs import TrainingInput\n", - "\n", - "from sagemaker.processing import (\n", - " ProcessingInput,\n", - " ProcessingOutput,\n", - " Processor,\n", - " ScriptProcessor,\n", - ")\n", - "\n", - "from sagemaker import Model\n", - "from sagemaker.xgboost import XGBoostPredictor\n", - "from sagemaker.sklearn.processing import SKLearnProcessor\n", - "\n", - "from sagemaker.workflow.parameters import (\n", - " ParameterInteger,\n", - " ParameterString,\n", - ")\n", - "from sagemaker.workflow.pipeline import Pipeline\n", - "from sagemaker.workflow.properties import PropertyFile\n", - "from sagemaker.workflow.steps import ProcessingStep, TrainingStep, CacheConfig\n", - "from sagemaker.workflow.lambda_step import (\n", - " LambdaStep,\n", - " LambdaOutput,\n", - " LambdaOutputTypeEnum,\n", - ")\n", - "from sagemaker.workflow.model_step import ModelStep\n", - "from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo\n", - "from sagemaker.workflow.condition_step import ConditionStep\n", - "from sagemaker.workflow.functions import JsonGet\n", - "from sagemaker.workflow.pipeline_context import PipelineSession\n", - "\n", - "from sagemaker.lambda_helper import Lambda\n", - "import sys" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "!{sys.executable} -m pip install \"sagemaker>=2.99.0\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Configuration Setup" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's now configure the setup we need, which includes the session object 
from the SageMaker Python SDK, and necessary configurations for the pipelines, such as object types, input and output buckets and so on." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Create the SageMaker Session\n", - "\n", - "sagemaker_session = sagemaker.Session()\n", - "pipeline_session = PipelineSession()\n", - "sm_client = sagemaker_session.sagemaker_client\n", - "region = sagemaker_session.boto_region_name\n", - "prefix = \"lambda-step-pipeline\"\n", - "\n", - "account_id = sagemaker_session.account_id()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Define variables and parameters needed for the Pipeline steps\n", - "\n", - "role = sagemaker.get_execution_role()\n", - "default_bucket = sagemaker_session.default_bucket()\n", - "base_job_prefix = \"lambda-step-example\"\n", - "s3_prefix = \"lambda-step-pipeline\"\n", - "\n", - "processing_instance_count = ParameterInteger(name=\"ProcessingInstanceCount\", default_value=1)\n", - "training_instance_type = ParameterString(name=\"TrainingInstanceType\", default_value=\"ml.m5.xlarge\")\n", - "model_approval_status = ParameterString(\n", - " name=\"ModelApprovalStatus\", default_value=\"PendingManualApproval\"\n", - ")\n", - "input_data = ParameterString(\n", - " name=\"InputDataUrl\",\n", - " default_value=f\"s3://sagemaker-example-files-prod-{boto3.Session().region_name}/datasets/tabular/uci_abalone/abalone.csv\",\n", - ")\n", - "\n", - "# Cache Pipeline steps to reduce execution time on subsequent executions\n", - "cache_config = CacheConfig(enable_caching=True, expire_after=\"30d\")" - ] - }, - { - "cell_type": "markdown", - 
"metadata": {}, - "source": [ - "## Data Preparation\n", - "\n", - "An SKLearn processor is used to prepare the dataset for the Hyperparameter Tuning job. Using the script `preprocess.py`, the dataset is featurized and split into train, test, and validation datasets.\n", - "\n", - "The output of this step is used as the input to the TrainingStep." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "!mkdir -p code" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "%%writefile code/preprocess.py\n", - "\n", - "\"\"\"Feature engineers the abalone dataset.\"\"\"\n", - "import argparse\n", - "import logging\n", - "import os\n", - "import pathlib\n", - "import requests\n", - "import tempfile\n", - "\n", - "import boto3\n", - "import numpy as np\n", - "import pandas as pd\n", - "\n", - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.impute import SimpleImputer\n", - "from sklearn.pipeline import Pipeline\n", - "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", - "\n", - "logger = logging.getLogger()\n", - "logger.setLevel(logging.INFO)\n", - "logger.addHandler(logging.StreamHandler())\n", - "\n", - "\n", - "# Since we get a headerless CSV file we specify the column names here.\n", - "feature_columns_names = [\n", - " \"sex\",\n", - " \"length\",\n", - " \"diameter\",\n", - " \"height\",\n", - " \"whole_weight\",\n", - " \"shucked_weight\",\n", - " \"viscera_weight\",\n", - " \"shell_weight\",\n", - "]\n", - "label_column = \"rings\"\n", - "\n", - "feature_columns_dtype = {\n", - " \"sex\": str,\n", - " \"length\": np.float64,\n", - " \"diameter\": np.float64,\n", - " \"height\": np.float64,\n", - " \"whole_weight\": np.float64,\n", - " \"shucked_weight\": np.float64,\n", - " \"viscera_weight\": 
np.float64,\n", - " \"shell_weight\": np.float64,\n", - "}\n", - "label_column_dtype = {\"rings\": np.float64}\n", - "\n", - "\n", - "def merge_two_dicts(x, y):\n", - " \"\"\"Merges two dicts, returning a new copy.\"\"\"\n", - " z = x.copy()\n", - " z.update(y)\n", - " return z\n", - "\n", - "\n", - "if __name__ == \"__main__\":\n", - " logger.debug(\"Starting preprocessing.\")\n", - " parser = argparse.ArgumentParser()\n", - " parser.add_argument(\"--input-data\", type=str, required=True)\n", - " args = parser.parse_args()\n", - "\n", - " base_dir = \"/opt/ml/processing\"\n", - " pathlib.Path(f\"{base_dir}/data\").mkdir(parents=True, exist_ok=True)\n", - " input_data = args.input_data\n", - " bucket = input_data.split(\"/\")[2]\n", - " key = \"/\".join(input_data.split(\"/\")[3:])\n", - "\n", - " logger.info(\"Downloading data from bucket: %s, key: %s\", bucket, key)\n", - " fn = f\"{base_dir}/data/abalone-dataset.csv\"\n", - " s3 = boto3.resource(\"s3\")\n", - " s3.Bucket(bucket).download_file(key, fn)\n", - "\n", - " logger.debug(\"Reading downloaded data.\")\n", - " df = pd.read_csv(\n", - " fn,\n", - " header=None,\n", - " names=feature_columns_names + [label_column],\n", - " dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype),\n", - " )\n", - " os.unlink(fn)\n", - "\n", - " logger.debug(\"Defining transformers.\")\n", - " numeric_features = list(feature_columns_names)\n", - " numeric_features.remove(\"sex\")\n", - " numeric_transformer = Pipeline(\n", - " steps=[\n", - " (\"imputer\", SimpleImputer(strategy=\"median\")),\n", - " (\"scaler\", StandardScaler()),\n", - " ]\n", - " )\n", - "\n", - " categorical_features = [\"sex\"]\n", - " categorical_transformer = Pipeline(\n", - " steps=[\n", - " (\"imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n", - " (\"onehot\", OneHotEncoder(handle_unknown=\"ignore\")),\n", - " ]\n", - " )\n", - "\n", - " preprocess = ColumnTransformer(\n", - " transformers=[\n", - " (\"num\", 
numeric_transformer, numeric_features),\n", - " (\"cat\", categorical_transformer, categorical_features),\n", - " ]\n", - " )\n", - "\n", - " logger.info(\"Applying transforms.\")\n", - " y = df.pop(\"rings\")\n", - " X_pre = preprocess.fit_transform(df)\n", - " y_pre = y.to_numpy().reshape(len(y), 1)\n", - "\n", - " X = np.concatenate((y_pre, X_pre), axis=1)\n", - "\n", - " logger.info(\"Splitting %d rows of data into train, validation, test datasets.\", len(X))\n", - " np.random.shuffle(X)\n", - " train, validation, test = np.split(X, [int(0.7 * len(X)), int(0.85 * len(X))])\n", - "\n", - " logger.info(\"Writing out datasets to %s.\", base_dir)\n", - " pd.DataFrame(train).to_csv(f\"{base_dir}/train/train.csv\", header=False, index=False)\n", - " pd.DataFrame(validation).to_csv(\n", - " f\"{base_dir}/validation/validation.csv\", header=False, index=False\n", - " )\n", - " pd.DataFrame(test).to_csv(f\"{base_dir}/test/test.csv\", header=False, index=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Process the training data step using a python script.\n", - "# Split the training data set into train, test, and validation datasets\n", - "\n", - "sklearn_processor = SKLearnProcessor(\n", - " framework_version=\"0.23-1\",\n", - " instance_type=\"ml.m5.xlarge\",\n", - " instance_count=processing_instance_count,\n", - " base_job_name=f\"{base_job_prefix}/sklearn-abalone-preprocess\",\n", - " sagemaker_session=pipeline_session,\n", - " role=role,\n", - ")\n", - "\n", - "processor_args = sklearn_processor.run(\n", - " outputs=[\n", - " ProcessingOutput(output_name=\"train\", source=\"/opt/ml/processing/train\"),\n", - " ProcessingOutput(output_name=\"validation\", source=\"/opt/ml/processing/validation\"),\n", - " ProcessingOutput(output_name=\"test\", source=\"/opt/ml/processing/test\"),\n", - " ],\n", - " code=\"code/preprocess.py\",\n", - " 
arguments=[\"--input-data\", input_data],\n", - ")\n", - "\n", - "step_process = ProcessingStep(\n", - " name=\"PreprocessAbaloneData\",\n", - " step_args=processor_args,\n", - " cache_config=cache_config,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Model Training and Evaluation\n", - "\n", - "We will now train an XGBoost model using the SageMaker Python SDK and the output of the ProcessingStep." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Training the Model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Define the output path for the model artifacts from the training job\n", - "model_path = f\"s3://{default_bucket}/{base_job_prefix}/AbaloneTrain\"\n", - "\n", - "image_uri = sagemaker.image_uris.retrieve(\n", - " framework=\"xgboost\",\n", - " region=region,\n", - " version=\"1.0-1\",\n", - " py_version=\"py3\",\n", - " instance_type=\"ml.m5.xlarge\",\n", - ")\n", - "\n", - "xgb_train = Estimator(\n", - " image_uri=image_uri,\n", - " instance_type=training_instance_type,\n", - " instance_count=1,\n", - " output_path=model_path,\n", - " base_job_name=f\"{prefix}/{base_job_prefix}/sklearn-abalone-preprocess\",\n", - " sagemaker_session=pipeline_session,\n", - " role=role,\n", - ")\n", - "\n", - "xgb_train.set_hyperparameters(\n", - " objective=\"reg:linear\",\n", - " num_round=50,\n", - " max_depth=5,\n", - " eta=0.2,\n", - " gamma=4,\n", - " min_child_weight=6,\n", - " subsample=0.7,\n", - " silent=0,\n", - ")\n", - "\n", - "train_args = xgb_train.fit(\n", - " inputs={\n", - " \"train\": TrainingInput(\n", - " s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\"train\"].S3Output.S3Uri,\n", - " content_type=\"text/csv\",\n", - " ),\n", - " \"validation\": TrainingInput(\n", - " 
s3_data=step_process.properties.ProcessingOutputConfig.Outputs[\n", - " \"validation\"\n", - " ].S3Output.S3Uri,\n", - " content_type=\"text/csv\",\n", - " ),\n", - " },\n", - ")\n", - "\n", - "step_train = TrainingStep(\n", - " name=\"TrainAbaloneModel\",\n", - " step_args=train_args,\n", - " cache_config=cache_config,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Evaluating the model\n", - "\n", - "Use a processing job to evaluate the model from the TrainingStep. If the mean squared error reported by the evaluation is at or below the threshold checked in the ConditionStep, a model is created and a Lambda function is invoked to deploy the model to a SageMaker Endpoint." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "%%writefile code/evaluate.py\n", - "\n", - "\"\"\"Evaluation script for measuring mean squared error.\"\"\"\n", - "import json\n", - "import logging\n", - "import pathlib\n", - "import pickle\n", - "import tarfile\n", - "\n", - "import numpy as np\n", - "import pandas as pd\n", - "import xgboost\n", - "\n", - "from sklearn.metrics import mean_squared_error\n", - "\n", - "logger = logging.getLogger()\n", - "logger.setLevel(logging.INFO)\n", - "logger.addHandler(logging.StreamHandler())\n", - "\n", - "\n", - "if __name__ == \"__main__\":\n", - " logger.debug(\"Starting evaluation.\")\n", - " model_path = \"/opt/ml/processing/model/model.tar.gz\"\n", - " with tarfile.open(model_path) as tar:\n", - " tar.extractall(path=\".\")\n", - "\n", - " logger.debug(\"Loading xgboost model.\")\n", - " model = pickle.load(open(\"xgboost-model\", \"rb\"))\n", - "\n", - " logger.debug(\"Reading test data.\")\n", - " test_path = \"/opt/ml/processing/test/test.csv\"\n", - " df = pd.read_csv(test_path, header=None)\n", - "\n", - " logger.debug(\"Splitting test data into labels and features.\")\n", - " y_test = df.iloc[:, 0].to_numpy()\n", - " df.drop(df.columns[0], axis=1, inplace=True)\n", - " X_test = 
xgboost.DMatrix(df.values)\n", - "\n", - " logger.info(\"Performing predictions against test data.\")\n", - " predictions = model.predict(X_test)\n", - "\n", - " logger.debug(\"Calculating mean squared error.\")\n", - " mse = mean_squared_error(y_test, predictions)\n", - " std = np.std(y_test - predictions)\n", - " report_dict = {\n", - " \"regression_metrics\": {\n", - " \"mse\": {\"value\": mse, \"standard_deviation\": std},\n", - " },\n", - " }\n", - "\n", - " output_dir = \"/opt/ml/processing/evaluation\"\n", - " pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)\n", - "\n", - " logger.info(\"Writing out evaluation report with mse: %f\", mse)\n", - " evaluation_path = f\"{output_dir}/evaluation.json\"\n", - " with open(evaluation_path, \"w\") as f:\n", - " f.write(json.dumps(report_dict))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# A ProcessingStep is used to evaluate the performance of the trained model.\n", - "# Based on the results of the evaluation, the model is created and deployed.\n", - "\n", - "script_eval = ScriptProcessor(\n", - " image_uri=image_uri,\n", - " command=[\"python3\"],\n", - " instance_type=\"ml.m5.xlarge\",\n", - " instance_count=1,\n", - " base_job_name=f\"{prefix}/{base_job_prefix}/sklearn-abalone-preprocess\",\n", - " sagemaker_session=pipeline_session,\n", - " role=role,\n", - ")\n", - "\n", - "evaluation_report = PropertyFile(\n", - " name=\"AbaloneEvaluationReport\",\n", - " output_name=\"evaluation\",\n", - " path=\"evaluation.json\",\n", - ")\n", - "\n", - "eval_args = script_eval.run(\n", - " inputs=[\n", - " ProcessingInput(\n", - " source=step_train.properties.ModelArtifacts.S3ModelArtifacts,\n", - " destination=\"/opt/ml/processing/model\",\n", - " ),\n", - " ProcessingInput(\n", - " source=step_process.properties.ProcessingOutputConfig.Outputs[\"test\"].S3Output.S3Uri,\n", - " 
destination=\"/opt/ml/processing/test\",\n", - " ),\n", - " ],\n", - " outputs=[\n", - " ProcessingOutput(\n", - " output_name=\"evaluation\",\n", - " source=\"/opt/ml/processing/evaluation\",\n", - " destination=f\"s3://{default_bucket}/{s3_prefix}/evaluation_report\",\n", - " ),\n", - " ],\n", - " code=\"code/evaluate.py\",\n", - ")\n", - "step_eval = ProcessingStep(\n", - " name=\"EvaluateAbaloneModel\",\n", - " step_args=eval_args,\n", - " property_files=[evaluation_report],\n", - " cache_config=cache_config,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Creating the final model object\n", - "\n", - "The model is created and the name of the model is provided to the Lambda function for deployment. The `CreateModelStep` dynamically assigns a name to the model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Create Model\n", - "model = Model(\n", - " image_uri=image_uri,\n", - " model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,\n", - " sagemaker_session=pipeline_session,\n", - " role=role,\n", - " predictor_cls=XGBoostPredictor,\n", - ")\n", - "\n", - "step_create_model = ModelStep(\n", - " name=\"CreateModel\",\n", - " step_args=model.create(\"ml.m4.large\"),\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setting up Lambda\n", - "\n", - "When defining the LambdaStep, the SageMaker Lambda helper class provides helper functions for creating the Lambda function. 
Users can either use the `lambda_func` argument to provide the function ARN to an already deployed Lambda function OR use the `Lambda` class to create a Lambda function by providing a script, function name and role for the Lambda function.\n", - "\n", - "When passing inputs to the Lambda, the `inputs` argument can be used and within the Lambda function's handler, the `event` argument can be used to retrieve the inputs.\n", - "\n", - "The dictionary response from the Lambda function is parsed through the `LambdaOutput` objects provided to the `outputs` argument. The `output_name` in `LambdaOutput` corresponds to the dictionary key in the Lambda's return dictionary." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Define the Lambda function\n", - "\n", - "Users can choose to leverage the Lambda helper class to create a Lambda function and provide that function object to the LambdaStep. Alternatively, users can use a pre-deployed Lambda function and provide the function ARN to the `Lambda` helper class in the Lambda step." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "%%writefile code/lambda_helper.py\n", - "\n", - "\"\"\"\n", - "This Lambda function creates an Endpoint Configuration and deploys a model to an Endpoint.\n", - "The name of the model to deploy is provided via the `event` argument\n", - "\"\"\"\n", - "\n", - "import json\n", - "import boto3\n", - "\n", - "\n", - "def lambda_handler(event, context):\n", - " \"\"\" \"\"\"\n", - " sm_client = boto3.client(\"sagemaker\")\n", - "\n", - " # The name of the model created in the Pipeline CreateModelStep\n", - " model_name = event[\"model_name\"]\n", - "\n", - " endpoint_config_name = event[\"endpoint_config_name\"]\n", - " endpoint_name = event[\"endpoint_name\"]\n", - "\n", - " create_endpoint_config_response = sm_client.create_endpoint_config(\n", - " EndpointConfigName=endpoint_config_name,\n", - " ProductionVariants=[\n", - " {\n", - " \"InstanceType\": \"ml.m4.xlarge\",\n", - " \"InitialVariantWeight\": 1,\n", - " \"InitialInstanceCount\": 1,\n", - " \"ModelName\": model_name,\n", - " \"VariantName\": \"AllTraffic\",\n", - " }\n", - " ],\n", - " )\n", - "\n", - " create_endpoint_response = sm_client.create_endpoint(\n", - " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", - " )\n", - "\n", - " return {\n", - " \"statusCode\": 200,\n", - " \"body\": json.dumps(\"Created Endpoint!\"),\n", - " \"other_key\": \"example_value\",\n", - " }" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Setting up the custom IAM Role\n", - "\n", - "The Lambda function needs an IAM role that allows it to deploy a SageMaker Endpoint. 
The role ARN must be provided in the LambdaStep.\n", - "\n", - "The Lambda role should at minimum have policies to allow `sagemaker:CreateModel`, `sagemaker:CreateEndpointConfig`, `sagemaker:CreateEndpoint` in addition to the basic Lambda execution policies.\n", - "\n", - "A helper function in `iam_helper.py` is available to create the Lambda function role. Please note that the role uses the Amazon managed policy - `SageMakerFullAccess`. This should be replaced with an IAM policy with least privileges as per AWS IAM best practices." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "from iam_helper import create_lambda_role\n", - "\n", - "lambda_role = create_lambda_role(\"lambda-deployment-role\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Custom Lambda Step\n", - "\n", - "current_time = time.strftime(\"%m-%d-%H-%M-%S\", time.localtime())\n", - "model_name = \"demo-lambda-model\" + current_time\n", - "endpoint_config_name = \"demo-lambda-deploy-endpoint-config-\" + current_time\n", - "endpoint_name = \"demo-lambda-deploy-endpoint-\" + current_time\n", - "\n", - "function_name = \"sagemaker-lambda-step-endpoint-deploy-\" + current_time\n", - "\n", - "# Lambda helper class can be used to create the Lambda function\n", - "func = Lambda(\n", - " function_name=function_name,\n", - " execution_role_arn=lambda_role,\n", - " script=\"code/lambda_helper.py\",\n", - " handler=\"lambda_helper.lambda_handler\",\n", - ")\n", - "\n", - "output_param_1 = LambdaOutput(output_name=\"statusCode\", output_type=LambdaOutputTypeEnum.String)\n", - "output_param_2 = LambdaOutput(output_name=\"body\", output_type=LambdaOutputTypeEnum.String)\n", - "output_param_3 = LambdaOutput(output_name=\"other_key\", 
output_type=LambdaOutputTypeEnum.String)\n", - "\n", - "step_deploy_lambda = LambdaStep(\n", - " name=\"LambdaStep\",\n", - " lambda_func=func,\n", - " inputs={\n", - " \"model_name\": step_create_model.properties.ModelName,\n", - " \"endpoint_config_name\": endpoint_config_name,\n", - " \"endpoint_name\": endpoint_name,\n", - " },\n", - " outputs=[output_param_1, output_param_2, output_param_3],\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# ConditionStep for evaluating model quality and branching execution.\n", - "# The `json_path` value is based on the `report_dict` variable in `evaluate.py`\n", - "\n", - "cond_lte = ConditionLessThanOrEqualTo(\n", - " left=JsonGet(\n", - " step_name=step_eval.name,\n", - " property_file=evaluation_report,\n", - " json_path=\"regression_metrics.mse.value\",\n", - " ),\n", - " right=6.0,\n", - ")\n", - "\n", - "step_cond = ConditionStep(\n", - " name=\"CheckMSEAbaloneEvaluation\",\n", - " conditions=[cond_lte],\n", - " if_steps=[step_create_model, step_deploy_lambda],\n", - " else_steps=[],\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Use the same pipeline name across executions for cache usage.\n", - "\n", - "pipeline_name = \"lambda-step-pipeline\" + current_time\n", - "\n", - "pipeline = Pipeline(\n", - " name=pipeline_name,\n", - " parameters=[\n", - " processing_instance_count,\n", - " training_instance_type,\n", - " input_data,\n", - " model_approval_status,\n", - " ],\n", - " steps=[step_process, step_train, step_eval, step_cond],\n", - " sagemaker_session=pipeline_session,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Execute the Pipeline" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": 
{ - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "import json\n", - "\n", - "definition = json.loads(pipeline.definition())\n", - "definition" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "pipeline.upsert(role_arn=role)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "execution = pipeline.start()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "execution.wait()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Create a SageMaker client\n", - "sm_client = sagemaker.Session().sagemaker_client\n", - "\n", - "# Wait for the endpoint to be in service\n", - "waiter = sm_client.get_waiter(\"endpoint_in_service\")\n", - "waiter.wait(EndpointName=endpoint_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Clean up resources\n", - "\n", - "Running the following cell will delete the following resources created in this notebook -\n", - "* SageMaker Model\n", - "* SageMaker Endpoint Configuration\n", - "* SageMaker Endpoint\n", - "* SageMaker Pipeline\n", - "* Lambda Function" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Get the model name from the EndpointConfig. 
The CreateModelStep properties are not available\n", - "# outside the Pipeline execution context so `step_create_model.properties.ModelName`\n", - "# cannot be used while deleting the model.\n", - "\n", - "model_name = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)[\n", - " \"ProductionVariants\"\n", - "][0][\"ModelName\"]\n", - "\n", - "# Delete the Model\n", - "sm_client.delete_model(ModelName=model_name)\n", - "\n", - "# Delete the EndpointConfig\n", - "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", - "\n", - "# Delete the Endpoint\n", - "sm_client.delete_endpoint(EndpointName=endpoint_name)\n", - "\n", - "# Delete the Lambda function\n", - "func.delete()\n", - "\n", - "# Delete the Pipeline\n", - "sm_client.delete_pipeline(PipelineName=pipeline_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-pipelines|tabular|lambda-step|sagemaker-pipelines-lambda-step.ipynb)\n" - ] - } - ], - "metadata": { - "availableInstances": [ - { - "_defaultOrder": 0, - "_isFastLaunch": true, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 4, - "name": "ml.t3.medium", - "vcpuNum": 2 - }, - { - "_defaultOrder": 1, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.t3.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 2, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.t3.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 3, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.t3.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 4, - "_isFastLaunch": true, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.m5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 5, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.m5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 6, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.m5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 7, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.m5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 8, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - 
"memoryGiB": 128, - "name": "ml.m5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 9, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.m5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 10, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.m5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 11, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.m5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 12, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.m5d.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 13, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.m5d.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 14, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.m5d.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 15, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.m5d.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 16, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.m5d.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 17, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.m5d.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 18, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - 
"memoryGiB": 256, - "name": "ml.m5d.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 19, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.m5d.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 20, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": true, - "memoryGiB": 0, - "name": "ml.geospatial.interactive", - "supportedImageNames": [ - "sagemaker-geospatial-v1-0" - ], - "vcpuNum": 0 - }, - { - "_defaultOrder": 21, - "_isFastLaunch": true, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 4, - "name": "ml.c5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 22, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.c5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 23, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.c5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 24, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.c5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 25, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 72, - "name": "ml.c5.9xlarge", - "vcpuNum": 36 - }, - { - "_defaultOrder": 26, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 96, - "name": "ml.c5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 27, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 144, - "name": "ml.c5.18xlarge", - "vcpuNum": 72 - }, - { - "_defaultOrder": 28, - "_isFastLaunch": false, - "category": 
"Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.c5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 29, - "_isFastLaunch": true, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.g4dn.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 30, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.g4dn.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 31, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.g4dn.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 32, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.g4dn.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 33, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.g4dn.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 34, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.g4dn.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 35, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 61, - "name": "ml.p3.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 36, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 244, - "name": "ml.p3.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 37, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 488, - "name": "ml.p3.16xlarge", - "vcpuNum": 64 - }, - { - 
"_defaultOrder": 38, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.p3dn.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 39, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.r5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 40, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.r5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 41, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.r5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 42, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.r5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 43, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.r5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 44, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.r5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 45, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 512, - "name": "ml.r5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 46, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.r5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 47, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.g5.xlarge", - "vcpuNum": 4 - }, - 
{ - "_defaultOrder": 48, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.g5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 49, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.g5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 50, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.g5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 51, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.g5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 52, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.g5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 53, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.g5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 54, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.g5.48xlarge", - "vcpuNum": 192 - }, - { - "_defaultOrder": 55, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 1152, - "name": "ml.p4d.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 56, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 1152, - "name": "ml.p4de.24xlarge", - "vcpuNum": 96 - } - ], - "kernelspec": { - "display_name": "Python 3 (Data Science 3.0)", - "language": "python", - "name": 
"python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-310-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - }, - "metadata": { - "interpreter": { - "hash": "ac2eaa0ea0ebeafcc7822e65e46aa9d4f966f30b695406963e145ea4a91cd4fc" - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/sagemaker-python-sdk/scikit_learn_iris/scikit_learn_estimator_example_with_batch_transform.ipynb b/sagemaker-python-sdk/scikit_learn_iris/scikit_learn_estimator_example_with_batch_transform.ipynb deleted file mode 100644 index 4523a10420..0000000000 --- a/sagemaker-python-sdk/scikit_learn_iris/scikit_learn_estimator_example_with_batch_transform.ipynb +++ /dev/null @@ -1,684 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "# Iris Training and Prediction with Sagemaker Scikit-learn\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "This tutorial shows you how to use [Scikit-learn](https://scikit-learn.org/stable/) with SageMaker by utilizing the pre-built container. Scikit-learn is a popular Python machine learning framework. It includes a number of different algorithms for classification, regression, clustering, dimensionality reduction, and data/feature pre-processing. \n", - "\n", - "The [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) module makes it easy to take existing scikit-learn code, which we show by training a model on the Iris dataset and generating a set of predictions. For more information about the Scikit-learn container, see the [sagemaker-scikit-learn-containers](https://github.com/aws/sagemaker-scikit-learn-container) repository and the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) repository.\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 15 minutes to run.\n", - "\n", - "## Contents\n", - "* [Upload the data for training](#upload_data)\n", - "* [Create a Scikit-learn script to train with](#create_sklearn_script)\n", - "* [Create the SageMaker Scikit Estimator](#create_sklearn_estimator)\n", - "* [Train the SKLearn Estimator on the Iris data](#train_sklearn)\n", - "* [Use the trained model to make inference requests](#inference)\n", - " * [Deploy the model](#deploy)\n", - " * [Choose some data and use it for a prediction](#prediction_request)\n", - " * [Endpoint cleanup](#endpoint_cleanup)\n", - "* [Batch Transform](#batch_transform)\n", - " * [Prepare Input Data](#prepare_input_data)\n", - " * 
[Run Transform Job](#run_transform_job)\n", - " * [Check Output Data](#check_output_data)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "%pip install -U \"sagemaker>=2.15\"" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "First, let's create our SageMaker session and role, and create an S3 prefix to use for the notebook example." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# S3 prefix\n", - "prefix = \"DEMO-scikit-iris\"\n", - "\n", - "import sagemaker\n", - "from sagemaker import get_execution_role\n", - "\n", - "sagemaker_session = sagemaker.Session()\n", - "region = sagemaker_session.boto_region_name\n", - "role = get_execution_role()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Upload the data for training \n", - "\n", - "When training large models with huge amounts of data, you may use big data tools like Amazon Athena, AWS Glue, or Amazon EMR to process your data backed by S3. For the purposes of this example, we're using a sample of the classic [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris). We load the dataset, write it locally, then upload it to S3." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import boto3\n", - "import numpy as np\n", - "import pandas as pd\n", - "import os\n", - "\n", - "os.makedirs(\"./data\", exist_ok=True)\n", - "\n", - "s3_client = boto3.client(\"s3\")\n", - "s3_client.download_file(\n", - " f\"sagemaker-example-files-prod-{region}\", \"datasets/tabular/iris/iris.data\", \"./data/iris.csv\"\n", - ")\n", - "\n", - "df_iris = pd.read_csv(\"./data/iris.csv\", header=None)\n", - "df_iris[4] = df_iris[4].map({\"Iris-setosa\": 0, \"Iris-versicolor\": 1, \"Iris-virginica\": 2})\n", - "iris = df_iris[[4, 0, 1, 2, 3]].to_numpy()\n", - "np.savetxt(\"./data/iris.csv\", iris, delimiter=\",\", fmt=\"%1.1f, %1.3f, %1.3f, %1.3f, %1.3f\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Once we have the data locally, we can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "WORK_DIRECTORY = \"data\"\n", - "\n", - "train_input = sagemaker_session.upload_data(\n", - " WORK_DIRECTORY, key_prefix=\"{}/{}\".format(prefix, WORK_DIRECTORY)\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Create a Scikit-learn script for training \n", - "SageMaker can run a scikit-learn script using the `SKLearn` estimator. When run on SageMaker, a number of helpful environment variables are available to access properties of the training environment, such as:\n", - "\n", - "* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. 
Any artifacts saved in this folder are uploaded to S3 for model hosting after the training job completes.\n", - "* `SM_OUTPUT_DIR`: A string representing the file system path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.\n", - "\n", - "Supposing two input channels, 'train' and 'test', were used in the call to the `SKLearn` estimator's `fit()` method, the following environment variables are set, following the format `SM_CHANNEL_[channel_name]`:\n", - "\n", - "* `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel.\n", - "* `SM_CHANNEL_TEST`: Same as above, but for the 'test' channel.\n", - "\n", - "A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to the `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script that we run in this notebook is below:\n", - "\n", - "```python\n", - "from __future__ import print_function\n", - "\n", - "import argparse\n", - "import joblib\n", - "import os\n", - "import pandas as pd\n", - "\n", - "from sklearn import tree\n", - "\n", - "\n", - "if __name__ == '__main__':\n", - " parser = argparse.ArgumentParser()\n", - "\n", - " # Hyperparameters are described here. In this simple example we are just including one hyperparameter.\n", - " parser.add_argument('--max_leaf_nodes', type=int, default=-1)\n", - "\n", - " # Sagemaker specific arguments. 
Defaults are set in the environment variables.\n", - " parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])\n", - " parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])\n", - " parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])\n", - "\n", - " args = parser.parse_args()\n", - "\n", - " # Take the set of files and read them all into a single pandas dataframe\n", - " input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]\n", - " if len(input_files) == 0:\n", - " raise ValueError(('There are no files in {}.\\n' +\n", - " 'This usually indicates that the channel ({}) was incorrectly specified,\\n' +\n", - " 'the data specification in S3 was incorrectly specified or the role specified\\n' +\n", - " 'does not have permission to access the data.').format(args.train, \"train\"))\n", - " raw_data = [ pd.read_csv(file, header=None, engine=\"python\") for file in input_files ]\n", - " train_data = pd.concat(raw_data)\n", - "\n", - " # labels are in the first column\n", - " train_y = train_data.iloc[:, 0]\n", - " train_X = train_data.iloc[:, 1:]\n", - "\n", - " # Here we support a single hyperparameter, 'max_leaf_nodes'. 
Note that you can add as many\n", - " # as your training may require in the ArgumentParser above.\n", - " max_leaf_nodes = args.max_leaf_nodes\n", - "\n", - " # Now use scikit-learn's decision tree classifier to train the model.\n", - " clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)\n", - " clf = clf.fit(train_X, train_y)\n", - "\n", - " # Save the trained classifier to the model directory\n", - " joblib.dump(clf, os.path.join(args.model_dir, \"model.joblib\"))\n", - "\n", - "\n", - "def model_fn(model_dir):\n", - " \"\"\"Deserialize and return the fitted model\n", - " \n", - " Note that this should have the same name as the serialized model in the main method\n", - " \"\"\"\n", - " clf = joblib.load(os.path.join(model_dir, \"model.joblib\"))\n", - " return clf\n", - "```" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Because the Scikit-learn container imports your training script, you should always put your training code in a main guard `(if __name__=='__main__':)` so that the container does not inadvertently run your training code at the wrong point in execution.\n", - "\n", - "For more information about training environment variables, please visit https://github.com/aws/sagemaker-containers." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Create a SageMaker SKLearn Estimator \n", - "\n", - "To run our Scikit-learn training script on SageMaker, we construct a `sagemaker.sklearn.estimator.SKLearn` estimator, which accepts several constructor arguments:\n", - "\n", - "* __entry_point__: The path to the Python script SageMaker runs for training and prediction.\n", - "* __role__: The IAM role ARN.\n", - "* __instance_type__ *(optional)*: The type of SageMaker instances for training. 
__Note__: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.\n", - "* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.\n", - "* __hyperparameters__ *(optional)*: A dictionary passed to the train function as hyperparameters.\n", - "\n", - "To see the code for the SKLearn Estimator, see: https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/sklearn" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "from sagemaker.sklearn.estimator import SKLearn\n", - "\n", - "FRAMEWORK_VERSION = \"1.2-1\"\n", - "script_path = \"scikit_learn_iris.py\"\n", - "\n", - "sklearn = SKLearn(\n", - " entry_point=script_path,\n", - " framework_version=FRAMEWORK_VERSION,\n", - " instance_type=\"ml.c4.xlarge\",\n", - " role=role,\n", - " sagemaker_session=sagemaker_session,\n", - " hyperparameters={\"max_leaf_nodes\": 30},\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Train SKLearn Estimator on Iris data \n", - "Training is straightforward, just call `fit()` on the Estimator! This starts a SageMaker training job that downloads the data, invokes our scikit-learn code (in the provided script file), and saves any model artifacts that the script creates." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - }, - "scrolled": true - }, - "outputs": [], - "source": [ - "sklearn.fit({\"train\": train_input})" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Use the trained model to make inference requests \n", - "\n", - "### Deploy the model \n", - "\n", - "Deploying the model to SageMaker hosting just requires a `deploy()` call on the fitted model. This call takes an instance count and instance type." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "predictor = sklearn.deploy(initial_instance_count=1, instance_type=\"ml.m5.xlarge\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### Choose some data and use it for a prediction \n", - "\n", - "We extract some data we used for training and make predictions on it. This is not a recommended statistical practice, but it demonstrates how to run inference using the deployed endpoint." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import itertools\n", - "import pandas as pd\n", - "\n", - "shape = pd.read_csv(\"data/iris.csv\", header=None)\n", - "\n", - "a = [50 * i for i in range(3)]\n", - "b = [40 + i for i in range(10)]\n", - "indices = [i + j for i, j in itertools.product(a, b)]\n", - "\n", - "test_data = shape.iloc[indices[:-1]]\n", - "test_X = test_data.iloc[:, 1:]\n", - "test_y = test_data.iloc[:, 0]" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "To make a prediction, call `predict()` on the predictor returned from `deploy()`, passing the data to do predictions on. The output from the endpoint returns a numerical representation of the classification prediction; in the original dataset, these are three flower category names, but in this example the labels are numerical. We can compare against the original label that we parsed." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "print(predictor.predict(test_X.values))\n", - "print(test_y.values)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### Endpoint cleanup \n", - "\n", - "When you're done with the endpoint, delete it to release the resources and avoid incurring additional cost." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "predictor.delete_endpoint()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Batch Transform \n", - "We can also use the trained model for asynchronous batch inference on S3 data using SageMaker Batch Transform." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Define an SKLearn Transformer from the trained SKLearn Estimator\n", - "transformer = sklearn.transformer(instance_count=1, instance_type=\"ml.m5.xlarge\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### Prepare Input Data \n", - "We extract 10 random samples of 100 rows from the training data, split the features (X) from the labels (Y), and upload the input data to a given location in S3." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "%%bash\n", - "# Randomly sample the iris dataset 10 times, then split X and Y\n", - "mkdir -p batch_data/XY batch_data/X batch_data/Y\n", - "for i in {0..9}; do\n", - " cat data/iris.csv | shuf -n 100 > batch_data/XY/iris_sample_${i}.csv\n", - " cat batch_data/XY/iris_sample_${i}.csv | cut -d',' -f2- > batch_data/X/iris_sample_X_${i}.csv\n", - " cat batch_data/XY/iris_sample_${i}.csv | cut -d',' -f1 > batch_data/Y/iris_sample_Y_${i}.csv\n", - "done" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Upload input data from local file system to S3\n", - "batch_input_s3 = sagemaker_session.upload_data(\"batch_data/X\", key_prefix=prefix + \"/batch_input\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### Run Transform Job \n", - "Using the Transformer, run a transform job on the S3 input data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Start a transform job and wait for it to finish\n", - "transformer.transform(batch_input_s3, content_type=\"text/csv\")\n", - "print(\"Waiting for transform job: \" + transformer.latest_transform_job.job_name)\n", - "transformer.wait()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "### Check Output Data \n", - "After the transform job has completed, download the output data from S3. For each file \"f\" in the input data, we have a corresponding file \"f.out\" containing the predicted labels from each input row. We can compare the predicted labels to the true labels saved earlier." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Download the output data from S3 to local file system\n", - "batch_output = transformer.output_path\n", - "!mkdir -p batch_data/output\n", - "!aws s3 cp --recursive $batch_output/ batch_data/output/\n", - "# Head to see what the batch output looks like\n", - "!head batch_data/output/*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "%%bash\n", - "# For each sample file, compare the predicted labels from batch output to the true labels\n", - "for i in {1..9}; do\n", - " diff -s batch_data/Y/iris_sample_Y_${i}.csv \\\n", - " <(cat batch_data/output/iris_sample_X_${i}.csv.out | sed 's/[[\"]//g' | sed 's/, \\|]/\\n/g') \\\n", - " | sed \"s/\\/dev\\/fd\\/63/batch_data\\/output\\/iris_sample_X_${i}.csv.out/\"\n", - "done" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-python-sdk|scikit_learn_iris|scikit_learn_estimator_example_with_batch_transform.ipynb)\n" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (Data Science 3.0)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/sagemaker-script-mode/pytorch_bert/deploy_bert.ipynb b/sagemaker-script-mode/pytorch_bert/deploy_bert.ipynb deleted file mode 100644 index 3672db392b..0000000000 --- a/sagemaker-script-mode/pytorch_bert/deploy_bert.ipynb +++ /dev/null @@ -1,295 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Host a Pretrained Model on SageMaker\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "Amazon SageMaker is a service to accelerate the entire machine learning lifecycle. 
It includes components for building, training and deploying machine learning models. Each SageMaker component is modular, so you're welcome to only use the features needed for your use case. One of the most popular features of SageMaker is model hosting. Using SageMaker hosting, you can deploy your model as a scalable, highly available, multi-process API endpoint with a few lines of code. Read more at [Deploy a Model in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html). In this notebook, we demonstrate how to host a pretrained BERT model in Amazon SageMaker to extract embeddings from text.\n", - "\n", - "SageMaker provides prebuilt containers that can be used for training, hosting, or data processing. The inference containers include a web serving stack, so you don't need to install and configure one. We use the SageMaker PyTorch container, but you may use the TensorFlow container, or bring your own container if needed. See all containers at [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers).\n", - "\n", - "This notebook walks you through how to deploy a pretrained Hugging Face model as a scalable, highly available, production-ready API.\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 5 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [Retrieve Model Artifacts](#Retrieve-Model-Artifacts)\n", - "1. [Write the Inference Script](#Write-the-Inference-Script)\n", - "1. [Package Model](#Package-Model)\n", - "1. [Deploy Model](#Deploy-Model)\n", - "1. [Get Predictions](#Get-Predictions)\n", - "1. [Conclusion](#Conclusion)\n", - "1. [Cleanup](#Cleanup)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Retrieve Model Artifacts\n", - "\n", - "First we download the model artifacts for the pretrained BERT model. BERT is a popular natural language processing (NLP) model that extracts meaning and context from text. 
You can read the original paper, [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "!pip install transformers==3.3.1 sagemaker==2.15.0 --quiet" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from transformers import BertTokenizer, BertModel\n", - "\n", - "tokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")\n", - "model = BertModel.from_pretrained(\"bert-base-uncased\")\n", - "\n", - "model_path = \"model/\"\n", - "code_path = \"code/\"\n", - "\n", - "if not os.path.exists(model_path):\n", - " os.mkdir(model_path)\n", - "\n", - "model.save_pretrained(save_directory=model_path)\n", - "tokenizer.save_pretrained(save_directory=model_path)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Write the Inference Script\n", - "\n", - "Since we are bringing a model to SageMaker, we must create an inference script. The script runs inside our PyTorch container. Our script should include a function for model loading, and optionally functions generating predictions, and input/output processing. The PyTorch container provides default implementations for generating a prediction and input/output processing. By including these functions in your script you are overriding the default functions. You can find additional details at [Serve a PyTorch Model](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#serve-a-pytorch-model).\n", - "\n", - "The next cell shows our inference script, which uses the [Transformers library from HuggingFace](https://huggingface.co/transformers/). This library is not installed in the container by default, so we add it in the next section." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize code/inference_code.py" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Package Model\n", - "\n", - "For hosting, SageMaker requires that the deployment package be structured in a compatible format. It expects all files to be packaged in a tar archive named \"model.tar.gz\" with gzip compression. To install additional libraries at container startup, we can add a requirements.txt file that specifies the libraries to be installed using [pip](https://pypi.org/project/pip/). Read more at [Using Third-Party Libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries). Within the archive, the PyTorch container expects all inference code and requirements.txt file to be inside the code/ directory. See the [Model Directory Structure](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#model-directory-structure) guide for a thorough explanation of the required directory structure. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import tarfile\n", - "\n", - "zipped_model_path = os.path.join(model_path, \"model.tar.gz\")\n", - "\n", - "with tarfile.open(zipped_model_path, \"w:gz\") as tar:\n", - " tar.add(model_path)\n", - " tar.add(code_path)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Deploy Model\n", - "\n", - "Now that we have our deployment package, we can use the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html) to deploy our API endpoint with two lines of code. We need to specify an IAM role for the SageMaker endpoint to use. Minimally, it needs read access to the default SageMaker bucket (usually named `s3://sagemaker-{region}-{your account ID}`) so it can read the deployment package. 
When we call `deploy()`, the SDK saves our deployment archive to S3 for the SageMaker endpoint to use. We use the helper function [get_execution_role()](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html?highlight=get_execution_role#sagemaker.session.get_execution_role) to retrieve our current IAM role so we can pass it to the SageMaker endpoint. Minimally it requires read access to the model artifacts in S3 and the [ECR repository](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) where the container image is stored by AWS.\n", - "\n", - "\n", - "You may notice that we specify our PyTorch version and Python version when creating the PyTorchModel object. The SageMaker SDK uses these parameters to determine which PyTorch container to use. \n", - "\n", - "We use an m5.xlarge instance for our endpoint to ensure we have sufficient memory to serve our model. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.pytorch import PyTorchModel\n", - "from sagemaker import get_execution_role\n", - "import time\n", - "\n", - "endpoint_name = \"bert-base-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", - "\n", - "model = PyTorchModel(\n", - " entry_point=\"inference_code.py\",\n", - " model_data=zipped_model_path,\n", - " role=get_execution_role(),\n", - " framework_version=\"1.5\",\n", - " py_version=\"py3\",\n", - ")\n", - "\n", - "predictor = model.deploy(\n", - " initial_instance_count=1, instance_type=\"ml.m5.xlarge\", endpoint_name=endpoint_name\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Get Predictions\n", - "\n", - "Now that our API endpoint is deployed, we send it text to get predictions from our BERT model. 
You can use the SageMaker SDK or the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) method of the SageMaker Runtime API to invoke the endpoint. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import sagemaker\n", - "\n", - "sm = sagemaker.Session().sagemaker_runtime_client\n", - "\n", - "prompt = \"The best part of Amazon SageMaker is that it makes machine learning easy.\"\n", - "\n", - "response = sm.invoke_endpoint(\n", - " EndpointName=endpoint_name, Body=prompt.encode(encoding=\"UTF-8\"), ContentType=\"text/csv\"\n", - ")\n", - "\n", - "response[\"Body\"].read()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Cleanup\n", - "\n", - "Delete the model and endpoint to release resources and stop incurring costs." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "predictor.delete_model()\n", - "predictor.delete_endpoint()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "We have successfully created a scalable, highly available, RESTful API that is backed by a BERT model! It can be used for downstream NLP tasks like text classification. 
If you are still interested in learning more, check out some of the more advanced features of SageMaker hosting, like [Monitor models for data and model quality, bias, and explainability](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) to detect concept drift, [Automatically Scale Amazon SageMaker Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html) to dynamically adjust the number of instances, or [Give SageMaker Hosted Endpoints Access to Resources in Your Amazon VPC](https://docs.aws.amazon.com/sagemaker/latest/dg/host-vpc.html) to control network access to/from your endpoint.\n", - "\n", - "You can also read the blog [Deploy machine learning models to Amazon SageMaker using the ezsmdeploy Python package and a few lines of code](https://aws.amazon.com/blogs/opensource/deploy-machine-learning-models-to-amazon-sagemaker-using-the-ezsmdeploy-python-package-and-a-few-lines-of-code/). The ezsmdeploy package automates most of this process." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-script-mode|pytorch_bert|deploy_bert.ipynb)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (PyTorch 1.10 Python 3.8 CPU Optimized)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/pytorch-1.10-cpu-py38" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/sagemaker-script-mode/sklearn/sklearn_byom.ipynb b/sagemaker-script-mode/sklearn/sklearn_byom.ipynb deleted file mode 100644 index 7d63f2d915..0000000000 --- a/sagemaker-script-mode/sklearn/sklearn_byom.ipynb +++ /dev/null @@ -1,445 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "e950fa8e", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "# Train a SKLearn Model using Script Mode\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "0abdc17b", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "90e7cac6", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "\n", - "The aim of this notebook is to demonstrate how to train and deploy a scikit-learn model in Amazon SageMaker. The method used is called Script Mode, in which we write a script to train our model and submit it to the SageMaker Python SDK. For more information, feel free to read [Using Scikit-learn with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html).\n", - "\n", - "## Runtime\n", - "This notebook takes approximately 15 minutes to run.\n", - "\n", - "## Contents\n", - "1. [Download data](#Download-data)\n", - "1. [Prepare data](#Prepare-data)\n", - "1. [Train model](#Train-model)\n", - "1. [Deploy and test endpoint](#Deploy-and-test-endpoint)\n", - "1. [Cleanup](#Cleanup)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "a16db1a6", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Download data \n", - "Download the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris), which is the data used to train the model in this demo." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e2d5c27c", - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "!pip install -U sagemaker" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a670c242", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import boto3\n", - "import pandas as pd\n", - "import numpy as np\n", - "\n", - "s3 = boto3.client(\"s3\")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{boto3.session.Session().region_name}\",\n", - " \"datasets/tabular/iris/iris.data\",\n", - " \"iris.data\",\n", - ")\n", - "\n", - "df = pd.read_csv(\n", - " \"iris.data\", header=None, names=[\"sepal_len\", \"sepal_wid\", \"petal_len\", \"petal_wid\", \"class\"]\n", - ")\n", - "df.head()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "7c03b3d2", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Prepare data\n", - "Next, we prepare the data for training by first converting the labels from string to integers. Then we split the data into a train dataset (80% of the data) and test dataset (the remaining 20% of the data) before saving them into CSV files. Then, these files are uploaded to S3 where the SageMaker SDK can access and use them to train the model." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "72748b04", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Convert the three classes from strings to integers in {0,1,2}\n", - "df[\"class_cat\"] = df[\"class\"].astype(\"category\").cat.codes\n", - "categories_map = dict(enumerate(df[\"class\"].astype(\"category\").cat.categories))\n", - "print(categories_map)\n", - "df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fb5ea6cf", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Split the data into 80-20 train-test split\n", - "num_samples = df.shape[0]\n", - "split = round(num_samples * 0.8)\n", - "train = df.iloc[:split, :]\n", - "test = df.iloc[split:, :]\n", - "print(\"{} train, {} test\".format(split, num_samples - split))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "48770a6b", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Write train and test CSV files\n", - "train.to_csv(\"train.csv\", index=False)\n", - "test.to_csv(\"test.csv\", index=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ba40dab3", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Create a sagemaker session to upload data to S3\n", - "import sagemaker\n", - "\n", - "sagemaker_session = sagemaker.Session()\n", - "\n", - "# Upload data to default S3 bucket\n", - "prefix = \"DEMO-sklearn-iris\"\n", - "training_input_path = sagemaker_session.upload_data(\"train.csv\", key_prefix=prefix + \"/training\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "9d52c534", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Train model\n", - "The model is trained using the SageMaker SDK's Estimator class. Firstly, get the execution role for training. 
This role allows us to access the S3 bucket in the last step, where the train and test data set is located." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f7cbdad2", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Use the current execution role for training. It needs access to S3\n", - "role = sagemaker.get_execution_role()\n", - "print(role)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "10cdcfb6", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Then, it is time to define the SageMaker SDK Estimator class. We use an Estimator class specifically designed to train scikit-learn models called `SKLearn`. In this estimator, we define the following parameters:\n", - "1. The script that we want to use to train the model (i.e. `entry_point`). This is the heart of the Script Mode method. Additionally, set the `script_mode` parameter to `True`.\n", - "1. The role which allows us access to the S3 bucket containing the train and test data set (i.e. `role`)\n", - "1. How many instances we want to use in training (i.e. `instance_count`) and what type of instance we want to use in training (i.e. `instance_type`)\n", - "1. Which version of scikit-learn to use (i.e. `framework_version`)\n", - "1. Training hyperparameters (i.e. `hyperparameters`)\n", - "\n", - "After setting these parameters, the `fit` function is invoked to train the model." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ac14dcb7", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# Docs: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html\n", - "\n", - "from sagemaker.sklearn import SKLearn\n", - "\n", - "sk_estimator = SKLearn(\n", - " entry_point=\"train.py\",\n", - " role=role,\n", - " instance_count=1,\n", - " instance_type=\"ml.c5.xlarge\",\n", - " py_version=\"py3\",\n", - " framework_version=\"1.2-1\",\n", - " script_mode=True,\n", - " hyperparameters={\"estimators\": 20},\n", - ")\n", - "\n", - "# Train the estimator\n", - "sk_estimator.fit({\"train\": training_input_path})" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "3813b62c", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Deploy and test endpoint\n", - "After training the model, it is time to deploy it as an endpoint. To do so, we invoke the `deploy` function within the scikit-learn estimator. As shown in the code below, one can define the number of instances (i.e. `initial_instance_count`) and instance type (i.e. `instance_type`) used to deploy the model." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "06aace5c", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import time\n", - "\n", - "sk_endpoint_name = \"sklearn-rf-model\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", - "sk_predictor = sk_estimator.deploy(\n", - " initial_instance_count=1, instance_type=\"ml.m5.large\", endpoint_name=sk_endpoint_name\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "bbc747e1", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "After the endpoint has been completely deployed, it can be invoked using the [SageMaker Runtime Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html) (which is the method used in the code cell below) or [Scikit Learn Predictor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-predictor). If you plan to use the latter method, make sure to use a [Serializer](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html) to serialize your data properly." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "85491166", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import json\n", - "\n", - "client = sagemaker_session.sagemaker_runtime_client\n", - "\n", - "request_body = {\"Input\": [[9.0, 3571, 1976, 0.525]]}\n", - "data = json.loads(json.dumps(request_body))\n", - "payload = json.dumps(data)\n", - "\n", - "response = client.invoke_endpoint(\n", - " EndpointName=sk_endpoint_name, ContentType=\"application/json\", Body=payload\n", - ")\n", - "\n", - "result = json.loads(response[\"Body\"].read().decode())[\"Output\"]\n", - "print(\"Predicted class category {} ({})\".format(result, categories_map[result]))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "90f26921", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Cleanup\n", - "If the model and endpoint are no longer in use, they should be deleted to save costs and free up resources." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "ec5a3a83", - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "sk_predictor.delete_model()\n", - "sk_predictor.delete_endpoint()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "454a7ca7", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker-script-mode|sklearn|sklearn_byom.ipynb)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (Data Science 3.0)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/sagemaker_batch_transform/pytorch_mnist_batch_transform/pytorch-mnist-batch-transform.ipynb b/sagemaker_batch_transform/pytorch_mnist_batch_transform/pytorch-mnist-batch-transform.ipynb deleted file mode 100644 index 606743e2a9..0000000000 --- a/sagemaker_batch_transform/pytorch_mnist_batch_transform/pytorch-mnist-batch-transform.ipynb +++ /dev/null @@ -1,2290 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "id": "8c8a3cea", - "metadata": { - "papermill": { - "duration": 0.009489, - "end_time": "2021-06-03T00:10:10.266437", - "exception": false, - "start_time": "2021-06-03T00:10:10.256948", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "# Use SageMaker Batch Transform for PyTorch Batch Inference\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "ac52b806", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "---" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "ea2e8bde", - "metadata": { - "papermill": { - "duration": 0.009489, - "end_time": "2021-06-03T00:10:10.266437", - "exception": false, - "start_time": "2021-06-03T00:10:10.256948", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "In this notebook, we examine how to do a Batch Transform task with PyTorch in Amazon SageMaker. \n", - "\n", - "First, an image classification model is built on the MNIST dataset. Then, we demonstrate batch transform by using the SageMaker Python SDK PyTorch framework with different configurations:\n", - "- `data_type=S3Prefix`: uses all objects that match the specified S3 prefix for batch inference.\n", - "- `data_type=ManifestFile`: a manifest file contains a list of object keys to use in batch inference.\n", - "- `instance_count>1`: distributes the batch inference dataset to multiple inference instances.\n", - "\n", - "For batch transform in TensorFlow in Amazon SageMaker, you can follow other Jupyter notebooks in the [sagemaker_batch_transform](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_batch_transform) directory.\n", - "\n", - "### Runtime\n", - "\n", - "This notebook takes approximately 15 minutes to run.\n", - "\n", - "### Contents\n", - "\n", - "1. [Setup](#Setup)\n", - "1. [Model training](#Model-training)\n", - "1. [Prepare batch inference data](#Prepare-batch-inference-data)\n", - "1. [Create model transformer](#Create-model-transformer)\n", - "1. [Batch inference](#Batch-inference)\n", - "1. [Look at all transform jobs](#Look-at-all-transform-jobs)\n", - "1. 
[Conclusion](#Conclusion)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "cb8aa488", - "metadata": { - "papermill": { - "duration": 0.009319, - "end_time": "2021-06-03T00:10:10.285106", - "exception": false, - "start_time": "2021-06-03T00:10:10.275787", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "## Setup\n", - "We'll begin with some necessary installs and imports, and get an Amazon SageMaker session to help perform certain tasks, as well as an IAM role with the necessary permissions." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "347fb3de", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install nvidia-ml-py3\n", - "!yes | pip uninstall torchvision\n", - "!pip install torchvision" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "53e1a695", - "metadata": { - "execution": { - "iopub.execute_input": "2021-06-03T00:10:10.310480Z", - "iopub.status.busy": "2021-06-03T00:10:10.309977Z", - "iopub.status.idle": "2021-06-03T00:10:11.972019Z", - "shell.execute_reply": "2021-06-03T00:10:11.971547Z" - }, - "papermill": { - "duration": 1.677667, - "end_time": "2021-06-03T00:10:11.972131", - "exception": false, - "start_time": "2021-06-03T00:10:10.294464", - "status": "completed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "%matplotlib inline\n", - "import matplotlib\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import os\n", - "from os import listdir\n", - "from os.path import isfile, join\n", - "from shutil import copyfile\n", - "import sagemaker\n", - "from sagemaker.pytorch import PyTorchModel\n", - "from sagemaker import get_execution_role\n", - "\n", - "sagemaker_session = sagemaker.Session()\n", - "region = sagemaker_session.boto_region_name\n", - "role = get_execution_role()\n", - "\n", - "bucket = sagemaker_session.default_bucket()\n", - "prefix = \"sagemaker/DEMO-pytorch-batch-inference-script\"\n", - 
"print(\"Bucket: {}\".format(bucket))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "1df34f4f", - "metadata": { - "papermill": { - "duration": 0.009748, - "end_time": "2021-06-03T00:10:11.992188", - "exception": false, - "start_time": "2021-06-03T00:10:11.982440", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "## Model training" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "2e50d7ed", - "metadata": { - "papermill": { - "duration": 0.009924, - "end_time": "2021-06-03T00:10:12.012090", - "exception": false, - "start_time": "2021-06-03T00:10:12.002166", - "status": "completed" - }, - "tags": [] - }, - "source": [ - "Since the main purpose of this notebook is to demonstrate SageMaker PyTorch batch transform, we reuse a SageMaker Python SDK [PyTorch MNIST example](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/pytorch_mnist) to train a PyTorch model. It takes around 7 minutes to finish the training." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bfa3102c", - "metadata": { - "execution": { - "iopub.execute_input": "2021-06-03T00:10:12.038135Z", - "iopub.status.busy": "2021-06-03T00:10:12.037362Z", - "iopub.status.idle": "2021-06-03T00:15:42.451109Z", - "shell.execute_reply": "2021-06-03T00:15:42.449969Z" - }, - "papermill": { - "duration": 330.429296, - "end_time": "2021-06-03T00:15:42.451328", - "exception": true, - "start_time": "2021-06-03T00:10:12.022032", - "status": "failed" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "from torchvision.datasets import MNIST\n", - "from torchvision import transforms\n", - "\n", - "local_dir = \"data\"\n", - "MNIST.mirrors = [\n", - " f\"https://sagemaker-example-files-prod-{region}.s3.amazonaws.com/datasets/image/MNIST/\"\n", - "]\n", - "MNIST(\n", - " local_dir,\n", - " download=True,\n", - " transform=transforms.Compose(\n", - " [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]\n", - " ),\n", - ")\n", - "\n", - "\n", - "inputs = sagemaker_session.upload_data(path=local_dir, bucket=bucket, key_prefix=prefix)\n", - "print(\"input spec (in this case, just an S3 path): {}\".format(inputs))\n", - "\n", - "from sagemaker.pytorch import PyTorch\n", - "\n", - "estimator = PyTorch(\n", - " entry_point=\"model-script/mnist.py\",\n", - " role=role,\n", - " framework_version=\"1.8.0\",\n", - " py_version=\"py3\",\n", - " instance_count=3,\n", - " instance_type=\"ml.c5.2xlarge\",\n", - " hyperparameters={\n", - " \"epochs\": 1,\n", - " \"backend\": \"gloo\",\n", - " }, # set epochs to a more realistic number for real training\n", - ")\n", - "\n", - "estimator.fit({\"training\": inputs})" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "a0f0249f", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "## Prepare batch inference data\n", - 
"\n", - "Convert the test data into PNG image format." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "343a2a68", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "!ls data/MNIST/raw" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a29e9c07", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# untar gz => png\n", - "\n", - "import gzip\n", - "import numpy as np\n", - "import os\n", - "\n", - "with gzip.open(os.path.join(local_dir, \"MNIST/raw\", \"t10k-images-idx3-ubyte.gz\"), \"rb\") as f:\n", - " images = np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28, 28)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "91f0f659", - "metadata": {}, - "outputs": [], - "source": [ - "print(len(images), \"test images\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "b617160c", - "metadata": {}, - "source": [ - "Randomly sample 100 test images and upload them to S3." 
 - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "62f06915", - "metadata": {}, - "outputs": [], - "source": [ - "import random\n", - "from PIL import Image as im\n", - "\n", - "ids = random.sample(range(len(images)), 100)\n", - "ids = np.array(ids, dtype=int)\n", - "selected_images = images[ids]\n", - "\n", - "image_dir = \"data/images\"\n", - "\n", - "if not os.path.exists(image_dir):\n", - " os.makedirs(image_dir)\n", - "\n", - "for i, img in enumerate(selected_images):\n", - " pngimg = im.fromarray(img)\n", - " pngimg.save(os.path.join(image_dir, f\"{i}.png\"))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bf93b71e", - "metadata": {}, - "outputs": [], - "source": [ - "inference_prefix = \"batch_transform\"\n", - "inference_inputs = sagemaker_session.upload_data(\n", - " path=image_dir, bucket=bucket, key_prefix=inference_prefix\n", - ")\n", - "print(\"Input S3 path: {}\".format(inference_inputs))" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "ff8b9b66", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "## Create model transformer\n", - "Now, we create a transformer object for creating and interacting with Amazon SageMaker transform jobs. We can create the transformer in two ways:\n", - "1. Use a fitted estimator directly.\n", - "1. First create a PyTorchModel from a saved model artifact, and then create a transformer from the PyTorchModel object.\n", - "\n", - "\n", - "Here, we implement the `model_fn`, `input_fn`, `predict_fn` and `output_fn` functions to override the default [PyTorch inference handler](https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/master/src/sagemaker_pytorch_serving_container/default_inference_handler.py). \n", - "\n", - "In the `input_fn()` function, the images sent for inference arrive as a Python bytearray. 
That's why we use the `load_from_bytearray()` function to wrap the bytes in `io.BytesIO` and then use `PIL.Image` to read the images.\n", - "\n", - "```python\n", - "def model_fn(model_dir):\n", - " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", - " model = torch.nn.DataParallel(Net())\n", - " with open(os.path.join(model_dir, \"model.pth\"), \"rb\") as f:\n", - " model.load_state_dict(torch.load(f))\n", - " return model.to(device)\n", - "\n", - " \n", - "def load_from_bytearray(request_body):\n", - " image_as_bytes = io.BytesIO(request_body)\n", - " image = Image.open(image_as_bytes)\n", - " image_tensor = ToTensor()(image).unsqueeze(0) \n", - " return image_tensor\n", - "\n", - "\n", - "def input_fn(request_body, request_content_type):\n", - " # if set content_type as \"image/jpg\" or \"application/x-npy\", \n", - " # the input is also a python bytearray\n", - " if request_content_type == \"application/x-image\": \n", - " image_tensor = load_from_bytearray(request_body)\n", - " else:\n", - " print(\"not support this type yet\")\n", - " raise ValueError(\"not support this type yet\")\n", - " return image_tensor\n", - "\n", - "\n", - "# Perform prediction on the deserialized object, with the loaded model\n", - "def predict_fn(input_object, model):\n", - " output = model.forward(input_object)\n", - " pred = output.max(1, keepdim=True)[1]\n", - "\n", - " return {\"predictions\": pred.item()}\n", - "\n", - "\n", - "# Serialize the prediction result into the desired response content type\n", - "def output_fn(predictions, response_content_type):\n", - " return json.dumps(predictions)\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "86782070", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Use fitted estimator directly\n", - "transformer = 
estimator.transformer(instance_count=1, instance_type=\"ml.c5.xlarge\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "09735ff2", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# You can also create a Transformer object from saved model artifact\n", - "\n", - "# Get model artifact location by estimator.model_data, or give an S3 key directly\n", - "model_artifact_s3_location = estimator.model_data # \"s3:////model.tar.gz\"\n", - "\n", - "# Create PyTorchModel from saved model artifact\n", - "pytorch_model = PyTorchModel(\n", - " model_data=model_artifact_s3_location,\n", - " role=role,\n", - " framework_version=\"1.8.0\",\n", - " py_version=\"py3\",\n", - " source_dir=\"model-script/\",\n", - " entry_point=\"mnist.py\",\n", - ")\n", - "\n", - "# Create transformer from PyTorchModel object\n", - "transformer = pytorch_model.transformer(instance_count=1, instance_type=\"ml.c5.xlarge\")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f024f81c", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "## Batch inference\n", - "Next, we perform inference on the sampled 100 MNIST images in a batch manner. " - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "e3aafd66", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "### Input images directly from S3 location\n", - "We set `S3DataType=S3Prefix` to use all objects that match the specified S3 prefix for batch inference." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3f666cde", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "transformer.transform(\n", - " data=inference_inputs,\n", - " data_type=\"S3Prefix\",\n", - " content_type=\"application/x-image\",\n", - " wait=True,\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "9d42055d", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "### Input images by manifest file\n", - "First, we generate a manifest file. Then we use the manifest file containing a list of object keys as inputs to batch inference. Some key points:\n", - "- `content_type = \"application/x-image\"` (here the `content_type` is for the actual object for inference, not for the manifest file)\n", - "- `data_type = \"ManifestFile\"`\n", - "- Manifest file format must follow the format as [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html#SageMaker-Type-S3DataSource-S3DataType) points out. 
We create the manifest file by using the jsonlines package.\n", - "``` json\n", - "[\n", - " {\"prefix\": \"s3://customer_bucket/some/prefix/\"},\n", - " \"relative/path/to/custdata-1\",\n", - " \"relative/path/custdata-2\",\n", - " ...\n", - " \"relative/path/custdata-N\"\n", - "]\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "295c39fc", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "!pip install -q jsonlines" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b279b271", - "metadata": {}, - "outputs": [], - "source": [ - "import jsonlines\n", - "\n", - "# Build image list\n", - "manifest_prefix = f\"s3://{bucket}/{prefix}/images/\"\n", - "\n", - "path = image_dir\n", - "img_files = [f for f in listdir(path) if isfile(join(path, f))]\n", - "\n", - "print(\"img_files\\n\", img_files)\n", - "\n", - "manifest_content = [{\"prefix\": manifest_prefix}]\n", - "manifest_content.extend(img_files)\n", - "\n", - "print(\"manifest_content\\n\", manifest_content)\n", - "\n", - "# Write jsonl file\n", - "manifest_file = \"manifest.json\"\n", - "with jsonlines.open(manifest_file, mode=\"w\") as writer:\n", - " writer.write(manifest_content)\n", - "\n", - "# Upload to S3\n", - "manifest_obj = sagemaker_session.upload_data(path=manifest_file, key_prefix=prefix)\n", - "\n", - "print(\"manifest_obj\\n\", manifest_obj)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b58e5fe6", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "# Batch transform with manifest file\n", - "transform_job = transformer.transform(\n", - " data=manifest_obj,\n", - " data_type=\"ManifestFile\",\n", - " 
content_type=\"application/x-image\",\n", - " wait=False,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "aaa60562", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Latest transform job:\", transformer.latest_transform_job.name)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "56dde353", - "metadata": {}, - "outputs": [], - "source": [ - "# look at the status of the transform job\n", - "import pprint as pp\n", - "\n", - "sm_cli = sagemaker_session.sagemaker_client\n", - "\n", - "job_info = sm_cli.describe_transform_job(TransformJobName=transformer.latest_transform_job.name)\n", - "\n", - "pp.pprint(job_info)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "f4a43f63", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "source": [ - "### Multiple instance\n", - "We use `instance_count > 1` to create multiple inference instances. When a batch transform job starts, Amazon SageMaker initializes compute instances and distributes the inference or preprocessing workload between them. Batch Transform partitions the Amazon S3 objects in the input by key and maps Amazon S3 objects to instances. Given multiple files, one instance might process input1.csv, and another instance might process input2.csv. Read more at [Use Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html)." 
 - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9661fe0e", - "metadata": { - "papermill": { - "duration": null, - "end_time": null, - "exception": null, - "start_time": null, - "status": "pending" - }, - "tags": [] - }, - "outputs": [], - "source": [ - "dist_transformer = estimator.transformer(instance_count=2, instance_type=\"ml.c4.xlarge\")\n", - "\n", - "dist_transformer.transform(\n", - " data=inference_inputs,\n", - " data_type=\"S3Prefix\",\n", - " content_type=\"application/x-image\",\n", - " wait=True,\n", - ")" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "57d2f7f8", - "metadata": {}, - "source": [ - "## Look at all transform jobs" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "942c6f2e", - "metadata": {}, - "source": [ - "We list and describe the transform jobs to retrieve information about them." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7725d230", - "metadata": {}, - "outputs": [], - "source": [ - "transform_jobs = sm_cli.list_transform_jobs()[\"TransformJobSummaries\"]\n", - "for job in transform_jobs:\n", - " pp.pprint(job)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5b694abf", - "metadata": {}, - "outputs": [], - "source": [ - "job_info = sm_cli.describe_transform_job(\n", - " TransformJobName=dist_transformer.latest_transform_job.name\n", - ")\n", - "\n", - "pp.pprint(job_info)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9e682401", - "metadata": {}, - "outputs": [], - "source": [ - "import re\n", - "\n", - "\n", - "def get_bucket_and_prefix(s3_output_path):\n", - " trim = re.sub(\"s3://\", \"\", s3_output_path)\n", - " bucket, prefix = trim.split(\"/\", 1)\n", - " return bucket, prefix\n", - "\n", - "\n", - "local_path = \"output\" # Where to save the output locally\n", - "\n", - "bucket, output_prefix = get_bucket_and_prefix(job_info[\"TransformOutput\"][\"S3OutputPath\"])\n", - "print(bucket, 
output_prefix)\n", - "\n", - "sagemaker_session.download_data(path=local_path, bucket=bucket, key_prefix=output_prefix)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1ae24be8", - "metadata": {}, - "outputs": [], - "source": [ - "!ls {local_path}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8c336288", - "metadata": {}, - "outputs": [], - "source": [ - "# Inspect the output\n", - "\n", - "import json\n", - "\n", - "for f in os.listdir(local_path):\n", - " path = os.path.join(local_path, f)\n", - " with open(path, \"r\") as f:\n", - " pred = json.load(f)\n", - " print(pred)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "e3cbd160", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "In this notebook, we trained a PyTorch model, created a transformer from it, and then performed batch inference using S3 inputs, manifest files, and on multiple instances. This shows a variety of options that are available when running SageMaker Batch Transform jobs for batch inference." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "id": "cdb3abb1", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker_batch_transform|pytorch_mnist_batch_transform|pytorch-mnist-batch-transform.ipynb)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (PyTorch 1.13 Python 3.9 CPU Optimized)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/pytorch-1.13-cpu-py39" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - }, - "papermill": { - "default_parameters": {}, - "duration": 333.854918, - "end_time": "2021-06-03T00:15:43.072184", - "environment_variables": {}, - "exception": true, - "input_path": "pytorch-mnist-batch-transform.ipynb", - "output_path": "/opt/ml/processing/output/pytorch-mnist-batch-transform-2021-06-03-00-06-06.ipynb", - "parameters": { - "kms_key": "arn:aws:kms:us-west-2:521695447989:key/6e9984db-50cf-4c7e-926c-877ec47a8b25" - }, - "start_time": "2021-06-03T00:10:09.217266", - "version": "2.3.3" - }, - "widgets": { - "application/vnd.jupyter.widget-state+json": { - "state": { - "01005530a5b1473b9f4a024b19c04c0e": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - 
"_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_968ed82ad8f0453e8f81a839df4428db", - "placeholder": "​", - "style": "IPY_MODEL_e4f0965e53ee40adb1ae44da87428325", - "value": " 0%" - } - }, - "0995f6633c0f4facabe6759837c606ba": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": "20px" - } - }, - "1410dcfcd117434889e9594cdde4e1b0": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": 
"StyleView", - "description_width": "" - } - }, - "18caaab41d6146c1824859691f6cb435": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_d823500ff0dc4c2198b83cd231f8bffe", - "placeholder": "​", - "style": "IPY_MODEL_7dab31892241494e8d27d38ca98e5aa6", - "value": " 0/28881 [00:00<?, ?it/s]" - } - }, - "19ef65b0ecae45bdbca066cea679878d": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_2ceacd43f28744eb9b7a12f8276b6016", - "IPY_MODEL_e44ddce6c5704f0b9495ee662806f5f6", - "IPY_MODEL_7717cc87ebcc4c0581ae32848b40982c" - ], - "layout": "IPY_MODEL_59d0678977a343abb8a02dc5c9699b89" - } - }, - "2126024805384bff9b0409b4dc91e60c": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - 
"grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "216ba33f9f1b486ebac2a6fce0510246": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_f94b5a0d68c541e894e325a0e2f899d2", - "placeholder": "​", - "style": "IPY_MODEL_633cc1cdb94e43a6a07559483496c60d", - "value": " 0%" - } - }, - "23445154eb524df985b5a755fcbddd32": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_216ba33f9f1b486ebac2a6fce0510246", - "IPY_MODEL_c4f4f4bfe979469c9bc59ab73bbf518f", - "IPY_MODEL_fe83e178358040eaa07f6198ba693fc9" - ], - "layout": "IPY_MODEL_cf1f337300394948bce741af7bcd8b8c" - } - }, - 
"235ae38cf16e4aacb95c3d16d9749da3": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "2c2474d5a8144bf8930fa5cc02c73ccf": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_5495428879544d6da73e2ed7e70f0c96", - "IPY_MODEL_6e2a4641cd944d9a8196f4a836e90590", - "IPY_MODEL_9179e5f467c8450a988b988d7da06090" - ], - "layout": "IPY_MODEL_596f8cbad0884ec79cf6ee757cc9f38a" - } - }, - "2ceacd43f28744eb9b7a12f8276b6016": { - "model_module": 
"@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_a540362f86774590851c1d0892bea723", - "placeholder": "​", - "style": "IPY_MODEL_bb9ebd025f05499da7b847b8ef7a9ff5", - "value": "" - } - }, - "495839f4239743669d9ee61cfbc33967": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "4d62b9fde9104c8081b545c3933a077e": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "51a28ca59cf9407ea0e02da868d79ebd": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - 
"description_width": "" - } - }, - "5495428879544d6da73e2ed7e70f0c96": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_235ae38cf16e4aacb95c3d16d9749da3", - "placeholder": "​", - "style": "IPY_MODEL_fe60ae53dd1646ca91018ba20934948b", - "value": "" - } - }, - "596f8cbad0884ec79cf6ee757cc9f38a": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "59d0678977a343abb8a02dc5c9699b89": { - "model_module": 
"@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "633cc1cdb94e43a6a07559483496c60d": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "63a57f663bfa4a1585c1ba36501b6b23": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - 
"_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "6e2a4641cd944d9a8196f4a836e90590": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "info", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_fb8c653eeeb24799bcc9279389fdb523", - "max": 1, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_b513a456776d40b496f035c64360db90", - "value": 1 - } - }, - "7717cc87ebcc4c0581ae32848b40982c": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - 
"_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_9b63360561b34257b171498e67902dda", - "placeholder": "​", - "style": "IPY_MODEL_f86487d9a78940a394503b2bea77d756", - "value": " 9920512/? [04:50<00:00, 36552.15it/s]" - } - }, - "7bceed60fb344aa182dccc3dcf0ee886": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HBoxModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HBoxModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HBoxView", - "box_style": "", - "children": [ - "IPY_MODEL_01005530a5b1473b9f4a024b19c04c0e", - "IPY_MODEL_e82a5227430443d98d29555fd77b2bd3", - "IPY_MODEL_18caaab41d6146c1824859691f6cb435" - ], - "layout": "IPY_MODEL_63a57f663bfa4a1585c1ba36501b6b23" - } - }, - "7dab31892241494e8d27d38ca98e5aa6": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "8b5b76e77cb14ecf95a310ba46ed86f5": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": 
null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": "20px" - } - }, - "9179e5f467c8450a988b988d7da06090": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_94ef992b73d44d829b863815da70111f", - "placeholder": "​", - "style": "IPY_MODEL_1410dcfcd117434889e9594cdde4e1b0", - "value": " 1654784/? 
[00:47<00:00, 33514.08it/s]" - } - }, - "94ef992b73d44d829b863815da70111f": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "968ed82ad8f0453e8f81a839df4428db": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - 
"grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "9b63360561b34257b171498e67902dda": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "a540362f86774590851c1d0892bea723": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - 
"_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "b513a456776d40b496f035c64360db90": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "ProgressStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "ProgressStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "bar_color": null, - "description_width": "" - } - }, - "bb9ebd025f05499da7b847b8ef7a9ff5": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - 
"c0b88a223b374693b6b0c74db9ffe346": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": "20px" - } - }, - "c4f4f4bfe979469c9bc59ab73bbf518f": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "info", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_8b5b76e77cb14ecf95a310ba46ed86f5", - "max": 1, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_51a28ca59cf9407ea0e02da868d79ebd", - "value": 0 - } - }, - 
"cf1f337300394948bce741af7bcd8b8c": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "d823500ff0dc4c2198b83cd231f8bffe": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": 
null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "e44ddce6c5704f0b9495ee662806f5f6": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "info", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_0995f6633c0f4facabe6759837c606ba", - "max": 1, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_4d62b9fde9104c8081b545c3933a077e", - "value": 1 - } - }, - "e4f0965e53ee40adb1ae44da87428325": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "e82a5227430443d98d29555fd77b2bd3": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "FloatProgressModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "FloatProgressModel", - 
"_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "ProgressView", - "bar_style": "info", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_c0b88a223b374693b6b0c74db9ffe346", - "max": 1, - "min": 0, - "orientation": "horizontal", - "style": "IPY_MODEL_495839f4239743669d9ee61cfbc33967", - "value": 0 - } - }, - "eb4c77cfe2c54976aef8efc0e3207140": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "f86487d9a78940a394503b2bea77d756": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - "state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "f94b5a0d68c541e894e325a0e2f899d2": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": 
null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": null - } - }, - "fb8c653eeeb24799bcc9279389fdb523": { - "model_module": "@jupyter-widgets/base", - "model_module_version": "1.2.0", - "model_name": "LayoutModel", - "state": { - "_model_module": "@jupyter-widgets/base", - "_model_module_version": "1.2.0", - "_model_name": "LayoutModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "LayoutView", - "align_content": null, - "align_items": null, - "align_self": null, - "border": null, - "bottom": null, - "display": null, - "flex": null, - "flex_flow": null, - "grid_area": null, - "grid_auto_columns": null, - "grid_auto_flow": null, - "grid_auto_rows": null, - "grid_column": null, - "grid_gap": null, - "grid_row": null, - "grid_template_areas": null, - "grid_template_columns": null, - "grid_template_rows": null, - "height": null, - "justify_content": null, - "justify_items": null, - "left": null, - "margin": null, - "max_height": null, - "max_width": null, - "min_height": null, - "min_width": null, - "object_fit": null, - "object_position": null, - "order": null, - "overflow": null, - "overflow_x": null, - "overflow_y": null, - "padding": null, - "right": null, - "top": null, - "visibility": null, - "width": "20px" - } - }, - "fe60ae53dd1646ca91018ba20934948b": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "DescriptionStyleModel", - 
"state": { - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "DescriptionStyleModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/base", - "_view_module_version": "1.2.0", - "_view_name": "StyleView", - "description_width": "" - } - }, - "fe83e178358040eaa07f6198ba693fc9": { - "model_module": "@jupyter-widgets/controls", - "model_module_version": "1.5.0", - "model_name": "HTMLModel", - "state": { - "_dom_classes": [], - "_model_module": "@jupyter-widgets/controls", - "_model_module_version": "1.5.0", - "_model_name": "HTMLModel", - "_view_count": null, - "_view_module": "@jupyter-widgets/controls", - "_view_module_version": "1.5.0", - "_view_name": "HTMLView", - "description": "", - "description_tooltip": null, - "layout": "IPY_MODEL_2126024805384bff9b0409b4dc91e60c", - "placeholder": "​", - "style": "IPY_MODEL_eb4c77cfe2c54976aef8efc0e3207140", - "value": " 0/4542 [00:00<?, ?it/s]" - } - } - }, - "version_major": 2, - "version_minor": 0 - } - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/sagemaker_model_monitor/introduction/SageMaker-ModelMonitoring.ipynb b/sagemaker_model_monitor/introduction/SageMaker-ModelMonitoring.ipynb deleted file mode 100644 index 190f8bb19d..0000000000 --- a/sagemaker_model_monitor/introduction/SageMaker-ModelMonitoring.ipynb +++ /dev/null @@ -1,814 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Amazon SageMaker Model Monitor\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook shows how to:\n", - "* Host a machine learning model in Amazon SageMaker and capture inference requests, results, and metadata \n", - "* Analyze a training dataset to generate baseline constraints\n", - "* Monitor a live endpoint for violations against constraints\n", - "\n", - "---\n", - "## Background\n", - "\n", - "Amazon SageMaker provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. Amazon SageMaker is a fully-managed service that encompasses the entire machine learning workflow. You can label and prepare your data, choose an algorithm, train a model, and then tune and optimize it for deployment. You can deploy your models to production with Amazon SageMaker to make predictions and lower costs than was previously possible.\n", - "\n", - "In addition, Amazon SageMaker enables you to capture the input, output and metadata for invocations of the models that you deploy. It also enables you to analyze the data and monitor its quality. In this notebook, you learn how Amazon SageMaker enables these capabilities.\n", - "\n", - "## Runtime\n", - "\n", - "This notebook uses an hourly monitor, so it takes between 30-90 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [PART A: Capturing real-time inference data from Amazon SageMaker endpoints](#PART-A:-Capturing-real-time-inference-data-from-Amazon-SageMaker-endpoints)\n", - "1. [PART B: Model Monitor - Baselining and continuous monitoring](#PART-B:-Model-Monitor---Baselining-and-continuous-monitoring)\n", - " 1. 
[Constraint suggestion with baseline/training dataset](#1.-Constraint-suggestion-with-baseline/training-dataset)\n", - " 1. [Analyze collected data for data quality issues](#2.-Analyze-collected-data-for-data-quality-issues)\n", - "---\n", - "## Setup\n", - "\n", - "To get started, make sure you have these prerequisites completed:\n", - "\n", - "* Specify an AWS Region to host your model.\n", - "* An IAM role ARN exists that is used to give Amazon SageMaker access to your data in Amazon Simple Storage Service (Amazon S3).\n", - "* Use the default S3 bucket to store the data used to train your model, any additional model data, and the data captured from model invocations. For demonstration purposes, you are using the same bucket for these. In reality, you might want to separate them with different security policies." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "isConfigCell": true - }, - "outputs": [], - "source": [ - "import os\n", - "import boto3\n", - "import re\n", - "import json\n", - "import sagemaker\n", - "from sagemaker import get_execution_role, session\n", - "\n", - "sm_session = sagemaker.Session()\n", - "region = sm_session.boto_region_name\n", - "\n", - "role = get_execution_role()\n", - "print(\"Role ARN: {}\".format(role))\n", - "\n", - "bucket = sm_session.default_bucket()\n", - "print(\"Demo Bucket: {}\".format(bucket))\n", - "prefix = \"sagemaker/DEMO-ModelMonitor\"\n", - "\n", - "data_capture_prefix = \"{}/datacapture\".format(prefix)\n", - "s3_capture_upload_path = \"s3://{}/{}\".format(bucket, data_capture_prefix)\n", - "reports_prefix = \"{}/reports\".format(prefix)\n", - "s3_report_path = \"s3://{}/{}\".format(bucket, reports_prefix)\n", - "code_prefix = \"{}/code\".format(prefix)\n", - "s3_code_preprocessor_uri = \"s3://{}/{}/{}\".format(bucket, code_prefix, \"preprocessor.py\")\n", - "s3_code_postprocessor_uri = \"s3://{}/{}/{}\".format(bucket, code_prefix, \"postprocessor.py\")\n", - "\n", - 
"print(\"Capture path: {}\".format(s3_capture_upload_path))\n", - "print(\"Report path: {}\".format(s3_report_path))\n", - "print(\"Preproc Code path: {}\".format(s3_code_preprocessor_uri))\n", - "print(\"Postproc Code path: {}\".format(s3_code_postprocessor_uri))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## PART A: Capturing real-time inference data from Amazon SageMaker endpoints\n", - "Create an endpoint to showcase the data capture capability in action.\n", - "\n", - "### Upload the pre-trained model to Amazon S3\n", - "This code uploads a pre-trained XGBoost model that is ready for you to deploy. This model was trained using the XGB Churn Prediction Notebook in SageMaker. You can also use your own pre-trained model in this step. If you already have a pretrained model in Amazon S3, you can add it instead by specifying the s3_key." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "model_file = open(\"model/xgb-churn-prediction-model.tar.gz\", \"rb\")\n", - "s3_key = os.path.join(prefix, \"xgb-churn-prediction-model.tar.gz\")\n", - "boto3.Session().resource(\"s3\").Bucket(bucket).Object(s3_key).upload_fileobj(model_file)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Deploy the model to Amazon SageMaker\n", - "Start with deploying a pre-trained churn prediction model. Here, you create the model object with the image and model data." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from time import gmtime, strftime\n", - "from sagemaker.model import Model\n", - "from sagemaker.image_uris import retrieve\n", - "\n", - "model_name = \"DEMO-xgb-churn-pred-model-monitor-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", - "model_url = \"https://{}.s3-{}.amazonaws.com/{}/xgb-churn-prediction-model.tar.gz\".format(\n", - " bucket, region, prefix\n", - ")\n", - "\n", - "image_uri = retrieve(\"xgboost\", region, \"0.90-1\")\n", - "\n", - "model = Model(image_uri=image_uri, model_data=model_url, role=role)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To enable data capture for monitoring the model data quality, you specify the new capture option called `DataCaptureConfig`. You can capture the request payload, the response payload or both with this configuration. The capture config applies to all variants. Go ahead with the deployment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.model_monitor import DataCaptureConfig\n", - "\n", - "endpoint_name = \"DEMO-xgb-churn-pred-model-monitor-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", - "print(\"EndpointName={}\".format(endpoint_name))\n", - "\n", - "data_capture_config = DataCaptureConfig(\n", - " enable_capture=True, sampling_percentage=100, destination_s3_uri=s3_capture_upload_path\n", - ")\n", - "\n", - "predictor = model.deploy(\n", - " initial_instance_count=1,\n", - " instance_type=\"ml.m4.xlarge\",\n", - " endpoint_name=endpoint_name,\n", - " data_capture_config=data_capture_config,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Invoke the deployed model\n", - "\n", - "You can now send data to this endpoint to get inferences in real time. 
Because you enabled the data capture in the previous steps, the request and response payload, along with some additional metadata, is saved in the Amazon Simple Storage Service (Amazon S3) location you have specified in the DataCaptureConfig." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This step invokes the endpoint with included sample data for about 3 minutes. Data is captured based on the sampling percentage specified and the capture continues until the data capture option is turned off." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.predictor import Predictor\n", - "from sagemaker.serializers import CSVSerializer\n", - "import time\n", - "\n", - "predictor = Predictor(endpoint_name=endpoint_name, serializer=CSVSerializer())\n", - "\n", - "# Get a subset of test data for a quick test\n", - "!head -180 test_data/test-dataset-input-cols.csv > test_data/test_sample.csv\n", - "print(\"Sending test traffic to the endpoint {}. \\nPlease wait...\".format(endpoint_name))\n", - "\n", - "with open(\"test_data/test_sample.csv\", \"r\") as f:\n", - " for row in f:\n", - " payload = row.rstrip(\"\\n\")\n", - " response = predictor.predict(data=payload)\n", - " time.sleep(1)\n", - "\n", - "print(\"Done!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### View captured data\n", - "\n", - "Now list the data capture files stored in Amazon S3. You should expect to see different files from different time periods organized based on the hour in which the invocation occurred. 
The format of the Amazon S3 path is:\n", - "\n", - "`s3://{destination-bucket-prefix}/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh/filename.jsonl`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "s3_client = boto3.Session().client(\"s3\")\n", - "current_endpoint_capture_prefix = \"{}/{}\".format(data_capture_prefix, endpoint_name)\n", - "result = s3_client.list_objects(Bucket=bucket, Prefix=current_endpoint_capture_prefix)\n", - "capture_files = [capture_file.get(\"Key\") for capture_file in result.get(\"Contents\")]\n", - "print(\"Found Capture Files:\")\n", - "print(\"\\n \".join(capture_files))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, view the contents of a single capture file. Here you should see all the data captured in an Amazon SageMaker specific JSON-line formatted file. Take a quick peek at the first few lines in the captured file." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def get_obj_body(obj_key):\n", - " return s3_client.get_object(Bucket=bucket, Key=obj_key).get(\"Body\").read().decode(\"utf-8\")\n", - "\n", - "\n", - "capture_file = get_obj_body(capture_files[-1])\n", - "print(capture_file[:2000])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, the contents of a single line are presented below as formatted JSON so that you can inspect them more easily." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "import json\n", - "\n", - "print(json.dumps(json.loads(capture_file.split(\"\\n\")[0]), indent=2))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As you can see, each inference request is captured in one line in the jsonl file. The line contains both the input and output merged together.
In the example, you provided the ContentType as `text/csv` which is reflected in the `observedContentType` value. Also, you expose the encoding that you used to encode the input and output payloads in the capture format with the `encoding` value.\n", - "\n", - "To recap, you observed how you can enable capturing the input or output payloads to an endpoint with a new parameter. You have also observed what the captured format looks like in Amazon S3. Next, continue to explore how Amazon SageMaker helps with monitoring the data collected in Amazon S3." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## PART B: Model Monitor - Baselining and continuous monitoring" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In addition to collecting the data, Amazon SageMaker provides the capability for you to monitor and evaluate the data observed by the endpoints. For this:\n", - "1. Create a baseline with which you compare the realtime traffic. \n", - "1. Once a baseline is ready, set up a schedule to continuously evaluate and compare against the baseline." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Constraint suggestion with baseline/training dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The training dataset with which you trained the model is usually a good baseline dataset. Note that the training dataset data schema and the inference dataset schema should exactly match (i.e. the number and order of the features).\n", - "\n", - "From the training dataset you can ask Amazon SageMaker to suggest a set of baseline `constraints` and generate descriptive `statistics` to explore the data. For this example, upload the training dataset that was used to train the pre-trained model included in this example. If you already have it in Amazon S3, you can directly point to it."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# copy over the training dataset to Amazon S3 (if you already have it in Amazon S3, you could reuse it)\n", - "baseline_prefix = prefix + \"/baselining\"\n", - "baseline_data_prefix = baseline_prefix + \"/data\"\n", - "baseline_results_prefix = baseline_prefix + \"/results\"\n", - "\n", - "baseline_data_uri = \"s3://{}/{}\".format(bucket, baseline_data_prefix)\n", - "baseline_results_uri = \"s3://{}/{}\".format(bucket, baseline_results_prefix)\n", - "print(\"Baseline data uri: {}\".format(baseline_data_uri))\n", - "print(\"Baseline results uri: {}\".format(baseline_results_uri))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "training_data_file = open(\"test_data/training-dataset-with-header.csv\", \"rb\")\n", - "s3_key = os.path.join(baseline_prefix, \"data\", \"training-dataset-with-header.csv\")\n", - "boto3.Session().resource(\"s3\").Bucket(bucket).Object(s3_key).upload_fileobj(training_data_file)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Create a baselining job with training dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that you have the training data ready in Amazon S3, start a job to `suggest` constraints. `DefaultModelMonitor.suggest_baseline(..)` starts a `ProcessingJob` using an Amazon SageMaker provided Model Monitor container to generate the constraints." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.model_monitor import DefaultModelMonitor\n", - "from sagemaker.model_monitor.dataset_format import DatasetFormat\n", - "\n", - "my_default_monitor = DefaultModelMonitor(\n", - " role=role,\n", - " instance_count=1,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " volume_size_in_gb=20,\n", - " max_runtime_in_seconds=3600,\n", - ")\n", - "\n", - "my_default_monitor.suggest_baseline(\n", - " baseline_dataset=baseline_data_uri + \"/training-dataset-with-header.csv\",\n", - " dataset_format=DatasetFormat.csv(header=True),\n", - " output_s3_uri=baseline_results_uri,\n", - " wait=True,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Explore the generated constraints and statistics" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "s3_client = boto3.Session().client(\"s3\")\n", - "result = s3_client.list_objects(Bucket=bucket, Prefix=baseline_results_prefix)\n", - "report_files = [report_file.get(\"Key\") for report_file in result.get(\"Contents\")]\n", - "print(\"Found Files:\")\n", - "print(\"\\n \".join(report_files))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "baseline_job = my_default_monitor.latest_baselining_job\n", - "schema_df = pd.io.json.json_normalize(baseline_job.baseline_statistics().body_dict[\"features\"])\n", - "schema_df.head(10)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "constraints_df = pd.io.json.json_normalize(\n", - " baseline_job.suggested_constraints().body_dict[\"features\"]\n", - ")\n", - "constraints_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. 
Analyze collected data for data quality issues\n", - "\n", - "When you have collected the data above, analyze and monitor the data with Monitoring Schedules." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Create a schedule" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Upload some test scripts to the S3 bucket for pre- and post-processing\n", - "bucket = boto3.Session().resource(\"s3\").Bucket(bucket)\n", - "bucket.Object(code_prefix + \"/preprocessor.py\").upload_file(\"preprocessor.py\")\n", - "bucket.Object(code_prefix + \"/postprocessor.py\").upload_file(\"postprocessor.py\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can create a model monitoring schedule for the endpoint created earlier. Use the baseline resources (constraints and statistics) to compare against the realtime traffic." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.model_monitor import CronExpressionGenerator\n", - "\n", - "mon_schedule_name = \"DEMO-xgb-churn-pred-model-monitor-schedule-\" + strftime(\n", - " \"%Y-%m-%d-%H-%M-%S\", gmtime()\n", - ")\n", - "my_default_monitor.create_monitoring_schedule(\n", - " monitor_schedule_name=mon_schedule_name,\n", - " endpoint_input=predictor.endpoint,\n", - " # record_preprocessor_script=pre_processor_script,\n", - " post_analytics_processor_script=s3_code_postprocessor_uri,\n", - " output_s3_uri=s3_report_path,\n", - " statistics=my_default_monitor.baseline_statistics(),\n", - " constraints=my_default_monitor.suggested_constraints(),\n", - " schedule_cron_expression=CronExpressionGenerator.hourly(),\n", - " enable_cloudwatch_metrics=True,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Start generating some artificial traffic\n", - "The cell below starts a thread to send some traffic to 
the endpoint. Note that you need to stop the kernel to terminate this thread. If there is no traffic, the monitoring jobs are marked as `Failed` since there is no data to process." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from threading import Thread\n", - "from time import sleep\n", - "\n", - "endpoint_name = predictor.endpoint\n", - "runtime_client = sm_session.sagemaker_runtime_client\n", - "\n", - "\n", - "# (just repeating code from above for convenience/ able to run this section independently)\n", - "def invoke_endpoint(ep_name, file_name, runtime_client):\n", - " with open(file_name, \"r\") as f:\n", - " for row in f:\n", - " payload = row.rstrip(\"\\n\")\n", - " response = runtime_client.invoke_endpoint(\n", - " EndpointName=ep_name, ContentType=\"text/csv\", Body=payload\n", - " )\n", - " response[\"Body\"].read()\n", - " time.sleep(1)\n", - "\n", - "\n", - "def invoke_endpoint_forever():\n", - " while True:\n", - " try:\n", - " invoke_endpoint(endpoint_name, \"test_data/test-dataset-input-cols.csv\", runtime_client)\n", - " except runtime_client.exceptions.ValidationError:\n", - " pass\n", - "\n", - "\n", - "thread = Thread(target=invoke_endpoint_forever)\n", - "thread.start()\n", - "\n", - "# Note that you need to stop the kernel to stop the invocations" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Describe and inspect the schedule\n", - "Once you describe, observe that the MonitoringScheduleStatus changes to Scheduled." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "desc_schedule_result = my_default_monitor.describe_schedule()\n", - "print(\"Schedule status: {}\".format(desc_schedule_result[\"MonitoringScheduleStatus\"]))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### List executions\n", - "The schedule starts jobs at the previously specified intervals. Here, you list the latest five executions. Note that if you are kicking this off after creating the hourly schedule, you might find the executions empty. You might have to wait until you cross the hour boundary (in UTC) to see executions kick off. The code below has the logic for waiting.\n", - "\n", - "Note: Even for an hourly schedule, Amazon SageMaker has a buffer period of 20 minutes to schedule your execution. You might see your execution start in anywhere from zero to ~20 minutes from the hour boundary. This is expected and done for load balancing in the backend." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "mon_executions = my_default_monitor.list_executions()\n", - "print(\n", - " \"We created an hourly schedule above that begins executions ON the hour (plus a 0-20 min buffer).\\nWe will have to wait till we hit the hour...\"\n", - ")\n", - "\n", - "while len(mon_executions) == 0:\n", - " print(\"Waiting for the first execution to happen...\")\n", - " time.sleep(60)\n", - " mon_executions = my_default_monitor.list_executions()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Inspect a specific execution (latest execution)\n", - "In the previous cell, you picked up the latest completed or failed scheduled execution.
Here are the possible terminal states and what each of them means: \n", - "* `Completed` - The monitoring execution completed and no issues were found in the violations report.\n", - "* `CompletedWithViolations` - The execution completed, but constraint violations were detected.\n", - "* `Failed` - The monitoring execution failed, maybe due to client error (perhaps incorrect role permissions) or infrastructure issues. Further examination of `FailureReason` and `ExitMessage` is necessary to identify what exactly happened.\n", - "* `Stopped` - The job exceeded max runtime or was manually stopped." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "latest_execution = mon_executions[-1] # Latest execution's index is -1, second to last is -2, etc\n", - "time.sleep(60)\n", - "latest_execution.wait(logs=False)\n", - "\n", - "print(\"Latest execution status: {}\".format(latest_execution.describe()[\"ProcessingJobStatus\"]))\n", - "print(\"Latest execution result: {}\".format(latest_execution.describe()[\"ExitMessage\"]))\n", - "\n", - "latest_job = latest_execution.describe()\n", - "if latest_job[\"ProcessingJobStatus\"] != \"Completed\":\n", - " print(\n", - " \"====STOP==== \\n No completed executions to inspect further.
Please wait till an execution completes or investigate previously reported failures.\"\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "report_uri = latest_execution.output.destination\n", - "print(\"Report Uri: {}\".format(report_uri))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### List the generated reports" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from urllib.parse import urlparse\n", - "\n", - "s3uri = urlparse(report_uri)\n", - "report_bucket = s3uri.netloc\n", - "report_key = s3uri.path.lstrip(\"/\")\n", - "print(\"Report bucket: {}\".format(report_bucket))\n", - "print(\"Report key: {}\".format(report_key))\n", - "\n", - "s3_client = boto3.Session().client(\"s3\")\n", - "result = s3_client.list_objects(Bucket=report_bucket, Prefix=report_key)\n", - "report_files = [report_file.get(\"Key\") for report_file in result.get(\"Contents\")]\n", - "print(\"Found Report Files:\")\n", - "print(\"\\n \".join(report_files))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Violations report" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Any violations compared to the baseline are listed below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "violations = my_default_monitor.latest_monitoring_constraint_violations()\n", - "pd.set_option(\"display.max_colwidth\", None)\n", - "constraints_df = pd.io.json.json_normalize(violations.body_dict[\"violations\"])\n", - "constraints_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Other commands\n", - "We can also start and stop the monitoring schedules." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# my_default_monitor.stop_monitoring_schedule()\n", - "# my_default_monitor.start_monitoring_schedule()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Delete resources\n", - "\n", - "You can keep your endpoint running to continue capturing data. If you do not plan to collect more data or use this endpoint further, delete the endpoint to avoid incurring additional charges. Note that deleting your endpoint does not delete the data that was captured during the model invocations. That data persists in Amazon S3 until you delete it yourself.\n", - "\n", - "You need to delete the schedule before deleting the model and endpoint." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "my_default_monitor.stop_monitoring_schedule()\n", - "my_default_monitor.delete_monitoring_schedule()\n", - "time.sleep(60) # Wait for the deletion" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "predictor.delete_model()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "predictor.delete_endpoint()" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker_model_monitor|introduction|SageMaker-ModelMonitoring.ipynb)\n" - ] - } - ], - "metadata": { - "anaconda-cloud": {}, - "kernelspec": { - "display_name": "Python 3 (Data Science 3.0)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - }, - "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." 
- }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/sagemaker_neo_compilation_jobs/pytorch_torchvision/pytorch_torchvision_neo.ipynb b/sagemaker_neo_compilation_jobs/pytorch_torchvision/pytorch_torchvision_neo.ipynb deleted file mode 100644 index 215a037530..0000000000 --- a/sagemaker_neo_compilation_jobs/pytorch_torchvision/pytorch_torchvision_neo.ipynb +++ /dev/null @@ -1,975 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Deploying pre-trained PyTorch vision models with Amazon SageMaker Neo" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Amazon SageMaker Neo is an API to compile machine learning models to optimize them for our choice of hardware targets. Currently, Neo supports pre-trained PyTorch models from [TorchVision](https://pytorch.org/docs/stable/torchvision/models.html). General support for other PyTorch models is forthcoming.\n", - "\n", - "### Runtime\n", - "\n", - "This notebook takes approximately 8 minutes to run.\n", - "\n", - "### Contents\n", - "\n", - "1. [Import ResNet18 from TorchVision](#Import-ResNet18-from-TorchVision)\n", - "1. [Invoke Neo Compilation API](#Invoke-Neo-Compilation-API)\n", - "1. [Deploy the model](#Deploy-the-model)\n", - "1. [Send requests](#Send-requests)\n", - "1. 
[Delete the Endpoint](#Delete-the-Endpoint)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Import ResNet18 from TorchVision" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We import the [ResNet18](https://arxiv.org/abs/1512.03385) model from TorchVision and create a model artifact `model.tar.gz`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import sys\n", - "\n", - "!{sys.executable} -m pip install torch==1.13.0 torchvision==0.14.0\n", - "!{sys.executable} -m pip install s3transfer==0.5.0\n", - "!{sys.executable} -m pip install --upgrade sagemaker" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Specify the input data shape. For more information, see [Prepare Model for Compilation](https://docs.aws.amazon.com/sagemaker/latest/dg/neo-compilation-preparing-model.html)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import sagemaker\n", - "import torch\n", - "import torchvision.models as models\n", - "import tarfile\n", - "\n", - "resnet18 = models.resnet18(pretrained=True)\n", - "input_shape = [1, 3, 224, 224]\n", - "trace = torch.jit.trace(resnet18.float().eval(), torch.zeros(input_shape).float())\n", - "trace.save(\"model.pth\")\n", - "\n", - "with tarfile.open(\"model.tar.gz\", \"w:gz\") as f:\n", - " f.add(\"model.pth\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Upload the model archive to S3" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Specify parameters for the compilation job and upload the `model.tar.gz` archive file." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import boto3\n", - "import sagemaker\n", - "import time\n", - "from sagemaker.utils import name_from_base\n", - "\n", - "role = sagemaker.get_execution_role()\n", - "sess = sagemaker.Session()\n", - "region = sess.boto_region_name\n", - "bucket = sess.default_bucket()\n", - "\n", - "compilation_job_name = name_from_base(\"TorchVision-ResNet18-Neo\")\n", - "prefix = compilation_job_name + \"/model\"\n", - "\n", - "model_path = sess.upload_data(path=\"model.tar.gz\", key_prefix=prefix)\n", - "\n", - "data_shape = '{\"input0\":[1,3,224,224]}'\n", - "target_device = \"ml_c5\"\n", - "framework = \"PYTORCH\"\n", - "framework_version = \"1.13\"\n", - "compiled_model_path = \"s3://{}/{}/output\".format(bucket, compilation_job_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Invoke Neo Compilation API" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create a PyTorch SageMaker model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use the `PyTorchModel` and define parameters including the path to the model, the `entry_point` script that is used to perform inference, and other version and environment variables." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from sagemaker.pytorch.model import PyTorchModel\n", - "from sagemaker.predictor import Predictor\n", - "\n", - "sagemaker_model = PyTorchModel(\n", - " model_data=model_path,\n", - " predictor_cls=Predictor,\n", - " framework_version=framework_version,\n", - " role=role,\n", - " sagemaker_session=sess,\n", - " entry_point=\"resnet18.py\",\n", - " source_dir=\"code\",\n", - " py_version=\"py3\",\n", - " env={\"MMS_DEFAULT_RESPONSE_TIMEOUT\": \"500\"},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Use Neo compiler to compile the model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Run the compilation job, which is saved in S3 at the specified `compiled_model_path` location." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "compiled_model = sagemaker_model.compile(\n", - " target_instance_family=target_device,\n", - " input_shape=data_shape,\n", - " job_name=compilation_job_name,\n", - " role=role,\n", - " framework=framework.lower(),\n", - " framework_version=framework_version,\n", - " output_path=compiled_model_path,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Deploy the model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Deploy the compiled model to an endpoint so it can be used for inference." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "predictor = compiled_model.deploy(initial_instance_count=1, instance_type=\"ml.c5.9xlarge\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Send requests" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's send a picture to the endpoint to predict the image subject.\n", - "\n", - "![title](cat.jpg)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Open the image and pass the payload as a bytearray to the predictor, receiving a response." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "import json\n", - "\n", - "with open(\"cat.jpg\", \"rb\") as f:\n", - " payload = f.read()\n", - " payload = bytearray(payload)\n", - "\n", - "response = predictor.predict(payload)\n", - "result = json.loads(response.decode())\n", - "print(\"Most likely class: {}\".format(np.argmax(result)))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use the ImageNet class ID response to look up which subject the image contains, and with what probability." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "# Load names for ImageNet classes\n", - "object_categories = {}\n", - "with open(\"imagenet1000_clsidx_to_labels.txt\", \"r\") as f:\n", - " for line in f:\n", - " key, val = line.strip().split(\":\")\n", - " object_categories[key] = val.strip(\" \").strip(\",\")\n", - "print(\n", - " \"The label is\",\n", - " object_categories[str(np.argmax(result))],\n", - " \"with probability\",\n", - " str(np.amax(result))[:5],\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Delete the Endpoint\n", - "Delete the endpoint to avoid incurring costs now that it is no longer needed." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "predictor.delete_model()\n", - "sess.delete_endpoint(predictor.endpoint_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker_neo_compilation_jobs|pytorch_torchvision|pytorch_torchvision_neo.ipynb)\n" - ] - } - ], - "metadata": { - "availableInstances": [ - { - "_defaultOrder": 0, - "_isFastLaunch": true, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 4, - "name": "ml.t3.medium", - "vcpuNum": 2 - }, - { - "_defaultOrder": 1, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.t3.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 2, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.t3.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 3, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.t3.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 4, - "_isFastLaunch": true, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.m5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 5, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.m5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 6, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.m5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 7, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.m5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 8, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - 
"memoryGiB": 128, - "name": "ml.m5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 9, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.m5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 10, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.m5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 11, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.m5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 12, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.m5d.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 13, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.m5d.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 14, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.m5d.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 15, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.m5d.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 16, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.m5d.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 17, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.m5d.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 18, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - 
"memoryGiB": 256, - "name": "ml.m5d.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 19, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.m5d.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 20, - "_isFastLaunch": false, - "category": "General purpose", - "gpuNum": 0, - "hideHardwareSpecs": true, - "memoryGiB": 0, - "name": "ml.geospatial.interactive", - "supportedImageNames": [ - "sagemaker-geospatial-v1-0" - ], - "vcpuNum": 0 - }, - { - "_defaultOrder": 21, - "_isFastLaunch": true, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 4, - "name": "ml.c5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 22, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 8, - "name": "ml.c5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 23, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.c5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 24, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.c5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 25, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 72, - "name": "ml.c5.9xlarge", - "vcpuNum": 36 - }, - { - "_defaultOrder": 26, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 96, - "name": "ml.c5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 27, - "_isFastLaunch": false, - "category": "Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 144, - "name": "ml.c5.18xlarge", - "vcpuNum": 72 - }, - { - "_defaultOrder": 28, - "_isFastLaunch": false, - "category": 
"Compute optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.c5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 29, - "_isFastLaunch": true, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.g4dn.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 30, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.g4dn.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 31, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.g4dn.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 32, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.g4dn.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 33, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.g4dn.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 34, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.g4dn.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 35, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 61, - "name": "ml.p3.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 36, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 244, - "name": "ml.p3.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 37, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 488, - "name": "ml.p3.16xlarge", - "vcpuNum": 64 - }, - { - 
"_defaultOrder": 38, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.p3dn.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 39, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.r5.large", - "vcpuNum": 2 - }, - { - "_defaultOrder": 40, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.r5.xlarge", - "vcpuNum": 4 - }, - { - "_defaultOrder": 41, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.r5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 42, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.r5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 43, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.r5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 44, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.r5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 45, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 512, - "name": "ml.r5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 46, - "_isFastLaunch": false, - "category": "Memory Optimized", - "gpuNum": 0, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.r5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 47, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 16, - "name": "ml.g5.xlarge", - "vcpuNum": 4 - }, - 
{ - "_defaultOrder": 48, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 32, - "name": "ml.g5.2xlarge", - "vcpuNum": 8 - }, - { - "_defaultOrder": 49, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 64, - "name": "ml.g5.4xlarge", - "vcpuNum": 16 - }, - { - "_defaultOrder": 50, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 128, - "name": "ml.g5.8xlarge", - "vcpuNum": 32 - }, - { - "_defaultOrder": 51, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 1, - "hideHardwareSpecs": false, - "memoryGiB": 256, - "name": "ml.g5.16xlarge", - "vcpuNum": 64 - }, - { - "_defaultOrder": 52, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 192, - "name": "ml.g5.12xlarge", - "vcpuNum": 48 - }, - { - "_defaultOrder": 53, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 4, - "hideHardwareSpecs": false, - "memoryGiB": 384, - "name": "ml.g5.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 54, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 768, - "name": "ml.g5.48xlarge", - "vcpuNum": 192 - }, - { - "_defaultOrder": 55, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 1152, - "name": "ml.p4d.24xlarge", - "vcpuNum": 96 - }, - { - "_defaultOrder": 56, - "_isFastLaunch": false, - "category": "Accelerated computing", - "gpuNum": 8, - "hideHardwareSpecs": false, - "memoryGiB": 1152, - "name": "ml.p4de.24xlarge", - "vcpuNum": 96 - } - ], - "kernelspec": { - "display_name": "Python 3 (PyTorch 1.12 Python 3.8 CPU Optimized)", - "language": "python", - "name": 
"python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/pytorch-1.12-cpu-py38" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.16" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/sagemaker_processing/basic_sagemaker_data_processing/basic_sagemaker_processing.ipynb b/sagemaker_processing/basic_sagemaker_data_processing/basic_sagemaker_processing.ipynb deleted file mode 100644 index 86324ba686..0000000000 --- a/sagemaker_processing/basic_sagemaker_data_processing/basic_sagemaker_processing.ipynb +++ /dev/null @@ -1,378 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "# Get started with SageMaker Processing\n" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "\n", - "This notebook corresponds to the section \"Preprocessing Data With The Built-In Scikit-Learn Container\" in the blog post [Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/). 
\n", - "It shows a lightweight example of using SageMaker Processing to create train, test, and validation datasets. SageMaker Processing is used to create these datasets, which then are written back to S3.\n", - "\n", - "## Runtime\n", - "\n", - "This notebook takes approximately 5 minutes to run.\n", - "\n", - "## Contents\n", - "\n", - "1. [Prepare resources](#Prepare-resources)\n", - "1. [Download data](#Download-data)\n", - "1. [Prepare Processing script](#Prepare-Processing-script)\n", - "1. [Run Processing job](#Run-Processing-job)\n", - "1. [Conclusion](#Conclusion)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Prepare resources\n", - "\n", - "First, let’s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "!pip install -U sagemaker" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import boto3\n", - "import sagemaker\n", - "from sagemaker import get_execution_role\n", - "from sagemaker.sklearn.processing import SKLearnProcessor\n", - "\n", - "region = sagemaker.Session().boto_region_name\n", - "role = get_execution_role()\n", - "sklearn_processor = SKLearnProcessor(\n", - " framework_version=\"1.2-1\", role=role, instance_type=\"ml.m5.xlarge\", instance_count=1\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Download data\n", - "\n", - "Read in the raw data from a public S3 bucket. 
This example uses the [Census-Income (KDD) Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29) from the UCI Machine Learning Repository.\n", - "\n", - "> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "s3 = boto3.client(\"s3\")\n", - "s3.download_file(\n", - " \"sagemaker-sample-data-{}\".format(region),\n", - " \"processing/census/census-income.csv\",\n", - " \"census-income.csv\",\n", - ")\n", - "df = pd.read_csv(\"census-income.csv\")\n", - "df.to_csv(\"dataset.csv\")\n", - "df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Prepare Processing script\n", - "\n", - "Write the Python script that will be run by SageMaker Processing. This script reads the single data file from S3; splits the rows into train, test, and validation sets; and then writes the three output files to S3." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "%%writefile preprocessing.py\n", - "import pandas as pd\n", - "import os\n", - "from sklearn.model_selection import train_test_split\n", - "\n", - "input_data_path = os.path.join(\"/opt/ml/processing/input\", \"dataset.csv\")\n", - "df = pd.read_csv(input_data_path)\n", - "print(\"Shape of data is:\", df.shape)\n", - "train, test = train_test_split(df, test_size=0.2)\n", - "train, validation = train_test_split(train, test_size=0.2)\n", - "\n", - "try:\n", - " os.makedirs(\"/opt/ml/processing/output/train\")\n", - " os.makedirs(\"/opt/ml/processing/output/validation\")\n", - " os.makedirs(\"/opt/ml/processing/output/test\")\n", - " print(\"Successfully created directories\")\n", - "except Exception as e:\n", - " # if the Processing call already creates these directories (or directory otherwise cannot be created)\n", - " print(e)\n", - " print(\"Could not make directories\")\n", - " pass\n", - "\n", - "try:\n", - " train.to_csv(\"/opt/ml/processing/output/train/train.csv\")\n", - " validation.to_csv(\"/opt/ml/processing/output/validation/validation.csv\")\n", - " test.to_csv(\"/opt/ml/processing/output/test/test.csv\")\n", - " print(\"Wrote files successfully\")\n", - "except Exception as e:\n", - " print(\"Failed to write the files\")\n", - " print(e)\n", - " pass\n", - "\n", - "print(\"Completed running the processing job\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Run Processing job" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Run the Processing job, specifying the script name, input file, and output files." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "%%capture output\n", - "\n", - "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", - "\n", - "sklearn_processor.run(\n", - " code=\"preprocessing.py\",\n", - " # arguments = [\"arg1\", \"arg2\"], # Arguments can optionally be specified here\n", - " inputs=[ProcessingInput(source=\"dataset.csv\", destination=\"/opt/ml/processing/input\")],\n", - " outputs=[\n", - " ProcessingOutput(source=\"/opt/ml/processing/output/train\"),\n", - " ProcessingOutput(source=\"/opt/ml/processing/output/validation\"),\n", - " ProcessingOutput(source=\"/opt/ml/processing/output/test\"),\n", - " ],\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Get the Processing job logs and retrieve the job name." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "print(output)\n", - "job_name = str(output).split(\"\\n\")[1].split(\" \")[-1]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "Confirm that the output dataset files were written to S3." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "import boto3\n", - "\n", - "s3_client = boto3.client(\"s3\")\n", - "default_bucket = sagemaker.Session().default_bucket()\n", - "for i in range(1, 4):\n", - " prefix = s3_client.list_objects(Bucket=default_bucket, Prefix=\"sagemaker-scikit-learn\")[\n", - " \"Contents\"\n", - " ][-i][\"Key\"]\n", - " print(\"s3://\" + default_bucket + \"/\" + prefix)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "pycharm": { - "name": "#%% md\n" - } - }, - "source": [ - "## Conclusion\n", - "\n", - "In this notebook, we read a dataset from S3 and processed it into train, test, and validation sets using a SageMaker Processing job. You can extend this example for preprocessing your own datasets in preparation for machine learning or other applications." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (Data Science 3.0)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.6" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.ipynb b/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.ipynb deleted file mode 100644 index b48847305e..0000000000 --- a/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.ipynb +++ /dev/null @@ -1,705 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Distributed Data Processing using Apache Spark and SageMaker Processing\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "---\n", - "\n", - "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", - "\n", - "![This us-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "Apache Spark is a unified analytics engine for large-scale data processing. The Spark framework is often used within the context of machine learning workflows to run data transformation or feature engineering workloads at scale. Amazon SageMaker provides a set of prebuilt Docker images that include Apache Spark and other dependencies needed to run distributed data processing jobs on Amazon SageMaker. This example notebook demonstrates how to use the prebuilt Spark images on SageMaker Processing using the SageMaker Python SDK.\n", - "\n", - "This notebook walks through the following scenarios to illustrate the functionality of the SageMaker Spark Container:\n", - "\n", - "* Running a basic PySpark application using the SageMaker Python SDK's `PySparkProcessor` class\n", - "* Viewing the Spark UI via the `start_history_server()` function of a `PySparkProcessor` object\n", - "* Adding additional Python and jar file dependencies to jobs\n", - "* Running a basic Java/Scala-based Spark job using the SageMaker Python SDK's `SparkJarProcessor` class\n", - "* Specifying additional Spark configuration" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Runtime\n", - "\n", - "This notebook takes approximately 22 minutes to run." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Contents\n", - "\n", - "1. [Setup](#Setup)\n", - "1. [Example 1: Running a basic PySpark application](#Example-1:-Running-a-basic-PySpark-application)\n", - "1. [Example 2: Specify additional Python and jar file dependencies](#Example-2:-Specify-additional-Python-and-jar-file-dependencies)\n", - "1. 
[Example 3: Run a Java/Scala Spark application](#Example-3:-Run-a-Java/Scala-Spark-application)\n", - "1. [Example 4: Specifying additional Spark configuration](#Example-4:-Specifying-additional-Spark-configuration)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setup" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Install the latest SageMaker Python SDK\n", - "\n", - "This notebook requires the latest v2.x version of the SageMaker Python SDK. First, ensure that the latest version is installed." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -U \"sagemaker>2.0\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Restart your notebook kernel after upgrading the SDK*" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Example 1: Running a basic PySpark application\n", - "\n", - "The first example is a basic Spark MLlib data processing script. This script takes a raw dataset and applies transformations to it, such as string indexing and one-hot encoding.\n", - "\n", - "### Setup S3 bucket locations and roles\n", - "\n", - "First, set up some locations in the default SageMaker bucket to store the raw input datasets and the Spark job output. Here, you'll also define the role that will be used to run all SageMaker Processing jobs."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import logging\n", - "import sagemaker\n", - "from time import gmtime, strftime\n", - "\n", - "sagemaker_logger = logging.getLogger(\"sagemaker\")\n", - "sagemaker_logger.setLevel(logging.INFO)\n", - "sagemaker_logger.addHandler(logging.StreamHandler())\n", - "\n", - "sagemaker_session = sagemaker.Session()\n", - "bucket = sagemaker_session.default_bucket()\n", - "role = sagemaker.get_execution_role()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, you'll download the example dataset from a SageMaker staging bucket." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Fetch the dataset from the SageMaker bucket\n", - "import boto3\n", - "\n", - "s3 = boto3.client(\"s3\")\n", - "s3.download_file(\n", - " f\"sagemaker-example-files-prod-{sagemaker_session.boto_region_name}\",\n", - " \"datasets/tabular/uci_abalone/abalone.csv\",\n", - " \"./data/abalone.csv\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Write the PySpark script\n", - "\n", - "The source for a preprocessing script is in the cell below. The cell uses the `%%writefile` directive to save this file locally. This script does some basic feature engineering on a raw input dataset. In this example, the dataset is the [Abalone Data Set](https://archive.ics.uci.edu/ml/datasets/abalone) and the code below performs string indexing, one hot encoding, vector assembly, and combines them into a pipeline to perform these transformations in order. The script then does an 80-20 split to produce training and validation datasets as output." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%writefile ./code/preprocess.py\n", - "from __future__ import print_function\n", - "from __future__ import unicode_literals\n", - "\n", - "import argparse\n", - "import csv\n", - "import os\n", - "import shutil\n", - "import sys\n", - "import time\n", - "\n", - "import pyspark\n", - "from pyspark.sql import SparkSession\n", - "from pyspark.ml import Pipeline\n", - "from pyspark.ml.feature import (\n", - " OneHotEncoder,\n", - " StringIndexer,\n", - " VectorAssembler,\n", - " VectorIndexer,\n", - ")\n", - "from pyspark.sql.functions import *\n", - "from pyspark.sql.types import (\n", - " DoubleType,\n", - " StringType,\n", - " StructField,\n", - " StructType,\n", - ")\n", - "\n", - "\n", - "def csv_line(data):\n", - " r = \",\".join(str(d) for d in data[1])\n", - " return str(data[0]) + \",\" + r\n", - "\n", - "\n", - "def main():\n", - " parser = argparse.ArgumentParser(description=\"app inputs and outputs\")\n", - " parser.add_argument(\"--s3_input_bucket\", type=str, help=\"s3 input bucket\")\n", - " parser.add_argument(\"--s3_input_key_prefix\", type=str, help=\"s3 input key prefix\")\n", - " parser.add_argument(\"--s3_output_bucket\", type=str, help=\"s3 output bucket\")\n", - " parser.add_argument(\"--s3_output_key_prefix\", type=str, help=\"s3 output key prefix\")\n", - " args = parser.parse_args()\n", - "\n", - " spark = SparkSession.builder.appName(\"PySparkApp\").getOrCreate()\n", - "\n", - " # This is needed to save RDDs which is the only way to write nested Dataframes into CSV format\n", - " spark.sparkContext._jsc.hadoopConfiguration().set(\n", - " \"mapred.output.committer.class\", \"org.apache.hadoop.mapred.FileOutputCommitter\"\n", - " )\n", - "\n", - " # Defining the schema corresponding to the input data. 
The input data does not contain the headers\n", - " schema = StructType(\n", - " [\n", - " StructField(\"sex\", StringType(), True),\n", - " StructField(\"length\", DoubleType(), True),\n", - " StructField(\"diameter\", DoubleType(), True),\n", - " StructField(\"height\", DoubleType(), True),\n", - " StructField(\"whole_weight\", DoubleType(), True),\n", - " StructField(\"shucked_weight\", DoubleType(), True),\n", - " StructField(\"viscera_weight\", DoubleType(), True),\n", - " StructField(\"shell_weight\", DoubleType(), True),\n", - " StructField(\"rings\", DoubleType(), True),\n", - " ]\n", - " )\n", - "\n", - " # Downloading the data from S3 into a Dataframe\n", - " total_df = spark.read.csv(\n", - " (\"s3://\" + os.path.join(args.s3_input_bucket, args.s3_input_key_prefix, \"abalone.csv\")),\n", - " header=False,\n", - " schema=schema,\n", - " )\n", - "\n", - " # StringIndexer on the sex column which has categorical value\n", - " sex_indexer = StringIndexer(inputCol=\"sex\", outputCol=\"indexed_sex\")\n", - "\n", - " # one-hot-encoding is being performed on the string-indexed sex column (indexed_sex)\n", - " sex_encoder = OneHotEncoder(inputCol=\"indexed_sex\", outputCol=\"sex_vec\")\n", - "\n", - " # vector-assembler will bring all the features to a 1D vector for us to save easily into CSV format\n", - " assembler = VectorAssembler(\n", - " inputCols=[\n", - " \"sex_vec\",\n", - " \"length\",\n", - " \"diameter\",\n", - " \"height\",\n", - " \"whole_weight\",\n", - " \"shucked_weight\",\n", - " \"viscera_weight\",\n", - " \"shell_weight\",\n", - " ],\n", - " outputCol=\"features\",\n", - " )\n", - "\n", - " # The pipeline is comprised of the steps added above\n", - " pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler])\n", - "\n", - " # This step trains the feature transformers\n", - " model = pipeline.fit(total_df)\n", - "\n", - " # This step transforms the dataset with information obtained from the previous fit\n", - " transformed_total_df = 
model.transform(total_df)\n", - "\n", - " # Split the overall dataset into 80-20 training and validation\n", - " (train_df, validation_df) = transformed_total_df.randomSplit([0.8, 0.2])\n", - "\n", - " # Convert the train dataframe to RDD to save in CSV format and upload to S3\n", - " train_rdd = train_df.rdd.map(lambda x: (x.rings, x.features))\n", - " train_lines = train_rdd.map(csv_line)\n", - " train_lines.saveAsTextFile(\n", - " \"s3://\" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, \"train\")\n", - " )\n", - "\n", - " # Convert the validation dataframe to RDD to save in CSV format and upload to S3\n", - " validation_rdd = validation_df.rdd.map(lambda x: (x.rings, x.features))\n", - " validation_lines = validation_rdd.map(csv_line)\n", - " validation_lines.saveAsTextFile(\n", - " \"s3://\" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, \"validation\")\n", - " )\n", - "\n", - "\n", - "if __name__ == \"__main__\":\n", - " main()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Run the SageMaker Processing Job\n", - "\n", - "Next, you'll use the `PySparkProcessor` class to define a Spark job and run it using SageMaker Processing. 
A few things to note in the definition of the `PySparkProcessor`:\n", - "\n", - "* This is a multi-node job with two m5.xlarge instances (specified via the `instance_count` and `instance_type` parameters)\n", - "* Spark framework version 3.1 is specified via the `framework_version` parameter\n", - "* The PySpark script defined above is passed via the `submit_app` parameter\n", - "* Command-line arguments to the PySpark script (such as the S3 input and output locations) are passed via the `arguments` parameter\n", - "* Spark event logs will be offloaded to the S3 location specified in `spark_event_logs_s3_uri` and can be used to view the Spark UI while the job is in progress or after it completes\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.spark.processing import PySparkProcessor\n", - "\n", - "# Upload the raw input dataset to a unique S3 location\n", - "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", - "prefix = \"sagemaker/spark-preprocess-demo/{}\".format(timestamp_prefix)\n", - "input_prefix_abalone = \"{}/input/raw/abalone\".format(prefix)\n", - "input_preprocessed_prefix_abalone = \"{}/input/preprocessed/abalone\".format(prefix)\n", - "\n", - "sagemaker_session.upload_data(\n", - " path=\"./data/abalone.csv\", bucket=bucket, key_prefix=input_prefix_abalone\n", - ")\n", - "\n", - "# Run the processing job\n", - "spark_processor = PySparkProcessor(\n", - " base_job_name=\"sm-spark\",\n", - " framework_version=\"3.1\",\n", - " role=role,\n", - " instance_count=2,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " max_runtime_in_seconds=1200,\n", - ")\n", - "\n", - "spark_processor.run(\n", - " submit_app=\"./code/preprocess.py\",\n", - " arguments=[\n", - " \"--s3_input_bucket\",\n", - " bucket,\n", - " \"--s3_input_key_prefix\",\n", - " input_prefix_abalone,\n", - " \"--s3_output_bucket\",\n", - " bucket,\n", - " \"--s3_output_key_prefix\",\n", -
" input_preprocessed_prefix_abalone,\n", - " ],\n", - " spark_event_logs_s3_uri=\"s3://{}/{}/spark_event_logs\".format(bucket, prefix),\n", - " logs=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Validate Data Processing Results\n", - "\n", - "Next, validate the output of our data preprocessing job by looking at the first 5 rows of the output dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Top 5 rows from s3://{}/{}/train/\".format(bucket, input_preprocessed_prefix_abalone))\n", - "!aws s3 cp --quiet s3://$bucket/$input_preprocessed_prefix_abalone/train/part-00000 - | head -n5" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### View the Spark UI\n", - "\n", - "Next, you can view the Spark UI by running the history server locally in this notebook. (**Note:** this feature will only work in a local development environment with docker installed or on a Sagemaker Notebook Instance. This feature does not currently work in SageMaker Studio.)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# uses docker\n", - "spark_processor.start_history_server()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "After viewing the Spark UI, you can terminate the history server before proceeding." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "spark_processor.terminate_history_server()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Example 2: Specify additional Python and jar file dependencies\n", - "\n", - "The next example demonstrates a scenario where additional Python file dependencies are required by the PySpark script. You'll use a sample PySpark script that requires additional user-defined functions (UDFs) defined in a local module." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%writefile ./code/hello_py_spark_app.py\n", - "import argparse\n", - "import time\n", - "\n", - "# Import local module to test spark-submit --py-files dependencies\n", - "import hello_py_spark_udfs as udfs\n", - "from pyspark.sql import SparkSession, SQLContext\n", - "from pyspark.sql.functions import udf\n", - "from pyspark.sql.types import IntegerType\n", - "import time\n", - "\n", - "if __name__ == \"__main__\":\n", - " print(\"Hello World, this is PySpark!\")\n", - "\n", - " parser = argparse.ArgumentParser(description=\"inputs and outputs\")\n", - " parser.add_argument(\"--input\", type=str, help=\"path to input data\")\n", - " parser.add_argument(\"--output\", required=False, type=str, help=\"path to output data\")\n", - " args = parser.parse_args()\n", - " spark = SparkSession.builder.appName(\"SparkTestApp\").getOrCreate()\n", - " sqlContext = SQLContext(spark.sparkContext)\n", - "\n", - " # Load test data set\n", - " inputPath = args.input\n", - " outputPath = args.output\n", - " salesDF = spark.read.json(inputPath)\n", - " salesDF.printSchema()\n", - "\n", - " salesDF.createOrReplaceTempView(\"sales\")\n", - "\n", - " # Define a UDF that doubles an integer column\n", - " # The UDF is imported from a local module to test spark-submit --py-files dependencies\n", - " double_udf_int = udf(udfs.double_x, IntegerType())\n", - "\n", - " # Save transformed data set to disk\n", - " salesDF.select(\"date\", \"sale\", double_udf_int(\"sale\").alias(\"sale_double\")).write.json(\n", - " outputPath\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%writefile ./code/hello_py_spark_udfs.py\n", - "def double_x(x):\n", - " return x + x" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Create a processing job with Python file dependencies\n", -
"\n", - "Then, you'll create a processing job where the additional Python file dependencies are specified via the `submit_py_files` argument in the `run()` function. If your Spark application requires additional jar file dependencies, these can be specified via the `submit_jars` argument of the `run()` function." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define job input/output URIs\n", - "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", - "prefix = \"sagemaker/spark-preprocess-demo/{}\".format(timestamp_prefix)\n", - "input_prefix_sales = \"{}/input/sales\".format(prefix)\n", - "output_prefix_sales = \"{}/output/sales\".format(prefix)\n", - "input_s3_uri = \"s3://{}/{}\".format(bucket, input_prefix_sales)\n", - "output_s3_uri = \"s3://{}/{}\".format(bucket, output_prefix_sales)\n", - "\n", - "sagemaker_session.upload_data(\n", - " path=\"./data/data.jsonl\", bucket=bucket, key_prefix=input_prefix_sales\n", - ")\n", - "\n", - "spark_processor = PySparkProcessor(\n", - " base_job_name=\"sm-spark-udfs\",\n", - " framework_version=\"3.1\",\n", - " role=role,\n", - " instance_count=2,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " max_runtime_in_seconds=1200,\n", - ")\n", - "\n", - "spark_processor.run(\n", - " submit_app=\"./code/hello_py_spark_app.py\",\n", - " submit_py_files=[\"./code/hello_py_spark_udfs.py\"],\n", - " arguments=[\"--input\", input_s3_uri, \"--output\", output_s3_uri],\n", - " logs=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Validate Data Processing Results\n", - "\n", - "Next, validate the output of the Spark job by ensuring that the output URI contains the Spark `_SUCCESS` file along with the output json lines file." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Output files in {}\".format(output_s3_uri))\n", - "!aws s3 ls $output_s3_uri/" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Example 3: Run a Java/Scala Spark application\n", - "\n", - "In the next example, you'll take a Spark application jar (located in `./code/spark-test-app.jar`) that is already built and run it using SageMaker Processing. Here, you'll use the `SparkJarProcessor` class to define the job parameters. \n", - "\n", - "In the `run()` function you'll specify: \n", - "\n", - "* The location of the Spark application jar file in the `submit_app` argument\n", - "* The main class for the Spark application in the `submit_class` argument\n", - "* Input/output arguments for the Spark application" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.spark.processing import SparkJarProcessor\n", - "\n", - "# Upload the raw input dataset to S3\n", - "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", - "prefix = \"sagemaker/spark-preprocess-demo/{}\".format(timestamp_prefix)\n", - "input_prefix_sales = \"{}/input/sales\".format(prefix)\n", - "output_prefix_sales = \"{}/output/sales\".format(prefix)\n", - "input_s3_uri = \"s3://{}/{}\".format(bucket, input_prefix_sales)\n", - "output_s3_uri = \"s3://{}/{}\".format(bucket, output_prefix_sales)\n", - "\n", - "sagemaker_session.upload_data(\n", - " path=\"./data/data.jsonl\", bucket=bucket, key_prefix=input_prefix_sales\n", - ")\n", - "\n", - "spark_processor = SparkJarProcessor(\n", - " base_job_name=\"sm-spark-java\",\n", - " framework_version=\"3.1\",\n", - " role=role,\n", - " instance_count=2,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " max_runtime_in_seconds=1200,\n", - ")\n", - "\n", - "spark_processor.run(\n", - " submit_app=\"./code/spark-test-app.jar\",\n", - " 
submit_class=\"com.amazonaws.sagemaker.spark.test.HelloJavaSparkApp\",\n", - " arguments=[\"--input\", input_s3_uri, \"--output\", output_s3_uri],\n", - " logs=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Example 4: Specifying additional Spark configuration\n", - "\n", - "Overriding Spark configuration is crucial for a number of tasks such as tuning your Spark application or configuring the Hive metastore. Using the SageMaker Python SDK, you can easily override Spark/Hive/Hadoop configuration.\n", - "\n", - "The next example demonstrates this by overriding Spark executor memory/cores.\n", - "\n", - "For more information on configuring your Spark application, see the EMR documentation on [Configuring Applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Upload the raw input dataset to a unique S3 location\n", - "timestamp_prefix = strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", - "prefix = \"sagemaker/spark-preprocess-demo/{}\".format(timestamp_prefix)\n", - "input_prefix_abalone = \"{}/input/raw/abalone\".format(prefix)\n", - "input_preprocessed_prefix_abalone = \"{}/input/preprocessed/abalone\".format(prefix)\n", - "\n", - "sagemaker_session.upload_data(\n", - " path=\"./data/abalone.csv\", bucket=bucket, key_prefix=input_prefix_abalone\n", - ")\n", - "\n", - "spark_processor = PySparkProcessor(\n", - " base_job_name=\"sm-spark\",\n", - " framework_version=\"3.1\",\n", - " role=role,\n", - " instance_count=2,\n", - " instance_type=\"ml.m5.xlarge\",\n", - " max_runtime_in_seconds=1200,\n", - ")\n", - "\n", - "configuration = [\n", - " {\n", - " \"Classification\": \"spark-defaults\",\n", - " \"Properties\": {\"spark.executor.memory\": \"2g\", \"spark.executor.cores\": \"1\"},\n", - " }\n", - "]\n", - "\n", - "spark_processor.run(\n", - " 
submit_app=\"./code/preprocess.py\",\n", - " arguments=[\n", - " \"--s3_input_bucket\",\n", - " bucket,\n", - " \"--s3_input_key_prefix\",\n", - " input_prefix_abalone,\n", - " \"--s3_output_bucket\",\n", - " bucket,\n", - " \"--s3_output_key_prefix\",\n", - " input_preprocessed_prefix_abalone,\n", - " ],\n", - " configuration=configuration,\n", - " logs=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Notebook CI Test Results\n", - "\n", - "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", - "\n", - "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This ap-southeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n", - "\n", - "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/sagemaker_processing|spark_distributed_data_processing|sagemaker-spark-processing.ipynb)\n" - ] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "conda_python3", - "language": "python", - "name": "conda_python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.8" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}