diff --git a/end_to_end/fraud_detection/0-AutoClaimFraudDetection.ipynb b/end_to_end/fraud_detection/0-AutoClaimFraudDetection.ipynb index 10f0eab24f..a0bf5248a8 100644 --- a/end_to_end/fraud_detection/0-AutoClaimFraudDetection.ipynb +++ b/end_to_end/fraud_detection/0-AutoClaimFraudDetection.ipynb @@ -4,153 +4,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# SageMaker End to End Solutions: Fraud Detection for Automobile Claims" + "# Fraud Detection for Automobile Claims: Data Exploration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", + "## Background\n", "\n", - "## [Overview](./0-AutoClaimFraudDetection.ipynb)\n", - "* **[Notebook 0 : Overview, Architecture, and Data Exploration](./0-AutoClaimFraudDetection.ipynb)**\n", - " * **[Business Problem](#business-problem)**\n", - " * **[Technical Solution](#nb0-solution)**\n", - " * **[Solution Components](#nb0-components)**\n", - " * **[Solution Architecture](#nb0-architecture)**\n", - " * **[DataSets and Exploratory Data Analysis](#nb0-data-explore)**\n", - " * **[Exploratory Data Science and Operational ML workflows](#nb0-workflows)**\n", - " * **[The ML Life Cycle: Detailed View](#nb0-ml-lifecycle)**\n", - "* [Notebook 1: Data Prep, Process, Store Features](./1-data-prep-e2e.ipynb)\n", - " * Architecture\n", - " * Getting started\n", - " * DataSets\n", - " * SageMaker Feature Store\n", - " * Create train and test datasets\n", - "* [Notebook 2: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", - " * Architecture\n", - " * Train a model using XGBoost\n", - " * Model lineage with artifacts and associations\n", - " * Evaluate the model for bias with Clarify\n", - " * Deposit Model and Lineage in SageMaker Model Registry\n", - "* [Notebook 3: Mitigate Bias, Train New Model, Store in Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", - " * Architecture\n", - " * Develop a second model\n", - " * Analyze 
the Second Model for Bias\n", - " * View Results of Clarify Bias Detection Job\n", - " * Configure and Run Clarify Explainability Job\n", - " * Create Model Package for second trained model\n", - "* [Notebook 4: Deploy Model, Run Predictions](./4-deploy-run-inference-e2e.ipynb)\n", - " * Architecture\n", - " * Deploy an approved model and Run Inference via Feature Store\n", - " * Create a Predictor\n", - " * Run Predictions from Online FeatureStore\n", - "* [Notebook 5 : Create and Run an End-to-End Pipeline to Deploy the Model](./5-pipeline-e2e.ipynb)\n", - " * Architecture\n", - " * Create an Automated Pipeline\n", - " * Clean up" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview, Architecture, and Data Exploration\n", - "\n", - "In this overview notebook, we will address business problems regarding auto insurance fraud, technical solutions, explore dataset, solution architecture, and scope the machine learning (ML) life cycle." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "## Business Problem\n", - "\n", - "[overview](#overview-0)\n", - "\n", - " \"Auto insurance fraud ranges from misrepresenting facts on insurance applications and inflating insurance claims to staging accidents and submitting claim forms for injuries or damage that never occurred, to false reports of stolen vehicles.\n", - "Fraud accounted for between 15 percent and 17 percent of total claims payments for auto insurance bodily injury in 2012, according to an Insurance Research Council (IRC) study. The study estimated that between $\\$5.6$ billion and $\\$7.7$ billion was fraudulently added to paid claims for auto insurance bodily injury payments in 2012, compared with a range of $\\$4.3$ billion to $\\$5.8$ billion in 2002. 
\" [source: Insurance Information Institute](https://www.iii.org/article/background-on-insurance-fraud)\n", - "\n", - "In this example, we will use an *auto insurance domain* to detect claims that are possibly fraudulent. \n", - "more precisley we address the use-case: \"what is the likelihood that a given autoclaim is fraudulent?\" , and explore the technical solution. \n", - "\n", - "As you review the [notebooks](#nb0-notebooks) and the [architectures](#nb0-architecture) presented at each stage of the ML life cycle, you will see how you can leverage SageMaker services and features to enhance your effectiveness as a data scientist, as a machine learning engineer, and as an ML Ops Engineer.\n", - "\n", - "We will then do [data exploration](#nb0-data-explore) on the synthetically generated datasets for Customers and Claims.\n", - "\n", - "Then, we will provide an overview of the technical solution by examining the [Solution Components](#nb0-components) and the [Solution Architecture](#nb0-architecture).\n", - "We will be motivated by the need to accomplish new tasks in ML by examining a [detailed view of the Machine Learning Life-cycle](#nb0-ml-lifecycle), recognizing the [separation of exploratory data science and operationalizing an ML worklfow](#nb0-workflows).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Car Insurance Claims: Data Sets and Problem Domain\n", - "\n", - "The inputs for building our model and workflow are two tables of insurance data: a claims table and a customers table. This data was synthetically generated is provided to you in its raw state for pre-processing with SageMaker Data Wrangler. However, completing the Data Wragnler step is not required to continue with the rest of this notebook. If you wish, you may use the `claims_preprocessed.csv` and `customers_preprocessed.csv` in the `data` directory as they are exact copies of what Data Wragnler would output." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "## Technical Solution\n", - "[overview](#overview-0)\n", + "This notebook is the first part of a series of notebooks that will demonstrate how to prepare, train, and deploy a model that detects fradulent autoclaims. In this notebook, we will focusing on data exploration. You can choose to run this notebook by itself or in sequence with the other notebooks listed below. Please see the [README.md](README.md) for more information about this use case implemented by this series of notebooks. \n", "\n", - "In this introduction, you will look at the technical architecture and solution components to build a solution for predicting fraudulent insurance claims and deploy it using SageMaker for real-time predictions. While a deployed model is the end-product of this notebook series, the purpose of this guide is to walk you through all the detailed stages of the [machine learning (ML) lifecycle](#ml-lifecycle) and show you what SageMaker servcies and features are there to support your activities in each stage.\n", "\n", - "**Topics**\n", - "- [Solution Components](#nb0-components)\n", - "- [Solution Architecture](#nb0-architecture)\n", - "- [Code Resources](#nb0-code)\n", - "- [ML lifecycle details](#nb0-ml-lifecycle)\n", - "- [Manual/exploratory and automated workflows](#nb0-workflows) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "## Solution Components\n", - "[overview](#overview-0)\n", - " \n", - "The following [SageMaker](https://sagemaker.readthedocs.io/en/stable/v2.html) Services are used in this solution:\n", - "\n", - " 1. [SageMaker DataWrangler](https://aws.amazon.com/sagemaker/data-wrangler/) - [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html)\n", - " 1. 
[SageMaker Processing](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/) - [docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html)\n", - " 1. [SageMaker Feature Store](https://aws.amazon.com/sagemaker/feature-store/)- [docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_featurestore.html)\n", - " 1. [SageMaker Clarify](https://aws.amazon.com/sagemaker/clarify/)- [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-run.html)\n", - " 1. [SageMaker Training with XGBoost Algorithm and Hyperparameter Optimization](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html)- [docs](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/index.html)\n", - " 1. [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html)- [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-deploy.html#model-registry-deploy-api)\n", - " 1. [SageMaker Hosted Endpoints]()- [predictors - docs](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html)\n", - " 1. [SageMaker Pipelines]()- [docs](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/index.html)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![Solution Components](images/solution-components-e2e.png)" + "1. **[Fraud Detection for Automobile Claims: Data Exploration](./0-AutoClaimFraudDetection.ipynb)**\n", + "1. [Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)\n", + "1. [Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", + "1. 
[Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - " \n", - "\n", - "## DataSets and Exploratory Visualizations\n", - "[overview](#overview-0)\n", + "## Datasets and Exploratory Visualizations\n", "\n", - "The dataset is synthetically generated and consists of customers and claims datasets.\n", + "The dataset is synthetically generated and consists of customers and claims datasets.\n", "Here we will load them and do some exploratory visualizations." ] }, @@ -201,20 +79,9 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAEaCAYAAAAboUz3AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAWW0lEQVR4nO3dfXBUd73H8c/uCQGE0CQ7u+lGKxEqsEpwxirOSEF56F3UzSQFa5z0wQ4ax4cZtFO1kbnNgzqFtbSOxaKS3rbG6MiklKZsY3GoYw3MkE5tR6ILsWIQYpckbpohqZfCPdn7R8fYbSC7Ibs5yY/3669N9pvMB/bw4eSbs7uuRCKREADAOG6nAwAAsoOCBwBDUfAAYCgKHgAMRcEDgKEoeAAwFAUPAIbKcTrAW7322usaGeGy/EzweOYrHh92OgYwBsdm5rjdLhUUzLvs/dOq4EdGEhR8BvF3iemKY3NqsKIBAENR8ABgKAoeAAyV1g6+u7tbNTU1GhwcVH5+vsLhsEpKSpJmvvWtb6mrq2v0466uLj388MNav359RgMDANLjSufVJO+44w5t3rxZ5eXlam1t1b59+9TU1HTZ+RMnTuhzn/uc2tvblZubm3aYeHyYX75kiNebp/7+IadjAGNwbGaO2+2SxzP/8ven+gbxeFzRaFShUEiSFAqFFI1GNTAwcNmveeKJJ1RWVjahcgcAZFbKgo/FYioqKpJlWZIky7Lk8/kUi8UuOX/hwgUdOHBAmzdvzmxSAMCEZPw6+EOHDqm4uFiBQGDCXzvejxrTxYWLtnJnWU7HSIvXm+d0hJRm0t8nMmcmHJsmSFnwfr9fvb29sm1blmXJtm319fXJ7/dfcn7fvn1XfPY+E3bwXm+eyu5udTqGMQ48UM4+9irDDj5zJr2D93g8CgQCikQikqRIJKJAIKDCwsIxs2fPntUf/vCH0X09AMA5aV0HX19fr+bmZgWDQTU3N6uhoUGSVF1drc7OztG5/fv3a+3atcrPz89OWgBA2tK6THKqsKK5+rCiufqwosmcSa9oAAAzEwUPAIai4AHAUBQ8ABiKggcAQ1HwAGAoCh4ADEXBA4ChKHgAMBQFDwCGouABwFAUPAAYioIHAENR8ABgKAoeAAxFwQOAoSh4ADAUBQ8AhqLgAcBQFDwAGCqtgu/u7lZlZaWCwaAqKyt16tSpS861t
bWprKxMoVBIZWVl+uc//5nJrACACchJZ6iurk5VVVUqLy9Xa2uramtr1dTUlDTT2dmpH/3oR/rZz34mr9eroaEh5ebmZiU0ACC1lGfw8Xhc0WhUoVBIkhQKhRSNRjUwMJA09/jjj2vLli3yer2SpLy8PM2ePTsLkQEA6UhZ8LFYTEVFRbIsS5JkWZZ8Pp9isVjS3MmTJ3XmzBndeuutuvnmm7V7924lEonspAYApJTWiiYdtm2rq6tLjz32mC5cuKAvfOELKi4uVkVFRdrfw+OZn6k4mEG83jynI2CK8ZhPjZQF7/f71dvbK9u2ZVmWbNtWX1+f/H5/0lxxcbE2btyo3Nxc5ebmav369Tp27NiECj4eH9bIyPQ+6+fAzLz+/iGnI2AKeb15POYZ4na7xj0xTrmi8Xg8CgQCikQikqRIJKJAIKDCwsKkuVAopMOHDyuRSOjixYs6evSoli1bNsn4AIArldZlkvX19WpublYwGFRzc7MaGhokSdXV1ers7JQkfepTn5LH49EnP/lJVVRU6Prrr9enP/3p7CUHAIzLlZhGvwmdKSuasrtbnY5hjAMPlPPj+lWGFU3mTHpFAwCYmSh4ADAUBQ8AhqLgAcBQFDwAGIqCBwBDUfAAYCgKHgAMRcEDgKEoeAAwFAUPAIai4AHAUBQ8ABiKggcAQ1HwAGAoCh4ADEXBA4ChKHgAMBQFDwCGouABwFAUPAAYKiedoe7ubtXU1GhwcFD5+fkKh8MqKSlJmtm1a5d++ctfyufzSZI++MEPqq6uLuOBAQDpSavg6+rqVFVVpfLycrW2tqq2tlZNTU1j5ioqKnTPPfdkPCQAYOJSrmji8bii0ahCoZAkKRQKKRqNamBgIOvhAABXLmXBx2IxFRUVybIsSZJlWfL5fIrFYmNmn3nmGZWVlWnLli16+eWXM58WAJC2tFY06fjsZz+rL33pS5o1a5aOHDmir3zlK2pra1NBQUHa38PjmZ+pOJhBvN48pyNgivGYT42UBe/3+9Xb2yvbtmVZlmzbVl9fn/x+f9Kc1+sdvb1q1Sr5/X698sorWrlyZdph4vFhjYwkJhB/6nFgZl5//5DTETCFvN48HvMMcbtd454Yp1zReDweBQIBRSIRSVIkElEgEFBhYWHSXG9v7+jt48eP6x//+Ife8573XGluAMAkpbWiqa+vV01NjXbv3q0FCxYoHA5Lkqqrq7V161aVlpbqwQcf1J///Ge53W7NmjVL3//+95PO6gEAU8uVSCSmzU5kpqxoyu5udTqGMQ48UM6P61cZVjSZM+kVDQBgZqLgAcBQFDwAGIqCBwBDUfAAYCgKHgAMRcEDgKEoeAAwFAUPAIai4AHAUBQ8ABiKggcAQ1HwAGAoCh4ADEXBA4ChKHgAMBQFDwCGouABwFAUPAAYioIHAENR8ABgqLQKvru7W5WVlQoGg6qsrNSpU6cuO/u3v/1NH/jABxQOhzOVEQBwBdIq+Lq6OlVVVengwYOqqqpSbW3tJeds21ZdXZ02bNiQ0ZAAgIlLWfDxeFzRaFShUEiSFAqFFI1GNTAwMGZ2z549+vjHP66SkpKMBwUATEzKgo/FYioqKpJlWZIky7Lk8/kUi8WS5k6cOKHDhw/rzjvvzEpQAMDE5GTim1y8eFH33nuvtm/fPvofwZXweOZnIg5mGK83z+kImGI85lMjZcH7/X719vbKtm1ZliXbttXX1ye/3z8609/fr9OnT+uLX/yiJOncuXNKJBIaHh7Wd7/73bTDxOPDGhlJXMEfY+pwYGZef/+Q0xEwhbzePB7zDHG7XeOeGKcseI/Ho0AgoEgkovLyckUiEQUCARUWFo7OFBcXq6OjY/TjXbt26V//+pfuueeeScYHAFyptK6iqa+vV3Nzs4LBoJqbm9XQ0CBJqq6uVmdnZ1YDAgCujCuRSEybnchMWdGU3d3qdAxjHHignB/XrzKsaDIn1YqGZ7ICgKEoeAAwFAUPAIai4AHAUBQ8ABiKg
gcAQ1HwAGAoCh4ADEXBA4ChKHgAMBQFDwCGouABwFAUPAAYioIHAENR8ABgKAoeAAxFwQOAoSh4ADAUBQ8AhqLgAcBQOekMdXd3q6amRoODg8rPz1c4HFZJSUnSzL59+/T444/L7XZrZGREt9xyi+64445sZAYApCGtgq+rq1NVVZXKy8vV2tqq2tpaNTU1Jc0Eg0Ft2rRJLpdLw8PDKisr08qVK7Vs2bKsBAcAjC/liiYejysajSoUCkmSQqGQotGoBgYGkubmz58vl8slSTp//rwuXrw4+jEAYOqlPIOPxWIqKiqSZVmSJMuy5PP5FIvFVFhYmDT73HPP6cEHH9Tp06d19913a+nSpdlJDWCMvAVzNWd2Wj+UO87rzXM6Qkrn3/g/DZ37X6djTEpGj4b169dr/fr1evXVV/XVr35Va9as0aJFi9L+eo9nfibjYIaYCf/YZ4qyu1udjmCMAw+Ua84MPzZTFrzf71dvb69s25ZlWbJtW319ffL7/Zf9muLiYpWWlup3v/vdhAo+Hh/WyEgi7XknUEaZ198/5HQEI3BsZt50Pzbdbte4J8Ypd/Aej0eBQECRSESSFIlEFAgExqxnTp48OXp7YGBAHR0dWrJkyZXmBgBMUlormvr6etXU1Gj37t1asGCBwuGwJKm6ulpbt25VaWmp9u7dqyNHjignJ0eJREK33XabbrzxxqyGBwBcXloFv3jxYrW0tIz5fGNj4+jtbdu2ZS4VAGDSeCYrABiKggcAQ1HwAGAoCh4ADEXBA4ChKHgAMBQFDwCGouABwFAUPAAYioIHAENR8ABgKAoeAAxFwQOAoSh4ADAUBQ8AhqLgAcBQFDwAGIqCBwBDUfAAYCgKHgAMRcEDgKFy0hnq7u5WTU2NBgcHlZ+fr3A4rJKSkqSZhx9+WG1tbbIsSzk5Obrrrru0evXqbGQGAKQhrYKvq6tTVVWVysvL1draqtraWjU1NSXNrFixQlu2bNHcuXN14sQJ3XbbbTp8+LDmzJmTleAAgPGlXNHE43FFo1GFQiFJUigUUjQa1cDAQNLc6tWrNXfuXEnS0qVLlUgkNDg4mIXIAIB0pCz4WCymoqIiWZYlSbIsSz6fT7FY7LJf89RTT+nd7363rr322swlBQBMSFormol44YUX9MMf/lCPPvrohL/W45mf6TiYAbzePKcjAJc004/NlAXv9/vV29sr27ZlWZZs21ZfX5/8fv+Y2Zdfflnf/OY3tXv3bi1atGjCYeLxYY2MJCb8dVNppj/g01F//5DTEYzAsZl50/3YdLtd454Yp1zReDweBQIBRSIRSVIkElEgEFBhYWHS3LFjx3TXXXfpoYce0vvf//5JxgYATFZa18HX19erublZwWBQzc3NamhokCRVV1ers7NTktTQ0KDz58+rtrZW5eXlKi8vV1dXV/aSAwDGldYOfvHixWppaRnz+cbGxtHb+/bty1wqAMCk8UxWADAUBQ8AhqLgAcBQFDwAGIqCBwBDUfAAYCgKHgAMRcEDgKEoeAAwFAUPAIai4AHAUBQ8ABiKggcAQ1HwAGAoCh4ADEXBA4ChKHgAMBQFDwCGouABwFAUPAAYioIHAEOlVfDd3d2qrKxUMBhUZWWlTp06NWbm8OHD2rRpk5YvX65wOJzpnACACUqr4Ovq6lRVVaWDBw+qqqpKtbW1Y2auu+46fe9739PnP//5jIcEAExcyoKPx+OKRqMKhUKSpFAopGg0qoGBgaS5hQsX6n3ve59ycnKykxQAMCEpCz4Wi6moqEiWZUmSLMuSz+dTLBbLejgAwJWbVqfbHs98pyPAAV5vntMRgEua6cdmyoL3+/3q7e2VbduyLEu2bauvr09+vz/jYeLxYY2MJDL+fTNppj/g01F//5DTEYzAsZl50/3YdLtd454Yp1zReDweBQIBRSIRSVIkElEgEFBhYWHmUgIAMi6tq2jq6+vV3NysYDCo5uZmNTQ0SJKqq6vV2dkpSXrxxRe1Zs0aP
fbYY/rVr36lNWvWqL29PXvJAQDjSmsHv3jxYrW0tIz5fGNj4+jtD33oQ/r973+fuWQAgEnhmawAYCgKHgAMRcEDgKEoeAAwFAUPAIai4AHAUBQ8ABiKggcAQ1HwAGAoCh4ADEXBA4ChKHgAMBQFDwCGouABwFAUPAAYioIHAENR8ABgKAoeAAxFwQOAoSh4ADAUBQ8Ahkqr4Lu7u1VZWalgMKjKykqdOnVqzIxt22poaNCGDRt00003qaWlJdNZAQATkFbB19XVqaqqSgcPHlRVVZVqa2vHzBw4cECnT5/Wb37zG+3du1e7du1ST09PxgMDANKTsuDj8bii0ahCoZAkKRQKKRqNamBgIGmura1Nt9xyi9xutwoLC7VhwwY9++yz2UkNAEgpJ9VALBZTUVGRLMuSJFmWJZ/Pp1gspsLCwqS54uLi0Y/9fr/Onj07oTBut2tC807xFcx1OoJRZsrjPhNwbGbWdD82U+VLWfBTqaBgntMR0vI///1fTkcwiscz3+kIxuDYzKyZfmymXNH4/X719vbKtm1Jb/4yta+vT36/f8zcq6++OvpxLBbTtddem+G4AIB0pSx4j8ejQCCgSCQiSYpEIgoEAknrGUnauHGjWlpaNDIyooGBAR06dEjBYDA7qQEAKbkSiUQi1dDJkydVU1Ojc+fOacGCBQqHw1q0aJGqq6u1detWlZaWyrZtfec739GRI0ckSdXV1aqsrMz6HwAAcGlpFTwAYObhmawAYCgKHgAMRcEDgKEoeAAwFAUPAIai4AHAUBQ8ABiKgjdIIpFQS0uL7r//fklST0+PXnrpJYdTAf/x9lehRXZR8AbZvn27jh49queee06SNG/ePN13330OpwKkP/7xj1q7dq1uvvlmSVJnZ6fuvfdeh1OZj4I3SEdHh3bu3Kk5c+ZIkgoKCvTGG284nAp48+SjsbFRBQUFkqTS0lJ+upwCFLxBZs+eLZfrP68PPTIy4mAa4D8uXryo66+/Pulzs2bNcijN1WNavR48JmfJkiV6+umnlUgk1NPToz179uiGG25wOhag3Nxcvf7666MnIH/96181e/Zsh1OZjxcbM8jw8LB27Nih3/72t5KkdevW6dvf/rbmzZsZb6QCcz3//PP68Y9/rDNnzmj16tVqb2/X/fffr49+9KNORzMaBQ9gSpw5c0bt7e1KJBK68cYbtXDhQqcjGY+CN8Dzzz8/7v0f+9jHpigJgOmEHbwBHnnkkcve53K5KHg4ZvPmzUm/+H+7J554YgrTXH04gweQNS+88MK4969cuXKKklydKHjDDA0Nqbu7O+n69w9/+MMOJgLgFFY0Bmlra1M4HNa5c+fk8/l0+vRpLVu2TPv373c6Gq5yQ0NDamxs1PHjx5NOPpqamhxMZT6e6GSQn/zkJ3ryySe1cOFCHTx4UI888ohWrFjhdCxA27Ztk9vt1qlTp/SZz3xGlmVxbE4BCt4gOTk58ng8sm1bkrRq1Sp1dXU5nAqQ/v73v+vrX/+65syZo1AopJ/+9Kf605/+5HQs47GiMUhubq4SiYQWLlyon//853rnO9+p1157zelYgHJzcyW9+fIEg4ODuuaaa3T27FmHU5mPgjfI1772NQ0PD+sb3/iG6uvrNTQ0pPr6eqdjASopKdHg4KDKyspUWVmpvLw8BQIBp2MZj6toDPCLX/xi3PtvvfXWKUoCpPbiiy9qaGhIa9askWVZTscxGgVvgGXLlmn58uV673vfe8n7t2/fPsWJgEu7cOHC6O+IJGnu3LkOpjEfBW+Affv26amnntL58+dVUVGhUCika665xulYwKhnn31W27dvV19fn6Q3333M5XLp+PHjDiczGwVvkJ6eHu3fv1+//vWvtWTJEn35y1/W0qVLnY4FaP369frBD36g5cuXy+3m4r2pwt+0Qd71rnfpzjvv1O23366Ojg4dO3bM6UiAJMnr9WrFihWU+xTjDN4AiURC7e3tevLJJ/WXv/xFn/jEJ1RRU
aHrrrvO6WiAJOmZZ57RK6+8optuuinpjT7e/i5PyCwK3gCrV6+W1+vVpk2b9JGPfGTMq/fxjwhOe/TRR/XQQw8pPz9/9Cze5XKNvkE8soOCN8C6detGb7tcLr31IeUfEaaDtWvXau/evfL5fE5HuarwRCcD/Pst+oDpqri4mHJ3AGfwALIuHA6rt7dXGzduTNrB82Y02UXBA8i622+/fcznXC4XLxecZRQ8ABiKi1IBZF0ikVBLS4t27twp6c0n5b300ksOpzIfBQ8g67Zv366jR4/q0KFDkqR58+bpvvvucziV+Sh4AFnX0dGhnTt3as6cOZKkgoKCpLfuQ3ZQ8ACybvbs2UlPwBsZGXEwzdWD6+ABZN2SJUv09NNPK5FIqKenR3v27NENN9zgdCzjcRUNgKwbHh7Wjh07Rp+Ut27dOm3btk3veMc7HE5mNgoeQNbs2LFDNTU1kqQjR45o1apVDie6urCDB5A1HR0do7f/fYkkpg4FDyBr3rogYFkw9fglK4CsuXDhgk6ePKlEIpF0+994KevsYgcPIGve+lLWb8dLWWcfBQ8AhmIHDwCGouABwFAUPAAYioIHAENR8ABgqP8HjhpeEnC9pxEAAAAASUVORK5CYII=\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# plot the bar graph customer gender\n", "df_customers.customer_gender_female.value_counts(normalize=True).plot.bar()\n", @@ -230,20 +97,9 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAEnCAYAAACjRViEAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAVBklEQVR4nO3de2zV9f3H8dc5hzut0nPWltMxUkHnTuaFOf+Y2VCnyCHzdMVF2q2MBI2HxTnNiGPr3NZSILrqjHHVjGl+MEiZM0S06QlriS7LxGyYXa0rNg7LGuC0xVMqbbm1p+f3Bz/PjwbhnMJpvz3vPh+JyTmnH46vcr68+PT9/XKOK5FIJAQAMMftdAAAwNig4AHAKAoeAIyi4AHAKAoeAIyi4AHAqClOBzjXsWMDGh7mqs1M8PlyFIv1Ox0DOA/HZua43S7l5c2+4NdTFnxtba2am5t1+PBhNTY26rOf/ex5a+LxuDZt2qQ333xTLpdLa9as0YoVK0Yddng4QcFnEL+XmKg4NsdHyhHNnXfeqR07dujTn/70Bdc0Njaqo6NDe/bs0csvv6y6ujodOnQoo0EBAKOTsuBvvvlm+f3+i67ZvXu3VqxYIbfbLa/XqyVLlqipqSljIQEAo5eRGXw0GlVRUVHyvt/vV2dn56ifx+fLyUQc/J/8/FynIwCfiGNzfEyok6yxWD+zuQzJz8/V0aN9TscAzsOxmTlut+uiG+OMXCbp9/t15MiR5P1oNKq5c+dm4qkBAJcoIwW/bNky7dy5U8PDw+rp6dHrr7+uYDCYiacGAFyilAW/adMm3Xrrrers7NR9992nu+++W5IUDofV0tIiSSotLdW8efO0dOlSlZWV6aGHHtJnPvOZsU0OALgo10R6P3hm8JnDnBMTFcdm5qSawU+ok6zZIPeKmZoxPTt+27LhSoVTp4fUd/yk0zEAk7KjqSaQGdOnqOTRBqdjmNH4dKnYywFjgzcbAwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjKHgAMIqCBwCjpqSzqL29XZWVlert7dWcOXNUW1ur4uLiEWtisZh+/OMfKxqNanBwUF/60pf005/+VFOmpPW/AABkWFo7+OrqalVUVKi5uVkVFRWqqqo6b83mzZu1cOFCNTY2qrGxUf/+97+1Z8+ejAcGAKQnZcHHYjG1trYqFApJkkKhkFpbW9XT0zNincvl0sDAgIaHh3XmzBkNDg6qsLBwbFIDAFJKOT+JRqMqLCyUx+ORJHk8HhUUFCgajcrr9SbXffe739XDDz+sr3zlKzp58qRWrlypL37xi6MK4/PljDI+LMjPz3U6AsYZr/n4yNiAvKmpSddee622bdumgYEBhcNhNTU1admyZWk/RyzWr+HhRKYijQkOzMw7erTP6QgYR/n5ubzmGeJ2uy66MU45ovH7/erq6lI8HpckxeNxdXd3y+/3j1hXX1+vr3
[... base64-encoded PNG image data elided ...]
jg5ZJcJjm22MEbNXv2bHV0dDgdA5B09nLJj50+fVrNzc1auHChg4kmBwreiHNn8IlEQu+++66uuuoqBxMB/+/jk6sf+8Y3vsFlkuOAgjfi3Bm8x+PRN7/5TS1dutTBRMCFuVwuHTp0yOkY5lHwRnzve99zOgJwQefO4BOJhNra2nTLLbc4nMo+TrIaMTQ0pFdeeUX79+/X6dOnk48/8cQTDqYCznr11VeTtz0ej+bPn69FixY5mGhyYAdvRFVVleLxuPbt26dvfetbikQiuvnmm52OBUg6fwaP8UHBG9HS0qLGxkaVlJToO9/5jioqKvT973/f6ViAJKmvr08vvvjieT9hbt++3cFU9vFeNEZMnz5d0tkff0+ePKnc3Fx1d3c7nAo467HHHpPb7dbBgwdVVlYmj8ejG264welY5rGDN+LKK6/URx99pMWLFyscDisvL0+f+tSnnI4FSJL++9//qq6uTm+88YZCoZCWLl2qNWvWOB3LPAreiBdeeEEej0dr165VY2Oj+vr6tHz5cqdjAZKUfCfJqVOnqre3V1deeaU6OzsdTmUfBW9APB7XQw89pM2bN8vtdqu0tNTpSMAIxcXF6u3tVUlJicrLy5Wbm6tAIOB0LPO4TNKI1atXa8uWLXK7Oa2Cie2vf/2r+vr6dOutt8rj8TgdxzQK3ohnnnlG77//vkKhkGbPnp18/LbbbnMwFXD2J8yysjK98sorTkeZdBjRGPH3v/9dkvTSSy8lH3O5XBQ8HOfxeJSXl6fTp08nr/bC+GAHD2DMbdq0Sf/85z8VDAZHvG/SypUrHUxlHwPbLLdly5bk7ffff9/BJMCFDQwM6JprrtEHH3ygd999N/kfxhY7+Cx3zz33JN/n49zbwETw85//XJWVlZKkt956S1/+8pcdTjS5sIPPcuf+/czf1Zho9u3bl7z9i1/8wsEkkxMnWbNcIpHQqVOnlEgkRtz+2MyZMx1Mh8mODYizKPgs19bWpi984QvJPzyLFi2Sy+VSIpGQy+XS/v37HU6IyezMmTM6cOCAEonEiNsfu/rqqx1MZx8zeABj5o477rjg11wul954441xTDP5UPAAYBQnWQHAKAoeAIyi4I3o7+9P6zEAkwcFb8SqVavSegzA5MFlklluaGhIg4ODGh4eHnENfF9fn06ePOlwOgBOouCz3ObNm/Xcc8/J5XJp0aJFycdzcnJ03333OZgMgNO4TNKIDRs2qKqqyukYACYQCt6QY8eO6V//+pdcLpduvPFGzZkzx+lIABzEiMaIN998U+vWrUt+zmVbW5ueeuop3r0PmMQoeCOeeeYZ7dixQwsXLpQkHThwQOvWraPggUmMyySNGBoaSpa7JC1cuFBDQ0MOJgLgNAreCK/Xq127diXvv/rqq/J6vQ4mAuA0TrIa0dHRoR/84Afav3+/XC6XAoGAnnrqKc2fP9/paAAcQsEbMzAwoEQioZycHKejAHAYJ1mz3H/+85+Lfp0PVAAmL3bwWe6TPlDB5XJpYGBAH330EZ/oBExi7OCz3B/+8IcR90+cOKGtW7fqt7/9rVavXu1MKAATAgVvxNDQkF566SW9+OKLuu2227Rr1y4VFhY6HQuAgyh4A1577TXV1dXp+uuv17Zt23TVVVc5HQnABMAMPsuVlJToxIkTevjhh3Xddded93VOsgKTFwWf5c49yepyuXTuy8mn1gOTGwUPAEbxVgUAYBQFDwBGUfAAYBQFDwBGUfAAYNT/Ai+/6Cos/gCjAAAAAElFTkSuQmCC\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# plot the bar graph of fraudulent claims\n", "df_claims.fraud.value_counts(normalize=True).plot.bar()\n", @@ -259,20 +115,9 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# plot the education categories\n", "educ = df_customers.customer_education.value_counts(normalize=True, sort=False)\n", @@ -282,20 +127,9 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# plot the total claim amounts\n", "plt.hist(df_claims.total_claim_amount, bins=30)\n", @@ -311,30 +145,9 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Text(0.5, 0, 'Number of claims per year')" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# plot the number of claims filed in the past year\n", "df_customers.num_claims_past_year.hist(density=True)\n", @@ -351,20 +164,9 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "sns.pairplot(\n", " data=df_customers, vars=[\"num_insurers_past_5_years\", \"months_as_customer\", \"customer_age\"]\n", @@ -382,20 +184,9 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "df_combined = df_customers.join(df_claims)\n", "sns.lineplot(x=\"num_insurers_past_5_years\", y=\"fraud\", data=df_combined);" @@ -410,40 +201,18 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "sns.boxplot(x=df_customers[\"months_as_customer\"]);" ] }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAV0AAAEMCAYAAABnWmXlAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAARgklEQVR4nO3de1BU9f/H8ZeAoOQ3QMPE9KeTZUM5JWFCIZDodDElq8mM6quNTXSzLE0tM8tbYpaWl7GmbOY702UqtUzLmUa7qJNEmZUzjfk1BzCQBAG5ru7u5/eHw458s7zkvtfF5+MvYQ/nvD/s4TnLUc+2c845AQBMRIR6AAA4mxBdADBEdAHAENEFAENEFwAMEV0AMER0AcBQ1PE2qK5ukN8f3H/K26VLJ1VV1Qf1GNba4pqktrmutrgmqW2uKxzWFBHRTgkJ5/zl48eNrt/vgh7dluO0NW1xTVLbXFdbXJPUNtcV7mvi8gIAGCK6AGCI6AKAIaILAIaILgAYIroAYIjoAoAhogsAhoguABgiugBgiOgCgCGiCwCGiC4AGCK6AGCI6AKAIaILAIaILgAYIroAYOi4b9eDM8c77/xHpaXFIZ2hfftIHT7sC/pxamtrJElxcfFBP5bVmv5Kz569lJf375AdH7aIbhgpLS3Wzl3/VWSH4Ico1HzNR6K7/6A3xJMEV8s6cfYgumEmskO8YnsNCfUYQddYvEGS2vxaW9aJswfXdAHAENEFAENEFwAMEV0AMER0AcAQ0QUAQ0QXAAwRXQAwRHQBwBDRBQBDRBcADBFdADBEdAHAENEFAENEFwAMEV0AMER0AcAQ0QUAQ0QXAAwRXQAwRHQBwBDRBQBDRBcADBFdADBEdAHAENEFAENEFwAMEV0AMER0AcAQ0QUAQ0QXAAwRXQAwRHQBwBDRBQBDRBcADBFdADBEdAHAENEFAENEFwAMEV0AMBSU6G7Z8rW2bPk6GLsGgKALZsOigrHTzZu/kiRlZGQFY/cAEFTBbBiXFwDAENEFAENEFwAMEV0AMER0AcAQ0QUAQ0QXAAwRXQAwRHQBwBDRBQBDRBcADBFdADBEdAHAENEFAENEFwAMEV0AMER0AcAQ0QUAQ0QXAAwRXQAwRHQBwBDRBQBDRBcADBFdADBEdAHAENEFAENEFwAMEV0AMER0AcAQ0QUAQ0QXAAwRXQAwRHQBwBDRBQBDRBcADBFdADBEdAHAENEFAENEFwAMEV0AMER0AcAQ0QUAQ1HB2GltbY1qa2tVUDDrhLZv3z5Shw/7gjFKyARjTSUlxfL7Ik/rPhFafm+zSkqKT/hn5WTwc3XqSkqKFRcXF5R980oXAAwF5ZVuXFy84uLiNWXK9BPaPjHxX9q/vy4Yo4RMMNZUUDBL/y2tPK37RGhFRHXQ//U874R/Vk4GP1enLhi/ebTglS4AGCK6AGCI6AKAIaILAIaILgAYIroAYIjoAoAhogsAhoguABgiugBgiOgCgCGiCwCGiC4AGCK6AGCI6AKAIaILAIaILgAYIroAYIjoAoAhogsAhoguABgiugBgiOgCgCGiCwCGiC4AGCK6AGCI6AKAIaILAIaILgAYIroAYIjoAoAhogsAhoguABgiugBgiOgCgCGiCwCGiC4AGCK6AGCI6AKAIaILAIaILgAYigrGTgcNyg7GbgHARDAbFpT
[... base64-encoded PNG image data elided ...]
+nzwejwoLC1v9RRwgEV0cQ2RkpJYvX67i4mINHjxYWVlZ+uyzz5SRkaFhw4YpNzdXt956qwYPHhz4Gr/fr7feekuZmZkaOHCgioqKNGPGDElSenq6LrroIg0aNCjw6/b06dPVsWNHDR06VHl5eRo+fPgpvyXLk08+qV69emnUqFG68sorNXbsWO3Zs+ek9zNmzBh5PB6lp6frjjvuUGZmZqvHFyxYoO3btystLU2LFi3SsGHDAjfiT0pK0rJly/Taa6/p6quvVnZ2tt58881WlzkAifvpAqdswoQJuvDCC/Xoo4+GehSEEV7pAifop59+UklJifx+v77++mtt2LBBQ4cODfVYCDP8RRrOKsuXL291LbpFamqq3njjjb/92srKSo0fP141NTXq1q2bnnvuOV166aXBGhVtFJcXAMAQlxcAwBDRBQBDRBcADBFdADBEdAHAENEFAEP/D8+CO4eUy0IYAAAAAElFTkSuQmCC\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "sns.boxplot(x=df_customers[\"customer_age\"]);" ] @@ -457,20 +226,9 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "df_combined.groupby(\"customer_gender_female\").mean()[\"fraud\"].plot.bar()\n", "plt.xticks([0, 1], [\"Male\", \"Female\"])\n", @@ -486,20 +244,9 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# Creating a correlation matrix of fraud, gender, months as customer, and number of different insurers\n", "cols = [\n", @@ -533,7 +280,7 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -544,230 +291,9 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table output omitted: 5 rows × 47 columns]
" - ], - "text/plain": [ - " policy_id incident_type_theft policy_state_ca policy_deductable \\\n", - "0 1675 0 0 750 \n", - "1 9 0 0 750 \n", - "2 1687 0 1 750 \n", - "3 1687 0 1 750 \n", - "4 1692 0 0 750 \n", - "\n", - " num_witnesses policy_state_or incident_month customer_gender_female \\\n", - "0 0 0 2 0 \n", - "1 0 0 9 0 \n", - "2 0 0 7 1 \n", - "3 0 0 7 0 \n", - "4 2 0 6 1 \n", - "\n", - " num_insurers_past_5_years customer_gender_male ... policy_state_id \\\n", - "0 1 0 ... 0 \n", - "1 1 1 ... 0 \n", - "2 1 0 ... 0 \n", - "3 1 1 ... 0 \n", - "4 1 0 ... 0 \n", - "\n", - " incident_hour vehicle_claim fraud incident_type_collision \\\n", - "0 20 12000.0 0 0 \n", - "1 15 18500.0 0 1 \n", - "2 16 17500.0 0 1 \n", - "3 16 17500.0 0 1 \n", - "4 8 21500.0 0 1 \n", - "\n", - " policy_annual_premium policy_state_az policy_state_wa \\\n", - "0 3000 1 0 \n", - "1 3000 0 0 \n", - "2 3000 0 0 \n", - "3 3000 0 0 \n", - "4 2800 1 0 \n", - "\n", - " collision_type_rear collision_type_front \n", - "0 0 0 \n", - "1 0 0 \n", - "2 0 0 \n", - "3 0 0 \n", - "4 0 1 \n", - "\n", - "[5 rows x 47 columns]" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "df_combined = df_combined.loc[:, ~df_combined.columns.str.contains(\"^Unnamed: 0\")]\n", "# get rid of an unwanted column\n", @@ -776,320 +302,9 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table output omitted: 8 rows × 47 columns]
" - ], - "text/plain": [ - " policy_id incident_type_theft policy_state_ca policy_deductable \\\n", - "count 20000.00000 20000.000000 20000.0000 20000.00000 \n", - "mean 2500.50000 0.048200 0.6204 751.13000 \n", - "std 1443.41173 0.214194 0.4853 13.57322 \n", - "min 1.00000 0.000000 0.0000 750.00000 \n", - "25% 1250.75000 0.000000 0.0000 750.00000 \n", - "50% 2500.50000 0.000000 1.0000 750.00000 \n", - "75% 3750.25000 0.000000 1.0000 750.00000 \n", - "max 5000.00000 1.000000 1.0000 1100.00000 \n", - "\n", - " num_witnesses policy_state_or incident_month customer_gender_female \\\n", - "count 20000.000000 20000.000000 20000.000000 20000.000000 \n", - "mean 0.866100 0.070000 6.713200 0.372400 \n", - "std 1.097921 0.255153 3.654396 0.483456 \n", - "min 0.000000 0.000000 1.000000 0.000000 \n", - "25% 0.000000 0.000000 3.000000 0.000000 \n", - "50% 0.000000 0.000000 7.000000 0.000000 \n", - "75% 2.000000 0.000000 10.000000 1.000000 \n", - "max 5.000000 1.000000 12.000000 1.000000 \n", - "\n", - " num_insurers_past_5_years customer_gender_male ... policy_state_id \\\n", - "count 20000.000000 20000.000000 ... 20000.00000 \n", - "mean 1.412200 0.576500 ... 0.02730 \n", - "std 0.897291 0.494125 ... 0.16296 \n", - "min 1.000000 0.000000 ... 0.00000 \n", - "25% 1.000000 0.000000 ... 0.00000 \n", - "50% 1.000000 1.000000 ... 0.00000 \n", - "75% 1.000000 1.000000 ... 0.00000 \n", - "max 5.000000 1.000000 ... 
1.00000 \n", - "\n", - " incident_hour vehicle_claim fraud incident_type_collision \\\n", - "count 20000.000000 20000.000000 20000.000000 20000.000000 \n", - "mean 11.786800 17426.083700 0.030000 0.857200 \n", - "std 5.337918 10043.773599 0.170591 0.349878 \n", - "min 0.000000 1000.000000 0.000000 0.000000 \n", - "25% 8.000000 10474.250000 0.000000 1.000000 \n", - "50% 12.000000 15000.000000 0.000000 1.000000 \n", - "75% 16.000000 22005.500000 0.000000 1.000000 \n", - "max 23.000000 51051.000000 1.000000 1.000000 \n", - "\n", - " policy_annual_premium policy_state_az policy_state_wa \\\n", - "count 20000.000000 20000.000000 20000.000000 \n", - "mean 2925.400000 0.113600 0.121000 \n", - "std 143.516096 0.317333 0.326135 \n", - "min 2150.000000 0.000000 0.000000 \n", - "25% 2900.000000 0.000000 0.000000 \n", - "50% 3000.000000 0.000000 0.000000 \n", - "75% 3000.000000 0.000000 0.000000 \n", - "max 3000.000000 1.000000 1.000000 \n", - "\n", - " collision_type_rear collision_type_front \n", - "count 20000.000000 20000.000000 \n", - "mean 0.220900 0.425400 \n", - "std 0.414864 0.494416 \n", - "min 0.000000 0.000000 \n", - "25% 0.000000 0.000000 \n", - "50% 0.000000 0.000000 \n", - "75% 0.000000 1.000000 \n", - "max 1.000000 1.000000 \n", - "\n", - "[8 rows x 47 columns]" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "df_combined.describe()" ] @@ -1103,523 +318,9 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table output omitted: 47 rows × 5 columns (feature, unique_values, percent_missing, percent_largest_category, datatype)]
" - ], - "text/plain": [ - " feature unique_values percent_missing \\\n", - "3 policy_deductable 8 0.0 \n", - "28 authorities_contacted_ambulance 2 0.0 \n", - "37 policy_state_id 2 0.0 \n", - "35 authorities_contacted_fire 2 0.0 \n", - "40 fraud 2 0.0 \n", - "36 driver_relationship_other 2 0.0 \n", - "16 driver_relationship_child 2 0.0 \n", - "27 policy_state_nv 2 0.0 \n", - "1 incident_type_theft 2 0.0 \n", - "23 num_claims_past_year 8 0.0 \n", - "5 policy_state_or 2 0.0 \n", - "17 driver_relationship_spouse 2 0.0 \n", - "33 incident_type_breakin 2 0.0 \n", - "43 policy_state_az 2 0.0 \n", - "44 policy_state_wa 2 0.0 \n", - "20 collision_type_na 2 0.0 \n", - "32 driver_relationship_na 2 0.0 \n", - "41 incident_type_collision 2 0.0 \n", - "13 collision_type_side 2 0.0 \n", - "45 collision_type_rear 2 0.0 \n", - "8 num_insurers_past_5_years 5 0.0 \n", - "34 authorities_contacted_none 2 0.0 \n", - "42 policy_annual_premium 18 0.0 \n", - "11 authorities_contacted_police 2 0.0 \n", - "22 driver_relationship_self 2 0.0 \n", - "29 num_injuries 5 0.0 \n", - "7 customer_gender_female 2 0.0 \n", - "2 policy_state_ca 2 0.0 \n", - "9 customer_gender_male 2 0.0 \n", - "46 collision_type_front 2 0.0 \n", - "31 police_report_available 2 0.0 \n", - "4 num_witnesses 6 0.0 \n", - "26 num_vehicles_involved 7 0.0 \n", - "15 customer_education 5 0.0 \n", - "21 incident_severity 3 0.0 \n", - "30 policy_liability 4 0.0 \n", - "18 injury_claim 890 0.0 \n", - "19 incident_dow 7 0.0 \n", - "25 auto_year 20 0.0 \n", - "6 incident_month 12 0.0 \n", - "38 incident_hour 24 0.0 \n", - "12 incident_day 31 0.0 \n", - "14 customer_age 58 0.0 \n", - "39 vehicle_claim 4621 0.0 \n", - "10 total_claim_amount 4978 0.0 \n", - "24 months_as_customer 387 0.0 \n", - "0 policy_id 5000 0.0 \n", - "\n", - " percent_largest_category datatype \n", - "3 98.94 int64 \n", - "28 97.45 int64 \n", - "37 97.27 int64 \n", - "35 97.20 int64 \n", - "40 97.00 int64 \n", - "36 96.06 int64 \n", - "16 95.49 int64 \n", - "27 
95.23 int64 \n", - "1 95.18 int64 \n", - "23 93.28 int64 \n", - "5 93.00 int64 \n", - "17 91.09 int64 \n", - "33 90.54 int64 \n", - "43 88.64 int64 \n", - "44 87.90 int64 \n", - "20 85.72 int64 \n", - "32 85.72 int64 \n", - "41 85.72 int64 \n", - "13 78.91 int64 \n", - "45 77.91 int64 \n", - "8 77.68 int64 \n", - "34 75.86 int64 \n", - "42 71.68 int64 \n", - "11 70.51 int64 \n", - "22 68.36 int64 \n", - "29 67.46 int64 \n", - "7 62.76 int64 \n", - "2 62.04 int64 \n", - "9 57.65 int64 \n", - "46 57.46 int64 \n", - "31 57.22 int64 \n", - "4 51.58 int64 \n", - "26 46.32 int64 \n", - "15 44.29 int64 \n", - "21 41.71 int64 \n", - "30 33.95 int64 \n", - "18 33.75 float64 \n", - "19 16.87 int64 \n", - "25 13.86 int64 \n", - "6 10.67 int64 \n", - "38 6.87 int64 \n", - "12 3.79 int64 \n", - "14 3.09 int64 \n", - "39 1.44 float64 \n", - "10 1.29 float64 \n", - "24 0.77 int64 \n", - "0 0.02 int64 " - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "combined_stats = []\n", "\n", @@ -1644,20 +345,9 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", @@ -1689,172 +379,6 @@ "\n", "plt.show()" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "## Solution Architecture\n", - "[overview](#overview-0)\n", - "\n", - "We will go through 5 stages of ML and explore the solution architecture of SageMaker. Each of the sequancial notebooks will dive deep into corresponding ML stage." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "### [Notebook 1](./1-data-prep-e2e.ipynb): Data Preparation, Ingest, Transform, Preprocess, and Store in SageMaker Feature Store\n", - "[overview](#nb0-solution)\n", - "\n", - "![Solution Architecture](images/e2e-1-pipeline-v3b.png)\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "### [Notebook 2](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb) and [Notebook 3](./3-mitigate-bias-train-model2-registry-e2e.ipynb) : Train, Tune, Check Pre- and Post- Training Bias, Mitigate Bias, Re-train, and Deposit the Best Model to SageMaker Model Registry\n", - "[overview](#nb0-solution)\n", - "\n", - "![Solution Architecture](images/e2e-2-pipeline-v3b.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "### [Notebooks 4](./4-deploy-run-inference-e2e.ipynb) : Load the Best Model from Registry, Deploy it to SageMaker Hosted Endpoint, and Make Predictions\n", - "[overview](#nb0-solution)\n", - "\n", - "![Solution Architecture](images/e2e-3-pipeline-v3b.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "### [Notebooks 5](./5-pipeline-e2e.ipynb): End-to-End Pipeline - MLOps Pipeline to run an end-to-end automated workflow with all the design decisions made during manual/exploratory steps in previous notebooks.\n", - 
"[overview](#nb0-solution) \n", - "\n", - "![Notebook5 Pipelines](images/e2e-5-pipeline-v3b.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "## Code Resources\n", - "\n", - "[overview](#nb0-solution)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Stages\n", - "\n", - "Our solution is split into the following stages of the [ML Lifecycle](#nb0-ml-lifecycle), and each stage has it's own notebook:\n", - "\n", - "* [Use-case and Architecture](./0-AutoClaimFraudDetection.ipynb): We take a high-level look at the use-case, solution components and architecture.\n", - "* [Data Prep and Store](./1-data-prep-e2e.ipynb): We prepare a dataset for machine learning using SageMaker DataWrangler, create and deposit the datasets in a SageMaker FeatureStore. [--> Architecture](#nb0-data-prep)\n", - "* [Train, Assess Bias, Establish Lineage, Register Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb): We detect possible pre-training and post-training bias, train and tune a XGBoost model using Amazon SageMaker, record Lineage in the Model Registry so we can later deploy it. [--> Architecture](#nb0-train-store)\n", - "* [Mitigate Bias, Re-train, Register New Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb): We mitigate bias, retrain a less biased model, store it in a Model Registry. [--> Architecture](#nb0-train-store)\n", - "* [Deploy and Serve](./4-deploy-run-inference-e2e.ipynb): We deploy the model to a Amazon SageMaker Hosted Endpoint and run realtime inference via the SageMaker Online Feature Store . [--> Architecture](#nb0-deploy-predict)\n", - "* [Create and Run an MLOps Pipeline](./5-pipeline-e2e.ipynb): We then create a SageMaker Pipeline that ties together everything we have done so far, from outputs from Data Wrangler, Feature Store, Clarify , Model Registry and finally deployment to a SageMaker Hosted Endpoint. 
[--> Architecture](#nb0-pipeline)\n", - "* [Conclusion](./6-conclusion-e2e.ipynb): We wrap things up and discuss how to clean up the solution." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "## The Exploratory Data Science and ML Ops Workflows\n", - "\n", - "[overview](#overview-0)\n", - "\n", - "### Exploratory Data Science and Scalable MLOps\n", - "\n", - "Note that there are typically two workflows: a manual exploratory workflow and an automated workflow. \n", - "\n", - "The *exploratory, manual data science workflow* is where experiments are conducted and various techniques and strategies are tested. \n", - "\n", - "After you have established your data prep, transformations, featurizations and training algorithms, testing of various hyperparameters for model tuning, you can start with the automated workflow where you *rely on MLOps or the ML Engineering part of your team* to streamline the process, make it more repeatable and scalable by putting it into an automated pipeline. \n", - "\n", - "![the 2 flows](images/2-flows.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## The ML Life Cycle: Detailed View\n", - "[overview](#overview-0)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "![title](images/ML-Lifecycle-v5.png)\n", - "\n", - "The Red Boxes and Icons represent comparatively newer concepts and tasks that are now deemed important to include and execute, in a production-oriented (versus research-oriented) and scalable ML lifecycle.\n", - "\n", - " These newer lifecycle tasks and their corresponding, supporting AWS Services and features include:\n", - "\n", - "1. [*Data Wrangling*](): AWS Data Wrangler for cleaning, normalizing, transforming and encoding data, as well as join ing datasets. 
The outputs of Data Wrangler are code generated to work with SageMaker Processing, SageMaker Pipelines, SageMaker Feature Store or just a plain old python script with pandas,\n", - " 1. Feature Engineering has always been done, but now with AWS Data Wrangler we can use a GUI based tool to do so and generate code for the next phases of the life-cycle.\n", - "2. [*Detect Bias*](): Using AWS Clarify, in Data Prep or in Training we can detect pre-training and post-training bias, and eventually at Inference time provide Interpretability / Explainability of the inferences (e.g., which factors were most influential in coming up with the prediction)\n", - "3. [*Feature Store [Offline]*](): Once we have done all of our feature engineering, the encoding and transformations, we can then standardize features, offline in AWS Feature Store, to be used as input features for training models.\n", - "4. [*Artifact Lineage*](): Using AWS SageMaker’s Artifact Lineage features we can associate all the artifacts (data, models, parameters, etc.) with a trained model to produce meta data that can be stored in a Model Registry.\n", - "5. [*Model Registry*](): AWS Model Registry stores the meta data around all artifacts that you have chosen to include in the process of creating your models, along with the model(s) themselves in a Model Registry. Later a human approval can be used to note that the model is good to be put into production. This feeds into the next phase of deploy and monitor .\n", - "6. [*Inference and the Online Feature Store*](): For realtime inference, we can leverage a online AWS Feature Store we have created to get us single digit millisecond low latency and high throughput for serving our model with new incoming data.\n", - "7. 
[*Pipelines*](): Once we have experimented and decided on the various options in the lifecycle (which transforms to apply to our features, imbalance or bias in the data, which algorithms to choose to train with, which hyper-parameters are giving us the best performance metrics, etc.) we can now automate the various tasks across the lifecycle using SageMaker Pipelines. \n", - " 1. In this blog, we will show a pipeline that starts with the outputs of AWS Data Wrangler and ends with storing trained models in the Model Registry. \n", - " 2. Typically, you could have a pipeline for data prep, one for training until model registry (which we are showing in the code associated with this blog) , one for inference, and one for re-training using SageMaker Monitor to detect model drift and data drift and trigger a re-training using , say an AWS Lambda function.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[overview](#overview-0)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "___\n", - "\n", - "### Next Notebook: [Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/end_to_end/fraud_detection/1-data-prep-e2e.ipynb b/end_to_end/fraud_detection/1-data-prep-e2e.ipynb index b8faea4e3d..55771cd401 100644 --- a/end_to_end/fraud_detection/1-data-prep-e2e.ipynb +++ b/end_to_end/fraud_detection/1-data-prep-e2e.ipynb @@ -4,27 +4,30 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Part 1 : Data Preparation, Process, and Store Features" + "# Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", + "## Background\n", "\n", - "## [Overview](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 0: Overview, Architecture and Data 
Exploration](./0-AutoClaimFraudDetection.ipynb)\n", - "* **[Notebook 1: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)**\n", - " * **[Architecture](#arch)**\n", - " * **[Getting started](#aud-getting-started)**\n", - " * **[DataSets](#aud-datasets)**\n", - " * **[SageMaker Feature Store](#aud-feature-store)**\n", - " * **[Create train and test datasets](#aud-dataset)**\n", - "* [Notebook 2: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", - "* [Notebook 3: Mitigate Bias, Train New Model, Store in Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", - "* [Notebook 4: Deploy Model, Run Predictions](./4-deploy-run-inference-e2e.ipynb)\n", - "* [Notebook 5: Create and Run an End-to-End Pipeline to Deploy the Model](./5-pipeline-e2e.ipynb)" + "This notebook is the second part of a series of notebooks that will demonstrate how to prepare, train, and deploy a model that detects fraudulent auto claims. In this notebook, we will be preparing, processing, and storing features using SageMaker Feature Store. You can choose to run this notebook by itself or in sequence with the other notebooks listed below. Please see the [README.md](README.md) for more information about this use case implemented by this series of notebooks. \n", + "\n", + "\n", + "1. [Fraud Detection for Automobile Claims: Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", + "1. **[Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)**\n", + "1. [Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", + "1. [Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", + "\n", + "\n", + "## Contents\n", + "1. 
[Architecture for Data Prep, Process and Store Features](#Architecture-for-Data-Prep,-Process-and-Store-Features)\n", + "1. [Getting Started: Creating Resources](#Getting-Started:-Creating-Resources)\n", + "1. [Datasets and Feature Types](#Datasets-and-Feature-Types)\n", + "1. [SageMaker Feature Store](#SageMaker-Feature-Store)\n", + "1. [Create Train and Test Datasets](#Create-Train-and-Test-Datasets)" ] }, { @@ -40,10 +43,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " \n", - "\n", "## Architecture for Data Prep, Process and Store Features\n", - "[overview](#all-up-overview)\n", "----\n", "![Data Prep and Store](./images/e2e-1-pipeline-v3b.png)" ] @@ -65,31 +65,6 @@ "!python -m pip install -q awswrangler==2.2.0 imbalanced-learn==0.7.0 sagemaker==2.41.0 boto3==1.17.70" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Loading stored variables\n", - "If you ran this notebook before, you may want to re-use the resources you aready created with AWS. Run the cell below to load any prevously created variables. You should see a print-out of the existing variables. If you don't see anything printed then it's probably the first time you are running the notebook! " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r\n", - "%store" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Important: You must have run the previous sequential notebooks to retrieve variables using the StoreMagic command.**" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -118,11 +93,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Getting started: Creating Resources\n", - "\n", - "[overview](#all-up-overview)\n", + "## Getting Started: Creating Resources\n", "----\n", "In order to successfully run this notebook you will need to create some AWS resources. 
\n", "First, an S3 bucket will be created to store all the data for this tutorial. \n", @@ -226,12 +197,12 @@ "metadata": {}, "outputs": [], "source": [ - "if 'bucket' not in locals():\n", + "if \"bucket\" not in locals():\n", " bucket = sagemaker_session.default_bucket()\n", - " prefix = 'fraud-detect-demo'\n", + " prefix = \"fraud-detect-demo\"\n", " %store bucket\n", " %store prefix\n", - " print(f'Creating bucket: {bucket}...')" + " print(f\"Creating bucket: {bucket}...\")" ] }, { @@ -360,10 +331,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## DataSets and Feature Types\n", - "[overview](#all-up-overview)\n", + "## Datasets and Feature Types\n", "----" ] }, @@ -437,7 +405,7 @@ "outputs": [], "source": [ "# ======> This is your DataFlow output path if you decide to redo the work in DataFlow on your own\n", - "#flow_output_path = \"YOUR_PATH_HERE\"\n", + "# flow_output_path = \n", "claims_flow_path = \"\"\n", "customers_flow_path = \"\"\n", "\n", @@ -446,9 +414,7 @@ " claims_s3_path = f\"{flow_output_path}/claims_output\"\n", " customers_s3_path = f\"{flow_output_path}/customers_output\"\n", "\n", - " claims_preprocessed = wr.s3.read_csv(\n", - " path=claims_s3_path, dataset=True, dtype=claims_dtypes\n", - " )\n", + " claims_preprocessed = wr.s3.read_csv(path=claims_s3_path, dataset=True, dtype=claims_dtypes)\n", "\n", " customers_preprocessed = wr.s3.read_csv(\n", " path=customers_s3_path, dataset=True, dtype=customers_dtypes\n", @@ -488,12 +454,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "## SageMaker Feature Store\n", - "\n", - "[overview](#all-up-overview)\n", "----\n", + "\n", "Amazon SageMaker Feature Store is a purpose-built repository where you can store and access features so it’s much easier to name, organize, and reuse them across teams. 
SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent. SageMaker Feature Store keeps track of the metadata of stored features (e.g. feature name or version number) so that you can query the features for the right attributes in batches or in real time using Amazon Athena, an interactive query service. SageMaker Feature Store also keeps features updated, because as new data is generated during inference, the single repository is updated so new features are always available for models to use during training and inference.\n", "\n", "A feature store consists of an offline component stored in S3 and an online component stored in a low-latency database. The online database is optional, but very useful if you need supplemental features to be available at inference. In this section, we will create feature groups for our Claims and Customers datasets. After inserting the claims and customer data into their respective feature groups, you need to query the offline store with Athena to build the training dataset.\n", @@ -674,14 +637,14 @@ "metadata": {}, "outputs": [], "source": [ - "if 'claims_table' not in locals():\n", - " claims_table = (\n", - " claims_feature_group.describe()[\"OfflineStoreConfig\"][\"DataCatalogConfig\"][\"TableName\"]\n", - " )\n", - "if 'customers_table' not in locals():\n", - " customers_table = (\n", - " customers_feature_group.describe()[\"OfflineStoreConfig\"][\"DataCatalogConfig\"][\"TableName\"]\n", - " )\n", + "if \"claims_table\" not in locals():\n", + " claims_table = claims_feature_group.describe()[\"OfflineStoreConfig\"][\"DataCatalogConfig\"][\n", + " \"TableName\"\n", + " ]\n", + "if \"customers_table\" not in locals():\n", + " customers_table = customers_feature_group.describe()[\"OfflineStoreConfig\"][\"DataCatalogConfig\"][\n", + " \"TableName\"\n", + " ]\n", "\n", 
"claims_feature_group_s3_prefix = (\n", " f\"{prefix}/{account_id}/sagemaker/{region}/offline-store/{claims_table}/data\"\n", @@ -690,6 +653,8 @@ " f\"{prefix}/{account_id}/sagemaker/{region}/offline-store/{customers_table}/data\"\n", ")\n", "\n", + "print(claims_feature_group_s3_prefix)\n", + "\n", "offline_store_contents = None\n", "while offline_store_contents is None:\n", " objects_in_bucket = s3_client.list_objects(\n", @@ -704,16 +669,24 @@ "print(\"\\nData available.\")" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "claims_feature_group.describe()[\"OfflineStoreConfig\"][\n", + " \"DataCatalogConfig\"\n", + "], customers_feature_group.describe()[\"OfflineStoreConfig\"][\"DataCatalogConfig\"]" + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Create train and test datasets\n", - "\n", - "[overview](#all-up-overview)\n", + "## Create Train and Test Datasets\n", "----\n", + "\n", "Once the data is available in the offline store, it will automatically be cataloged and loaded into an Athena table (this is done by default, but can be turned off). In order to build our training and test datasets, you will submit a SQL query to join the the Claims and Customers tables created in Athena." 
] }, @@ -771,7 +744,6 @@ "outputs": [], "source": [ "col_order = [\"fraud\"] + list(dataset.drop([\"fraud\", \"policy_id\"], axis=1).columns)\n", - "%store col_order\n", "\n", "train = dataset.sample(frac=0.80, random_state=0)[col_order]\n", "test = dataset.drop(train.index)[col_order]" @@ -800,484 +772,19 @@ "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "s3_client.upload_file(\n", - " Filename=\"data/train.csv\", Bucket=bucket, Key=f\"{prefix}/data/train/train.csv\"\n", - ")\n", - "s3_client.upload_file(Filename=\"data/test.csv\", Bucket=bucket, Key=f\"{prefix}/data/test/test.csv\")\n", - "%store train_data_uri\n", - "%store test_data_uri" - ] - }, - { - "cell_type": "code", - "execution_count": 107, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML rendering of the train.head(5) output, 5 rows × 46 columns; the markup was lost in extraction, and the same rows appear in the text/plain output that follows]
" - ], - "text/plain": [ - " fraud vehicle_claim driver_relationship_self num_witnesses \\\n", - "398 0 21500.0 1 5 \n", - "3833 1 16000.0 1 1 \n", - "4836 0 4000.0 1 2 \n", - "4572 0 19500.0 1 1 \n", - "636 0 9500.0 1 0 \n", - "\n", - " policy_deductable incident_day policy_state_nv policy_state_az \\\n", - "398 750 24 0 1 \n", - "3833 750 5 0 0 \n", - "4836 750 19 0 0 \n", - "4572 750 4 0 1 \n", - "636 750 22 0 0 \n", - "\n", - " auto_year policy_state_or ... authorities_contacted_police \\\n", - "398 2012 0 ... 1 \n", - "3833 2017 0 ... 1 \n", - "4836 2009 0 ... 0 \n", - "4572 2018 0 ... 1 \n", - "636 2012 0 ... 0 \n", - "\n", - " total_claim_amount incident_hour policy_state_ca injury_claim \\\n", - "398 23000.0 20 0 1500.0 \n", - "3833 16000.0 8 0 0.0 \n", - "4836 4000.0 5 1 0.0 \n", - "4572 19500.0 13 0 0.0 \n", - "636 9500.0 20 1 0.0 \n", - "\n", - " authorities_contacted_ambulance policy_annual_premium \\\n", - "398 0 2450 \n", - "3833 0 2600 \n", - "4836 0 2450 \n", - "4572 0 3000 \n", - "636 0 2750 \n", - "\n", - " customer_gender_male driver_relationship_other num_claims_past_year \n", - "398 1 0 0 \n", - "3833 0 0 0 \n", - "4836 1 0 0 \n", - "4572 1 0 0 \n", - "636 1 0 0 \n", - "\n", - "[5 rows x 46 columns]" - ] - }, - "execution_count": 107, - "metadata": {}, - "output_type": "execute_result" - } - ], "source": [ "train.head(5)" ] }, { "cell_type": "code", - "execution_count": 108, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML rendering of the test.head(5) output, 5 rows × 46 columns; the markup was lost in extraction, and the same rows appear in the text/plain output that follows]
" - ], - "text/plain": [ - " fraud vehicle_claim driver_relationship_self num_witnesses \\\n", - "0 0 8500.0 0 0 \n", - "7 0 16000.0 1 1 \n", - "21 0 7000.0 1 0 \n", - "24 0 17500.0 0 0 \n", - "25 0 17000.0 0 0 \n", - "\n", - " policy_deductable incident_day policy_state_nv policy_state_az \\\n", - "0 750 27 0 0 \n", - "7 750 2 0 0 \n", - "21 750 19 0 0 \n", - "24 750 1 0 0 \n", - "25 750 17 1 0 \n", - "\n", - " auto_year policy_state_or ... authorities_contacted_police \\\n", - "0 2014 0 ... 0 \n", - "7 2014 1 ... 1 \n", - "21 2014 0 ... 0 \n", - "24 2020 0 ... 0 \n", - "25 2018 0 ... 1 \n", - "\n", - " total_claim_amount incident_hour policy_state_ca injury_claim \\\n", - "0 8500.0 15 1 0.0 \n", - "7 41000.0 8 0 25000.0 \n", - "21 7000.0 9 1 0.0 \n", - "24 17500.0 4 1 0.0 \n", - "25 17000.0 18 0 0.0 \n", - "\n", - " authorities_contacted_ambulance policy_annual_premium \\\n", - "0 0 3000 \n", - "7 0 3000 \n", - "21 0 3000 \n", - "24 0 3000 \n", - "25 0 3000 \n", - "\n", - " customer_gender_male driver_relationship_other num_claims_past_year \n", - "0 1 0 0 \n", - "7 1 0 0 \n", - "21 1 0 0 \n", - "24 1 0 0 \n", - "25 1 0 0 \n", - "\n", - "[5 rows x 46 columns]" - ] - }, - "execution_count": 108, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "test.head(5)" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "----\n", - "\n", - "### Next Notebook: [Train, Check Bias, Tune, Record Lineage, Register Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)" - ] - }, { "cell_type": "code", "execution_count": null, @@ -1287,11 +794,10 @@ } ], "metadata": { - "instance_type": "ml.t3.medium", "kernelspec": { - "display_name": "Python 3 (Data Science)", + "display_name": "conda_python3", "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" + "name": "conda_python3" }, "language_info": { "codemirror_mode": { @@ -1303,7 +809,7 @@ "name": 
"python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" + "version": "3.6.13" } }, "nbformat": 4, diff --git a/end_to_end/fraud_detection/2-lineage-train-assess-bias-tune-registry-e2e.ipynb b/end_to_end/fraud_detection/2-lineage-train-assess-bias-tune-registry-e2e.ipynb index 1f276f0eaa..b5ca147647 100644 --- a/end_to_end/fraud_detection/2-lineage-train-assess-bias-tune-registry-e2e.ipynb +++ b/end_to_end/fraud_detection/2-lineage-train-assess-bias-tune-registry-e2e.ipynb @@ -4,46 +4,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Part 2: Train, Check Bias, Tune, Record Lineage, and Register a Model" + "# Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - " \n", + "## Background\n", + "\n", + "This notebook is the third part of a series of notebooks that will demonstrate how to prepare, train, and deploy a model that detects fradulent auto claims. In this notebook, we will show how you can assess pre-training and post-training bias with SageMaker Clarify, Train the Model using XGBoost on SageMaker, and then finally deposit it in the Model Registry, along with the Lineage of Artifacts that were created along the way: data, code and model metadata. You can choose to run this notebook by itself or in sequence with the other notebooks listed below. Please see the [README.md](README.md) for more information about this use case implemented by this series of notebooks. 
\n", "\n", - "## [Overview](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 0 : Overview, Architecture and Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 1: Data Prep, Process, Store Features](./1-data-prep-e2e.ipynb)\n", - "* **[Notebook 2: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)**\n", - " * **[Architecture](#train)**\n", - " * **[Train a model using XGBoost](#aud-train-model)**\n", - " * **[Model lineage with artifacts and associations](#model-lineage)**\n", - " * **[Evaluate the model for bias with Clarify](#check-bias)**\n", - " * **[Deposit Model and Lineage in SageMaker Model Registry](#model-registry)**\n", - "* [Notebook 3: Mitigate Bias, Train New Model, Store in Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", - "* [Notebook 4: Deploy Model, Run Predictions](./4-deploy-run-inference-e2e.ipynb)\n", - "* [Notebook 5 : Create and Run an End-to-End Pipeline to Deploy the Model](./5-pipeline-e2e.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this section we will show how you can assess pre-training and post-training bias with SageMaker Clarify, Train the Model using XGBoost on SageMaker, and then finally deposit it in the Model Registry, along with the Lineage of Artifacts that were created along the way: data, code and model metadata.\n", "\n", - "In this second model, you will fix the gender imbalance in the dataset using SMOTE and train another model using XGBoost. This model will also be saved to our registry and eventually approved for deployment." + "1. [Fraud Detection for Automobile Claims: Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", + "1. [Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)\n", + "1. 
**[Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)**\n", + "1. [Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", + "\n", + "## Contents\n", + "\n", + "1. [Architecture for the ML Lifecycle Stage: Train, Check Bias, Tune, Record Lineage, Register Model](#Architecture-for-the-ML-Lifecycle-Stage:-Train,-Check-Bias,-Tune,-Record-Lineage,-Register-Model)\n", + "1. [Train a Model using XGBoost](#Train-a-Model-using-XGBoost)\n", + "1. [Model Lineage with Artifacts and Associations](#Model-Lineage-with-Artifacts-and-Associations)\n", + "1. [Evaluate Model for Bias with Clarify](#Evaluate-Model-for-Bias-with-Clarify)\n", + "1. [Deposit Model and Lineage in SageMaker Model Registry](#Deposit-Model-and-Lineage-in-SageMaker-Model-Registry)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - " \n", - "\n", "## Architecture for the ML Lifecycle Stage: Train, Check Bias, Tune, Record Lineage, Register Model\n", - "[overview](#overview)\n", "----\n", "\n", "![train-assess-tune-register](./images/e2e-2-pipeline-v3b.png)" @@ -66,49 +57,6 @@ "!python -m pip install -q awswrangler==2.2.0 imbalanced-learn==0.7.0 sagemaker==2.41.0 boto3==1.17.70" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To apply the update to the current kernel, run the following code to refresh the kernel." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import IPython\n", - "\n", - "IPython.Application.instance().kernel.do_shutdown(True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Load stored variables\n", - "Run the cell below to load any prevously created variables. You should see a print-out of the existing variables. 
If you don't see anything you may need to create them again or it may be your first time running this notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r\n", - "%store" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Important: You must have run the previous sequential notebooks to retrieve variables using the StoreMagic command.**" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -184,6 +132,9 @@ "outputs": [], "source": [ "# variables used for parameterizing the notebook run\n", + "bucket = sagemaker_session.default_bucket()\n", + "prefix = \"fraud-detect-demo\"\n", + "\n", "estimator_output_path = f\"s3://{bucket}/{prefix}/training_jobs\"\n", "train_instance_count = 1\n", "train_instance_type = \"ml.m4.xlarge\"\n", @@ -206,12 +157,32 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", + "### Store Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_data_uri = f\"s3://{bucket}/{prefix}/data/train/train.csv\"\n", + "test_data_uri = f\"s3://{bucket}/{prefix}/data/test/test.csv\"\n", "\n", - "## Train a model using XGBoost\n", "\n", - "[overview](#overview)\n", + "s3_client.upload_file(\n", + " Filename=\"data/train.csv\", Bucket=bucket, Key=f\"{prefix}/data/train/train.csv\"\n", + ")\n", + "s3_client.upload_file(Filename=\"data/test.csv\", Bucket=bucket, Key=f\"{prefix}/data/test/test.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train a Model using XGBoost\n", "----\n", + "\n", "Once the training and test datasets have been persisted in S3, you can start training a model by defining which SageMaker Estimator you'd like to use. For this guide, you will use the [XGBoost Open Source Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/xgboost.html) to train your model. 
This estimator is accessed via the SageMaker SDK, but mirrors the open source version of the [XGBoost Python package](https://xgboost.readthedocs.io/en/latest/python/index.html). Any functionality provided by the XGBoost Python package can be implemented in your training script." ] }, @@ -234,8 +205,7 @@ " \"eta\": \"0.2\",\n", " \"objective\": \"binary:logistic\",\n", " \"num_round\": \"100\",\n", - "}\n", - "%store hyperparameters" + "}" ] }, { @@ -275,26 +245,22 @@ }, "outputs": [], "source": [ - "if 'training_job_1_name' not in locals():\n", - " \n", - " xgb_estimator.fit(inputs = {'train': train_data_uri})\n", + "if \"training_job_1_name\" not in locals():\n", + "\n", + " xgb_estimator.fit(inputs={\"train\": train_data_uri})\n", " training_job_1_name = xgb_estimator.latest_training_job.job_name\n", - " %store training_job_1_name\n", - " \n", + "\n", "else:\n", - " print(f'Using previous training job: {training_job_1_name}')" + " print(f\"Using previous training job: {training_job_1_name}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Model lineage with artifacts and associations\n", - "\n", - "[Overview](#aud-overview)\n", + "## Model Lineage with Artifacts and Associations\n", "----\n", + "\n", "Amazon SageMaker ML Lineage Tracking creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. With the tracking information you can reproduce the workflow steps, track model and dataset lineage, and establish model governance and audit standards. 
With SageMaker Lineage Tracking data scientists and model builders can do the following:\n", "* Keep a running history of model discovery experiments.\n", "* Establish model governance by tracking model lineage artifacts for auditing and compliance verification.\n", @@ -308,8 +274,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "### Register artifacts" ] }, @@ -444,8 +408,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "### Set artifact associations" ] }, @@ -521,12 +483,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Evaluate model for bias with Clarify\n", - "\n", - "[overview](#aud-overview)\n", + "## Evaluate Model for Bias with Clarify\n", "----\n", + "\n", "Amazon SageMaker Clarify helps improve your machine learning (ML) models by detecting potential bias and helping explain the predictions that models make. It helps you identify various types of bias in pretraining data and in posttraining that can emerge during model training or when the model is in production. SageMaker Clarify helps explain how these models make predictions using a feature attribution approach. It also monitors inferences models make in production for bias or feature attribution drift. The fairness and explainability functionality provided by SageMaker Clarify provides components that help AWS customers build less biased and more understandable machine learning models. It also provides tools to help you generate model governance reports which you can use to inform risk and compliance teams, and external regulators. \n", "\n", "You can reference the [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html) for more information about SageMaker Clarify." 
@@ -546,7 +505,6 @@ "outputs": [], "source": [ "model_1_name = f\"{prefix}-xgboost-pre-smote\"\n", - "%store model_1_name\n", "model_matches = sagemaker_boto_client.list_models(NameContains=model_1_name)[\"Models\"]\n", "\n", "if not model_matches:\n", @@ -566,8 +524,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "### Check for data set bias and model bias\n", "\n", "With SageMaker, we can check for pre-training and post-training bias. Pre-training metrics show pre-existing bias in that data, while post-training metrics show bias in the predictions from the model. Using the SageMaker SDK, we can specify which groups we want to check bias across and which metrics we'd like to show. \n", @@ -705,12 +661,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "## Deposit Model and Lineage in SageMaker Model Registry\n", - "\n", - "[overview](#aud-overview)\n", "----\n", + "\n", "Once a useful model has been trained and its artifacts properly associated, the next step is to save the model in a registry for future reference and possible deployment.\n" ] }, @@ -728,10 +681,9 @@ "metadata": {}, "outputs": [], "source": [ - "if 'mpg_name' not in locals():\n", + "if \"mpg_name\" not in locals():\n", " mpg_name = prefix\n", - " %store mpg_name\n", - " print(f'Model Package Group name: {mpg_name}')" + " print(f\"Model Package Group name: {mpg_name}\")" ] }, { @@ -911,37 +863,13 @@ "source": [ "sagemaker_boto_client.list_model_packages(ModelPackageGroupName=mpg_name)[\"ModelPackageSummaryList\"]" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "----\n", - "\n", - "### Next Notebook: [Mitigate Bias, Train New Model, Store in Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To handle the imbalance, in the next notebook, we over-sample (i.e. 
upsample) the minority class using [SMOTE (Synthetic Minority Over-sampling Technique)](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { - "instance_type": "ml.t3.medium", "kernelspec": { - "display_name": "Python 3 (Data Science)", + "display_name": "conda_python3", "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" + "name": "conda_python3" }, "language_info": { "codemirror_mode": { @@ -953,7 +881,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" + "version": "3.6.13" } }, "nbformat": 4, diff --git a/end_to_end/fraud_detection/3-mitigate-bias-train-model2-registry-e2e.ipynb b/end_to_end/fraud_detection/3-mitigate-bias-train-model2-registry-e2e.ipynb index 4e3d804f85..68e0a90fd3 100644 --- a/end_to_end/fraud_detection/3-mitigate-bias-train-model2-registry-e2e.ipynb +++ b/end_to_end/fraud_detection/3-mitigate-bias-train-model2-registry-e2e.ipynb @@ -4,35 +4,34 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Part 3 : Mitigate Bias, Train another unbiased Model and Put in the Model Registry" + "# Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", + "## Background\n", "\n", - "## [Overview](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 0 : Overview, Architecture and Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 1: Data Prep, Process, Store Features](./1-data-prep-e2e.ipynb)\n", - "* [Notebook 2: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", - "* **[Notebook 3: Mitigate Bias, Train New Model, Store in 
Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)**\n", - " * **[Architecture](#train2)**\n", - " * **[Develop a second model](#second-model)**\n", - " * **[Analyze the Second Model for Bias](#analyze-second-model)**\n", - " * **[View Results of Clarify Bias Detection Job](#view-second-clarify-job)**\n", - " * **[Configure and Run Clarify Explainability Job](#explainability)**\n", - " * **[Create Model Package for the Second Trained Model](#model-package)**\n", - "* [Notebook 4: Deploy Model, Run Predictions](./4-deploy-run-inference-e2e.ipynb)\n", - "* [Notebook 5 : Create and Run an End-to-End Pipeline to Deploy the Model](./5-pipeline-e2e.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this notebook, we will describe how to detect bias using Clarify, Mitigate it with SMOTE, train another model, put it in the Model Registry along with all the Lineage of the Artifacts created along the way: data, code and model metadata." + "This notebook is the fourth part of a series of notebooks that will demonstrate how to prepare, train, and deploy a model that detects fraudulent auto claims. In this notebook, we will describe how to detect bias using Clarify, mitigate it with SMOTE, train another model, put it in the Model Registry along with all the Lineage of the Artifacts created along the way: data, code and model metadata. You can choose to run this notebook by itself or in sequence with the other notebooks listed below. Please see the [README.md](README.md) for more information about this use case implemented by this series of notebooks. \n", + "\n", + "\n", + "1. [Fraud Detection for Automobile Claims: Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", + "1. [Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)\n", + "1. 
[Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", + "1. **[Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb)**\n", + "\n", + "\n", + "## Contents\n", + "1. [Architecture: Train, Check Bias, Tune, Record Lineage, Register Model](#Architecture:-Train,-Check-Bias,-Tune,-Record-Lineage,-Register-Model)\n", + "1. [Develop an Unbiased Model](#Develop-an-Unbiased-Model)\n", + "1. [Analyze Model for Bias and Explainability](#Analyze-Model-for-Bias-and-Explainability)\n", + "1. [View Results of Clarify Job](#View-Results-of-Clarify-Job)\n", + "1. [Configure and Run Explainability Job](#Configure-and-Run-Explainability-Job)\n", + "1. [Create Model Package for the Trained Model](#Create-Model-Package-for-the-Trained-Model)\n", + "1. [Architecture: Deploy and Serve Model](#Architecture:-Deploy-and-Serve-Model)\n", + "1. [Deploy an Approved Model](#Deploy-an-Approved-Model)\n", + "1. [Run Predictions on Claims](#Run-Predictions-on-Claims)" ] }, { @@ -49,7 +48,7 @@ "outputs": [], "source": [ "!python -m pip install -Uq pip\n", - "!python -m pip install -q awswrangler==2.2.0 imbalanced-learn==0.7.0 sagemaker==2.41.0 boto3==1.17.70" + "!python -m pip install -q awswrangler==2.2.0 imbalanced-learn==0.7.0 sagemaker==2.23.0 boto3==1.17.70" ] }, { @@ -81,31 +80,6 @@ "%matplotlib inline" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Load stored variables\n", - "Run the cell below to load any prevously created variables. You should see a print-out of the existing variables. If you don't see anything you may need to create them again or it may be your first time running this notebook." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r\n", - "%store" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Important: You must have run the previous sequential notebooks to retrieve variables using the StoreMagic command.**" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -140,7 +114,7 @@ "\n", "sagemaker_boto_client = boto_session.client(\"sagemaker\")\n", "\n", - "sagemaker_session = sagemaker.session.Session(\n", + "sagemaker_session = sagemaker.Session(\n", " boto_session=boto_session, sagemaker_client=sagemaker_boto_client\n", ")\n", "\n", @@ -156,6 +130,12 @@ "outputs": [], "source": [ "# variables used for parameterizing the notebook run\n", + "bucket = sagemaker_session.default_bucket()\n", + "prefix = \"fraud-detect-demo\"\n", + "\n", + "claims_fg_name = f\"{prefix}-claims\"\n", + "customers_fg_name = f\"{prefix}-customers\"\n", + "\n", "model_2_name = f\"{prefix}-xgboost-post-smote\"\n", "\n", "train_data_upsampled_s3_path = f\"s3://{bucket}/{prefix}/data/train/upsampled/train.csv\"\n", @@ -173,10 +153,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " \n", - "\n", - "## Architecture for this ML Lifecycle Stage : Train, Check Bias, Tune, Record Lineage, Register Model\n", - "[overview](#aup-overview)\n", + "## Architecture: Train, Check Bias, Tune, Record Lineage, Register Model\n", "----\n", "\n", "![train-assess-tune-register](./images/e2e-2-pipeline-v3b.png)" @@ -186,12 +163,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Develop a second model\n", - "\n", - "[overview](#aup-overview)\n", + "## Develop an Unbiased Model\n", "----\n", + "\n", "In this second model, you will fix the gender imbalance in the dataset using SMOTE and train another model using XGBoost. This model will also be saved to our registry and eventually approved for deployment." 
] }, @@ -205,826 +179,6 @@ "test = pd.read_csv(\"data/test.csv\")" ] }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
fraudvehicle_claimdriver_relationship_selfnum_witnessespolicy_deductableincident_daypolicy_state_nvpolicy_state_azauto_yearpolicy_state_or...authorities_contacted_policetotal_claim_amountincident_hourpolicy_state_cainjury_claimauthorities_contacted_ambulancepolicy_annual_premiumcustomer_gender_maledriver_relationship_othernum_claims_past_year
0021500.015750240120120...123000.02001500.002450100
1116000.01175050020170...116000.0800.002600000
204000.012750190020090...04000.0510.002450100
3019500.01175040120180...119500.01300.003000100
409500.010750220020120...09500.02010.002750100
..................................................................
399509500.003750220020140...19500.0910.003000010
399608500.013750121020150...18500.02200.003000000
3997012000.011750150020160...112000.0810.003000100
3998033000.01375040020150...136000.02013000.003000000
3999014000.01075030020080...014000.01410.012600100
\n", - "

4000 rows × 46 columns

\n", - "
" - ], - "text/plain": [ - " fraud vehicle_claim driver_relationship_self num_witnesses \\\n", - "0 0 21500.0 1 5 \n", - "1 1 16000.0 1 1 \n", - "2 0 4000.0 1 2 \n", - "3 0 19500.0 1 1 \n", - "4 0 9500.0 1 0 \n", - "... ... ... ... ... \n", - "3995 0 9500.0 0 3 \n", - "3996 0 8500.0 1 3 \n", - "3997 0 12000.0 1 1 \n", - "3998 0 33000.0 1 3 \n", - "3999 0 14000.0 1 0 \n", - "\n", - " policy_deductable incident_day policy_state_nv policy_state_az \\\n", - "0 750 24 0 1 \n", - "1 750 5 0 0 \n", - "2 750 19 0 0 \n", - "3 750 4 0 1 \n", - "4 750 22 0 0 \n", - "... ... ... ... ... \n", - "3995 750 22 0 0 \n", - "3996 750 12 1 0 \n", - "3997 750 15 0 0 \n", - "3998 750 4 0 0 \n", - "3999 750 3 0 0 \n", - "\n", - " auto_year policy_state_or ... authorities_contacted_police \\\n", - "0 2012 0 ... 1 \n", - "1 2017 0 ... 1 \n", - "2 2009 0 ... 0 \n", - "3 2018 0 ... 1 \n", - "4 2012 0 ... 0 \n", - "... ... ... ... ... \n", - "3995 2014 0 ... 1 \n", - "3996 2015 0 ... 1 \n", - "3997 2016 0 ... 1 \n", - "3998 2015 0 ... 1 \n", - "3999 2008 0 ... 0 \n", - "\n", - " total_claim_amount incident_hour policy_state_ca injury_claim \\\n", - "0 23000.0 20 0 1500.0 \n", - "1 16000.0 8 0 0.0 \n", - "2 4000.0 5 1 0.0 \n", - "3 19500.0 13 0 0.0 \n", - "4 9500.0 20 1 0.0 \n", - "... ... ... ... ... \n", - "3995 9500.0 9 1 0.0 \n", - "3996 8500.0 22 0 0.0 \n", - "3997 12000.0 8 1 0.0 \n", - "3998 36000.0 20 1 3000.0 \n", - "3999 14000.0 14 1 0.0 \n", - "\n", - " authorities_contacted_ambulance policy_annual_premium \\\n", - "0 0 2450 \n", - "1 0 2600 \n", - "2 0 2450 \n", - "3 0 3000 \n", - "4 0 2750 \n", - "... ... ... \n", - "3995 0 3000 \n", - "3996 0 3000 \n", - "3997 0 3000 \n", - "3998 0 3000 \n", - "3999 1 2600 \n", - "\n", - " customer_gender_male driver_relationship_other num_claims_past_year \n", - "0 1 0 0 \n", - "1 0 0 0 \n", - "2 1 0 0 \n", - "3 1 0 0 \n", - "4 1 0 0 \n", - "... ... ... ... 
\n", - "3995 0 1 0 \n", - "3996 0 0 0 \n", - "3997 1 0 0 \n", - "3998 0 0 0 \n", - "3999 1 0 0 \n", - "\n", - "[4000 rows x 46 columns]" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "train" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
fraudvehicle_claimdriver_relationship_selfnum_witnessespolicy_deductableincident_daypolicy_state_nvpolicy_state_azauto_yearpolicy_state_or...authorities_contacted_policetotal_claim_amountincident_hourpolicy_state_cainjury_claimauthorities_contacted_ambulancepolicy_annual_premiumcustomer_gender_maledriver_relationship_othernum_claims_past_year
008500.000750270020140...08500.01510.003000100
1016000.01175020020141...141000.08025000.003000100
207000.010750190020140...07000.0910.003000100
3017500.00075010020200...017500.0410.003000100
4017000.000750171020180...117000.01800.003000100
..................................................................
995011000.01075040020141...111000.01600.002850100
996014000.010750170120190...114000.02200.003000100
997040000.01175070020190...155000.018115000.003000100
998040000.01175030020180...040000.01800.013000101
999135000.012750211020160...135000.01600.003000101
\n", - "

1000 rows × 46 columns

\n", - "
" - ], - "text/plain": [ - " fraud vehicle_claim driver_relationship_self num_witnesses \\\n", - "0 0 8500.0 0 0 \n", - "1 0 16000.0 1 1 \n", - "2 0 7000.0 1 0 \n", - "3 0 17500.0 0 0 \n", - "4 0 17000.0 0 0 \n", - ".. ... ... ... ... \n", - "995 0 11000.0 1 0 \n", - "996 0 14000.0 1 0 \n", - "997 0 40000.0 1 1 \n", - "998 0 40000.0 1 1 \n", - "999 1 35000.0 1 2 \n", - "\n", - " policy_deductable incident_day policy_state_nv policy_state_az \\\n", - "0 750 27 0 0 \n", - "1 750 2 0 0 \n", - "2 750 19 0 0 \n", - "3 750 1 0 0 \n", - "4 750 17 1 0 \n", - ".. ... ... ... ... \n", - "995 750 4 0 0 \n", - "996 750 17 0 1 \n", - "997 750 7 0 0 \n", - "998 750 3 0 0 \n", - "999 750 21 1 0 \n", - "\n", - " auto_year policy_state_or ... authorities_contacted_police \\\n", - "0 2014 0 ... 0 \n", - "1 2014 1 ... 1 \n", - "2 2014 0 ... 0 \n", - "3 2020 0 ... 0 \n", - "4 2018 0 ... 1 \n", - ".. ... ... ... ... \n", - "995 2014 1 ... 1 \n", - "996 2019 0 ... 1 \n", - "997 2019 0 ... 1 \n", - "998 2018 0 ... 0 \n", - "999 2016 0 ... 1 \n", - "\n", - " total_claim_amount incident_hour policy_state_ca injury_claim \\\n", - "0 8500.0 15 1 0.0 \n", - "1 41000.0 8 0 25000.0 \n", - "2 7000.0 9 1 0.0 \n", - "3 17500.0 4 1 0.0 \n", - "4 17000.0 18 0 0.0 \n", - ".. ... ... ... ... \n", - "995 11000.0 16 0 0.0 \n", - "996 14000.0 22 0 0.0 \n", - "997 55000.0 18 1 15000.0 \n", - "998 40000.0 18 0 0.0 \n", - "999 35000.0 16 0 0.0 \n", - "\n", - " authorities_contacted_ambulance policy_annual_premium \\\n", - "0 0 3000 \n", - "1 0 3000 \n", - "2 0 3000 \n", - "3 0 3000 \n", - "4 0 3000 \n", - ".. ... ... \n", - "995 0 2850 \n", - "996 0 3000 \n", - "997 0 3000 \n", - "998 1 3000 \n", - "999 0 3000 \n", - "\n", - " customer_gender_male driver_relationship_other num_claims_past_year \n", - "0 1 0 0 \n", - "1 1 0 0 \n", - "2 1 0 0 \n", - "3 1 0 0 \n", - "4 1 0 0 \n", - ".. ... ... ... 
\n", - "995 1 0 0 \n", - "996 1 0 0 \n", - "997 1 0 0 \n", - "998 1 0 1 \n", - "999 1 0 1 \n", - "\n", - "[1000 rows x 46 columns]" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "test" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -1071,6 +225,28 @@ "train_data_upsampled[\"customer_gender_female\"].value_counts()" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set the hyperparameters\n", + "These are the parameters which will be sent to our training script in order to train the model. Although they are all defined as \"hyperparameters\" here, they can encompass XGBoost's [Learning Task Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters), [Tree Booster Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster), or any other parameters you'd like to configure for XGBoost." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "hyperparameters = {\n", + " \"max_depth\": \"3\",\n", + " \"eta\": \"0.2\",\n", + " \"objective\": \"binary:logistic\",\n", + " \"num_round\": \"100\",\n", + "}" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -1116,15 +292,14 @@ }, "outputs": [], "source": [ - "if 'training_job_2_name' not in locals():\n", - " \n", - " xgb_estimator.fit(inputs = {'train': train_data_upsampled_s3_path})\n", + "if \"training_job_2_name\" not in locals():\n", + "\n", + " xgb_estimator.fit(inputs={\"train\": train_data_upsampled_s3_path})\n", " training_job_2_name = xgb_estimator.latest_training_job.job_name\n", - " %store training_job_2_name\n", - " \n", + "\n", "else:\n", - " \n", - " print(f'Using previous training job: {training_job_2_name}')" + "\n", + " print(f\"Using previous training job: {training_job_2_name}\")" ] }, { @@ -1330,21 +505,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "
\n",
-    "\n",
-    "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Analyze the second model for bias and explainability\n", - "\n", - "[overview](#aup-overview)\n", + "## Analyze Model for Bias and Explainability\n", "----\n", + "\n", "Amazon SageMaker Clarify provides tools to help explain how machine learning (ML) models make predictions. These tools can help ML modelers and developers and other internal stakeholders understand model characteristics as a whole prior to deployment and to debug predictions provided by the model after it's deployed. Transparency about how ML models arrive at their predictions is also critical to consumers and regulators who need to trust the model predictions if they are going to accept the decisions based on them. SageMaker Clarify uses a model-agnostic feature attribution approach, which you can used to understand why a model made a prediction after training and to provide per-instance explanation during inference. The implementation includes a scalable and efficient implementation of SHAP ([see paper](https://papers.nips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf)), based on the concept of a Shapley value from the field of cooperative game theory that assigns each feature an importance value for a particular prediction. " ] }, @@ -1449,10 +612,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## View results of Clarify job\n", - "[overview](#aup-overview)\n", + "## View Results of Clarify Job\n", "----\n", "\n", "Running Clarify on your dataset or model can take ~15 minutes. If you don't have time to run the job, you can view the pre-generated results included with this demo. Otherwise, you can run the job by un-commenting the code in the cell above." 
@@ -1495,11 +655,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Configure and run explainability job\n", - "[overview](#aup-overview)\n", + "## Configure and Run Explainability Job\n", "----\n", + "\n", "To run the full Clarify job, you must un-comment the code in the cell below. Running the job will take ~15 minutes. If you wish to save time, you can view the results in the next cell after which loads a pre-generated output if no explainability job was run." ] }, @@ -1530,7 +688,7 @@ "\n", "# un-comment the code below to run the whole job\n", "\n", - "# if 'clarify_expl_job_name' not in locals():\n", + "# if \"clarify_expl_job_name\" not in locals():\n", "\n", "# clarify_processor.run_explainability(\n", "# data_config=explainability_data_config,\n", @@ -1554,30 +712,9 @@ }, { "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Loading pre-generated analysis file...\n", - "\n" - ] - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "if \"clarify_expl_job_name\" in locals():\n", " s3_client.download_file(\n", @@ -1615,31 +752,9 @@ }, { "cell_type": "code", - "execution_count": 102, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'Click link below to view the SageMaker Clarify report'" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "clarify_output/report.pdf
" - ], - "text/plain": [ - "/root/amazon-sagemaker-examples/end_to_end/clarify_output/report.pdf" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from IPython.display import FileLink, FileLinks\n", "\n", @@ -1680,11 +795,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Create Model Package for the Second Trained Model\n", - "[overview](#aup-overview)\n", - "----" + "## Create Model Package for the Trained Model\n", + "----\n" ] }, { @@ -1781,6 +893,7 @@ "metadata": {}, "outputs": [], "source": [ + "mpg_name = prefix\n", "mp_input_dict = {\n", " \"ModelPackageGroupName\": mpg_name,\n", " \"ModelPackageDescription\": \"XGBoost classifier to detect insurance fraud with SMOTE.\",\n", @@ -1790,8 +903,7 @@ "\n", "mp_input_dict.update(mp_inference_spec)\n", "mp2_response = sagemaker_boto_client.create_model_package(**mp_input_dict)\n", - "mp2_arn = mp2_response[\"ModelPackageArn\"]\n", - "%store mp2_arn" + "mp2_arn = mp2_response[\"ModelPackageArn\"]" ] }, { @@ -1842,9 +954,271 @@ "cell_type": "markdown", "metadata": {}, "source": [ + " \n", + "\n", + "## Architecture: Deploy and Serve Model\n", "----\n", "\n", - "### Next Notebook: [Deploy Model, Run Predictions](./4-deploy-run-inference-e2e.ipynb)" + "Now that we have trained a model, we can deploy and serve it. 
The following picture shows the architecture for doing so.\n", + "\n", + "![train-assess-tune-register](./images/e2e-3-pipeline-v3b.png)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# variables used for parameterizing the notebook run\n", + "endpoint_name = f\"{model_2_name}-endpoint\"\n", + "endpoint_instance_count = 1\n", + "endpoint_instance_type = \"ml.m4.xlarge\"\n", + "\n", + "predictor_instance_count = 1\n", + "predictor_instance_type = \"ml.c5.xlarge\"\n", + "batch_transform_instance_count = 1\n", + "batch_transform_instance_type = \"ml.c5.xlarge\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Deploy an Approved Model\n", + "----" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Approve the second model\n", + "In the real-life MLOps lifecycle, a model package gets approved after evaluation by data scientists, subject matter experts and auditors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "second_model_package = sagemaker_boto_client.list_model_packages(ModelPackageGroupName=mpg_name)[\n", + " \"ModelPackageSummaryList\"\n", + "][0]\n", + "model_package_update = {\n", + " \"ModelPackageArn\": second_model_package[\"ModelPackageArn\"],\n", + " \"ModelApprovalStatus\": \"Approved\",\n", + "}\n", + "\n", + "update_response = sagemaker_boto_client.update_model_package(**model_package_update)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Create an endpoint config and an endpoint\n", + "Deploy the endpoint. This might take about 8 minutes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "primary_container = {\"ModelPackageName\": second_model_package[\"ModelPackageArn\"]}\n", + "endpoint_config_name = f\"{model_2_name}-endpoint-config\"\n", + "existing_configs = len(\n", + " sagemaker_boto_client.list_endpoint_configs(NameContains=endpoint_config_name, MaxResults=30)[\n", + " \"EndpointConfigs\"\n", + " ]\n", + ")\n", + "\n", + "if existing_configs == 0:\n", + " create_ep_config_response = sagemaker_boto_client.create_endpoint_config(\n", + " EndpointConfigName=endpoint_config_name,\n", + " ProductionVariants=[\n", + " {\n", + " \"InstanceType\": endpoint_instance_type,\n", + " \"InitialVariantWeight\": 1,\n", + " \"InitialInstanceCount\": endpoint_instance_count,\n", + " \"ModelName\": model_2_name,\n", + " \"VariantName\": \"AllTraffic\",\n", + " }\n", + " ],\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "existing_endpoints = sagemaker_boto_client.list_endpoints(\n", + " NameContains=endpoint_name, MaxResults=30\n", + ")[\"Endpoints\"]\n", + "if not existing_endpoints:\n", + " create_endpoint_response = sagemaker_boto_client.create_endpoint(\n", + " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", + " )\n", + "\n", + "endpoint_info = sagemaker_boto_client.describe_endpoint(EndpointName=endpoint_name)\n", + "endpoint_status = endpoint_info[\"EndpointStatus\"]\n", + "\n", + "while endpoint_status == \"Creating\":\n", + " endpoint_info = sagemaker_boto_client.describe_endpoint(EndpointName=endpoint_name)\n", + " endpoint_status = endpoint_info[\"EndpointStatus\"]\n", + " print(\"Endpoint status:\", endpoint_status)\n", + " if endpoint_status == \"Creating\":\n", + " time.sleep(60)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Run Predictions on Claims\n", + "----" + ] + }, + { + "cell_type": "markdown", 
+ "metadata": {}, + "source": [ + " \n", + "\n", + "### Create a predictor" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "predictor = sagemaker.predictor.Predictor(\n", + " endpoint_name=endpoint_name, sagemaker_session=sagemaker_session\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Sample a claim from the test data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = pd.read_csv(\"data/dataset.csv\")\n", + "train = dataset.sample(frac=0.8, random_state=0)\n", + "test = dataset.drop(train.index)\n", + "sample_policy_id = int(test.sample(1)[\"policy_id\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "test.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Get Multiple Claims" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = pd.read_csv(\"./data/claims_customer.csv\")\n", + "col_order = [\"fraud\"] + list(dataset.drop([\"fraud\", \"Unnamed: 0\", \"policy_id\"], axis=1).columns)\n", + "col_order" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "col_order" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pull customer data and format the datapoint\n", + "When a customer submits an insurance claim online for instant approval, the insurance company will need to pull customer-specific data. You can pull it either from the customer data stored in a CSV file or from an online feature store, and add it to the claim data. The pulled data will serve as input for a model prediction.\n", + "\n", + "Then, the datapoint must match the exact input format the model was trained on, with all features in the correct order. 
In this example, the `col_order` variable was built in the Get Multiple Claims cell above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample_policy_id = int(test.sample(1)[\"policy_id\"])\n", + "pull_from_feature_store = False\n", + "\n", + "if pull_from_feature_store:\n", + " customers_response = featurestore_runtime.get_record(\n", + " FeatureGroupName=customers_fg_name, RecordIdentifierValueAsString=str(sample_policy_id)\n", + " )\n", + "\n", + " customer_record = customers_response[\"Record\"]\n", + " customer_df = pd.DataFrame(customer_record).set_index(\"FeatureName\")\n", + "\n", + " claims_response = featurestore_runtime.get_record(\n", + " FeatureGroupName=claims_fg_name, RecordIdentifierValueAsString=str(sample_policy_id)\n", + " )\n", + "\n", + " claims_record = claims_response[\"Record\"]\n", + " claims_df = pd.DataFrame(claims_record).set_index(\"FeatureName\")\n", + "\n", + " blended_df = pd.concat([claims_df, customer_df]).loc[col_order].drop(\"fraud\")\n", + "else:\n", + " customer_claim_df = dataset[dataset[\"policy_id\"] == sample_policy_id].sample(1)\n", + " blended_df = customer_claim_df.loc[:, col_order].drop(\"fraud\", axis=1).T.reset_index()\n", + " blended_df.columns = [\"FeatureName\", \"ValueAsString\"]\n", + "\n", + "data_input = \",\".join([str(x) for x in blended_df[\"ValueAsString\"]])\n", + "data_input" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Make prediction" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "results = predictor.predict(data_input, initial_args={\"ContentType\": \"text/csv\"})\n", + "prediction = json.loads(results)\n", + "print(f\"Probability the claim from policy {int(sample_policy_id)} is fraudulent:\", prediction)" ] }, { @@ -1858,9 +1232,9 @@ "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { - 
"display_name": "Python 3 (Data Science)", + "display_name": "conda_python3", "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" + "name": "conda_python3" }, "language_info": { "codemirror_mode": { @@ -1872,7 +1246,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" + "version": "3.6.13" } }, "nbformat": 4, diff --git a/end_to_end/fraud_detection/4-deploy-run-inference-e2e.ipynb b/end_to_end/fraud_detection/4-deploy-run-inference-e2e.ipynb deleted file mode 100644 index 8c0ede1bbd..0000000000 --- a/end_to_end/fraud_detection/4-deploy-run-inference-e2e.ipynb +++ /dev/null @@ -1,555 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Part 4 : Deploy, Run Inference, Interpret Inference" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## [Overview](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 0 : Overview, Architecture and Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 1: Data Prep, Process, Store Features](./1-data-prep-e2e.ipynb)\n", - "* [Notebook 2: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", - "* [Notebook 3: Mitigate Bias, Train New Model, Store in Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", - "* **[Notebook 4: Deploy Model, Run Predictions](./4-deploy-run-inference-e2e.ipynb)**\n", - " * **[Architecture](#deploy)**\n", - " * **[Deploy an approved model and Run Inference via Feature Store](#deploy-model)**\n", - " * **[Create a Predictor](#predictor)**\n", - " * **[Run Predictions from Online FeatureStore](#run-predictions)**\n", - "* [Notebook 5 : Create and Run an End-to-End Pipeline to Deploy the Model](./5-pipeline-e2e.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this section of 
the end to end use case, we will deploy the mitigated model that is the end-product of this fraud detection use-case. We will show how to run inference and also how to use Clarify to interpret or \"explain\" the model." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Install required and/or update third-party libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!python -m pip install -Uq pip\n", - "!python -m pip install -q awswrangler==2.2.0 imbalanced-learn==0.7.0 sagemaker==2.41.0 boto3==1.17.70" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Load stored variables\n", - "Run the cell below to load any prevously created variables. You should see a print-out of the existing variables. If you don't see anything you may need to create them again or it may be your first time running this notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r\n", - "%store" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Important: You must have run the previous sequential notebooks to retrieve variables using the StoreMagic command.**" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Import libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import json\n", - "import time\n", - "import boto3\n", - "import sagemaker\n", - "import numpy as np\n", - "import pandas as pd\n", - "import awswrangler as wr" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Set region, boto3 and SageMaker SDK variables" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# You can change this to a region of your choice\n", - "import sagemaker\n", - "\n", - "region = 
sagemaker.Session().boto_region_name\n", - "print(\"Using AWS Region: {}\".format(region))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "boto3.setup_default_session(region_name=region)\n", - "\n", - "boto_session = boto3.Session(region_name=region)\n", - "\n", - "s3_client = boto3.client(\"s3\", region_name=region)\n", - "\n", - "sagemaker_boto_client = boto_session.client(\"sagemaker\")\n", - "\n", - "sagemaker_session = sagemaker.session.Session(\n", - " boto_session=boto_session, sagemaker_client=sagemaker_boto_client\n", - ")\n", - "\n", - "sagemaker_role = sagemaker.get_execution_role()\n", - "\n", - "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# variables used for parameterizing the notebook run\n", - "endpoint_name = f\"{model_2_name}-endpoint\"\n", - "endpoint_instance_count = 1\n", - "endpoint_instance_type = \"ml.m4.xlarge\"\n", - "\n", - "predictor_instance_count = 1\n", - "predictor_instance_type = \"ml.c5.xlarge\"\n", - "batch_transform_instance_count = 1\n", - "batch_transform_instance_type = \"ml.c5.xlarge\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "## Architecture for this ML Lifecycle Stage : Train, Check Bias, Tune, Record Lineage, Register Model\n", - "[overview](#overview-4)\n", - "\n", - "![train-assess-tune-register](./images/e2e-3-pipeline-v3b.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Deploy an approved model and make prediction via Feature Store\n", - "\n", - "[overview](#overview-4)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Approve the second model\n", - "In the real-life MLOps lifecycle, a model package gets approved after evaluation by data scientists, subject matter experts and 
auditors." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "second_model_package = sagemaker_boto_client.list_model_packages(ModelPackageGroupName=mpg_name)[\n", - " \"ModelPackageSummaryList\"\n", - "][0]\n", - "model_package_update = {\n", - " \"ModelPackageArn\": second_model_package[\"ModelPackageArn\"],\n", - " \"ModelApprovalStatus\": \"Approved\",\n", - "}\n", - "\n", - "update_response = sagemaker_boto_client.update_model_package(**model_package_update)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Create an endpoint config and an endpoint\n", - "Deploy the endpoint. This might take about 8minutes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "primary_container = {'ModelPackageName': second_model_package['ModelPackageArn']}\n", - "endpoint_config_name=f'{model_2_name}-endpoint-config'\n", - "existing_configs = len(sagemaker_boto_client.list_endpoint_configs(NameContains=endpoint_config_name, MaxResults = 30)['EndpointConfigs'])\n", - "\n", - "if existing_configs == 0:\n", - " create_ep_config_response = sagemaker_boto_client.create_endpoint_config(\n", - " EndpointConfigName=endpoint_config_name,\n", - " ProductionVariants=[{\n", - " 'InstanceType': endpoint_instance_type,\n", - " 'InitialVariantWeight': 1,\n", - " 'InitialInstanceCount': endpoint_instance_count,\n", - " 'ModelName': model_2_name,\n", - " 'VariantName': 'AllTraffic'\n", - " }]\n", - " )\n", - " %store endpoint_config_name" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "existing_endpoints = sagemaker_boto_client.list_endpoints(NameContains=endpoint_name, MaxResults = 30)['Endpoints']\n", - "if not existing_endpoints:\n", - " create_endpoint_response = sagemaker_boto_client.create_endpoint(\n", - " EndpointName=endpoint_name,\n", - " 
EndpointConfigName=endpoint_config_name)\n", - " %store endpoint_name\n", - "\n", - "endpoint_info = sagemaker_boto_client.describe_endpoint(EndpointName=endpoint_name)\n", - "endpoint_status = endpoint_info['EndpointStatus']\n", - "\n", - "while endpoint_status == 'Creating':\n", - " endpoint_info = sagemaker_boto_client.describe_endpoint(EndpointName=endpoint_name)\n", - " endpoint_status = endpoint_info['EndpointStatus']\n", - " print('Endpoint status:', endpoint_status)\n", - " if endpoint_status == 'Creating':\n", - " time.sleep(60)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "### Create a predictor" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "predictor = sagemaker.predictor.Predictor(\n", - " endpoint_name=endpoint_name, sagemaker_session=sagemaker_session\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Sample a claim from the test data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "dataset = pd.read_csv(\"data/dataset.csv\")\n", - "train = dataset.sample(frac=0.8, random_state=0)\n", - "test = dataset.drop(train.index)\n", - "sample_policy_id = int(test.sample(1)[\"policy_id\"])" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "test.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Get sample's claim data from online feature store\n", - "This will simulate getting data in real-time from a customer's insurance claim submission." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "featurestore_runtime = boto_session.client(\n", - " service_name=\"sagemaker-featurestore-runtime\", region_name=region\n", - ")\n", - "\n", - "feature_store_session = sagemaker.Session(\n", - " boto_session=boto_session,\n", - " sagemaker_client=sagemaker_boto_client,\n", - " sagemaker_featurestore_runtime_client=featurestore_runtime,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - " \n", - "\n", - "## Run Predictions on Multiple Claims\n", - "\n", - "[overview](#overview-4)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "import datetime as datetime\n", - "\n", - "timer = []\n", - "MAXRECS = 100\n", - "\n", - "\n", - "def barrage_of_inference():\n", - " sample_policy_id = int(test.sample(1)[\"policy_id\"])\n", - "\n", - " temp_fg_name = \"fraud-detect-demo-claims\"\n", - "\n", - " claims_response = featurestore_runtime.get_record(\n", - " FeatureGroupName=temp_fg_name, RecordIdentifierValueAsString=str(sample_policy_id)\n", - " )\n", - "\n", - " if claims_response.get(\"Record\"):\n", - " claims_record = claims_response[\"Record\"]\n", - " claims_df = pd.DataFrame(claims_record).set_index(\"FeatureName\")\n", - " else:\n", - " print(\"No Record returned / Record Key \\n\")\n", - "\n", - " t0 = datetime.datetime.now()\n", - "\n", - " customers_response = featurestore_runtime.get_record(\n", - " FeatureGroupName=customers_fg_name, RecordIdentifierValueAsString=str(sample_policy_id)\n", - " )\n", - "\n", - " t1 = datetime.datetime.now()\n", - "\n", - " customer_record = customers_response[\"Record\"]\n", - " customer_df = pd.DataFrame(customer_record).set_index(\"FeatureName\")\n", - "\n", - " blended_df = pd.concat([claims_df, customer_df]).loc[col_order].drop(\"fraud\")\n", - " data_input = 
\",\".join(blended_df[\"ValueAsString\"])\n", - "\n", - " results = predictor.predict(data_input, initial_args={\"ContentType\": \"text/csv\"})\n", - " prediction = json.loads(results)\n", - " # print (f'Probablitity the claim from policy {int(sample_policy_id)} is fraudulent:', prediction)\n", - "\n", - " arr = t1 - t0\n", - " minutes, seconds = divmod(arr.total_seconds(), 60)\n", - "\n", - " timer.append(seconds)\n", - " # print (prediction, \" done in {} \".format(seconds))\n", - "\n", - " return sample_policy_id, prediction\n", - "\n", - "\n", - "for i in range(MAXRECS):\n", - " sample_policy_id, prediction = barrage_of_inference()\n", - " print(f\"Probablitity the claim from policy {int(sample_policy_id)} is fraudulent:\", prediction)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "timer" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note: the above \"timer\" records the first call and then subsequent calls to the online Feature Store" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import statistics\n", - "import numpy as np\n", - "\n", - "statistics.mean(timer)\n", - "\n", - "\n", - "arr = np.array(timer)\n", - "print(\n", - " \"p95: {}, p99: {}, mean: {} for {} distinct feature store gets\".format(\n", - " np.percentile(arr, 95), np.percentile(arr, 99), np.mean(arr), MAXRECS\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Pull customer data from Customers feature group\n", - "When a customer submits an insurance claim online for instant approval, the insurance company will need to pull customer-specific data from the online feature store to add to the claim data as input for a model prediction." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "customers_response = featurestore_runtime.get_record(\n", - " FeatureGroupName=customers_fg_name, RecordIdentifierValueAsString=str(sample_policy_id)\n", - ")\n", - "\n", - "customer_record = customers_response[\"Record\"]\n", - "customer_df = pd.DataFrame(customer_record).set_index(\"FeatureName\")\n", - "\n", - "\n", - "claims_response = featurestore_runtime.get_record(\n", - " FeatureGroupName=claims_fg_name, RecordIdentifierValueAsString=str(sample_policy_id)\n", - ")\n", - "\n", - "claims_record = claims_response[\"Record\"]\n", - "claims_df = pd.DataFrame(claims_record).set_index(\"FeatureName\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Format the datapoint\n", - "The datapoint must match the exact input format as the model was trained--with all features in the correct order. In this example, the `col_order` variable was saved when you created the train and test datasets earlier in the guide." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "blended_df = pd.concat([claims_df, customer_df]).loc[col_order].drop(\"fraud\")\n", - "data_input = \",\".join(blended_df[\"ValueAsString\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Make prediction" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "results = predictor.predict(data_input, initial_args={\"ContentType\": \"text/csv\"})\n", - "prediction = json.loads(results)\n", - "print(f\"Probablitity the claim from policy {int(sample_policy_id)} is fraudulent:\", prediction)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "----\n", - "\n", - "\n", - "\n", - "### Next Notebook: [Create and Run an End-to-End Pipeline to Deploy the Model](./5-pipeline-e2e.ipynb)\n", - "Now that as a Data Scientist, you've manually experimented with each step in our machine learning workflow, you can take certain steps to allow for faster model creation and deployment without sacrificing transparency and tracking via model lineage. In the next section you will create a pipeline which trains a new model on SageMaker, persists the model in SageMaker and then adds the model to the registry and deploys it as a SageMaker hosted endpoint." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (Data Science)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/end_to_end/fraud_detection/README.md b/end_to_end/fraud_detection/README.md index bd76cf5aa0..73b2804bb6 100644 --- a/end_to_end/fraud_detection/README.md +++ b/end_to_end/fraud_detection/README.md @@ -1,3 +1,149 @@ -# Architect and Build an End to End Workflow for Auto Claim Fraud Detection with SageMaker Services +# Architect and Build an End-to-End Workflow for Auto Claim Fraud Detection with SageMaker Services + +The purpose of this end-to-end example is to demonstrate how to prepare, train, and deploy a model that detects fraudulent auto insurance claims. + +## Contents +1. [Business Problem](#business-problem) +2. [Technical Solution](#nb0-solution) +3. [Solution Components](#nb0-components) +4. [Solution Architecture](#nb0-architecture) +5. [Code Resources](#nb0-code) +6. [Exploratory Data Science and Operational ML workflows](#nb0-workflows) +7. [The ML Life Cycle: Detailed View](#nb0-ml-lifecycle) + + + + +## Business Problem + + "Auto insurance fraud ranges from misrepresenting facts on insurance applications and inflating insurance claims to staging accidents and submitting claim forms for injuries or damage that never occurred, to false reports of stolen vehicles.
+Fraud accounted for between 15 percent and 17 percent of total claims payments for auto insurance bodily injury in 2012, according to an Insurance Research Council (IRC) study. The study estimated that between \$5.6 billion and \$7.7 billion was fraudulently added to paid claims for auto insurance bodily injury payments in 2012, compared with a range of \$4.3 billion to \$5.8 billion in 2002. " [source: Insurance Information Institute](https://www.iii.org/article/background-on-insurance-fraud) + +In this example, we will use an *auto insurance domain* to detect claims that are possibly fraudulent. +More precisely, we address the use case: "What is the likelihood that a given auto claim is fraudulent?", and explore the technical solution. + +As you review the notebooks and the [architectures](#nb0-architecture) presented at each stage of the ML life cycle, you will see how you can leverage SageMaker services and features to enhance your effectiveness as a data scientist, as a machine learning engineer, and as an ML Ops Engineer. + +We then perform data exploration on the synthetically generated datasets for Customers and Claims. + +Then, we provide an overview of the technical solution by examining the [Solution Components](#nb0-components) and the [Solution Architecture](#nb0-architecture). +We are motivated by the need to accomplish new tasks in ML by examining a [detailed view of the Machine Learning Lifecycle](#nb0-ml-lifecycle), recognizing the [separation of exploratory data science and operationalizing an ML workflow](#nb0-workflows). + +### Car Insurance Claims: Data Sets and Problem Domain + +The inputs for building our model and workflow are two tables of insurance data: a claims table and a customers table. This data was synthetically generated and is provided to you in its raw state for pre-processing with SageMaker Data Wrangler. However, completing the SageMaker Data Wrangler step is not required to continue with the rest of this notebook.
If you wish, you may use the `claims_preprocessed.csv` and `customers_preprocessed.csv` in the `data` directory as they are exact copies of what SageMaker Data Wrangler would output. + + + + + +## Technical Solution + +In this introduction, you will look at the technical architecture and solution components to build a solution for predicting fraudulent insurance claims and deploy it using SageMaker for real-time predictions. While a deployed model is the end-product of this notebook series, the purpose of this guide is to walk you through all the detailed stages of the [machine learning (ML) lifecycle](#nb0-ml-lifecycle) and show you what SageMaker services and features are there to support your activities in each stage. + + + + + +## Solution Components + +The following [SageMaker](https://sagemaker.readthedocs.io/en/stable/v2.html) Services are used in this solution: + + 1. [SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler/) - [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html) + 1. [SageMaker Processing](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/) - [docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html) + 1. [SageMaker Feature Store](https://aws.amazon.com/sagemaker/feature-store/) - [docs](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_featurestore.html) + 1. [SageMaker Clarify](https://aws.amazon.com/sagemaker/clarify/) - [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-processing-job-run.html) + 1. [SageMaker Training with XGBoost Algorithm and Hyperparameter Optimization](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html) - [docs](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/index.html) + 1.
[SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) - [docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-deploy.html#model-registry-deploy-api) + 1. [SageMaker Hosted Endpoints]() - [predictors - docs](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) + 1. [SageMaker Pipelines]() - [docs](https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/index.html) + +![Solution Components](images/solution-components-e2e.png) + + + + + +## Solution Architecture + +The overall architecture is shown in the diagram below. +![end to end](./images/ML-Lifecycle-v5.png) + +We will go through 5 stages of ML and explore the solution architecture of SageMaker. Each of the sequential notebooks will dive deep into the corresponding ML stage. + +### [Notebook 1](./0-AutoClaimFraudDetection.ipynb): Data Exploration + +### [Notebook 2](./1-data-prep-e2e.ipynb): Data Preparation, Ingest, Transform, Preprocess, and Store in SageMaker Feature Store + +![Solution Architecture](images/e2e-1-pipeline-v3b.png) + +### [Notebook 3](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb) and [Notebook 4](./3-mitigate-bias-train-model2-registry-e2e.ipynb) : Train, Tune, Check Pre- and Post-Training Bias, Mitigate Bias, Re-train, Deposit, and Deploy the Best Model to SageMaker Model Registry + +![Solution Architecture](images/e2e-2-pipeline-v3b.png) + +This is the architecture for model deployment. + +![Solution Architecture](images/e2e-3-pipeline-v3b.png) + +### [Pipeline Notebook](./pipeline-e2e.ipynb): End-to-End Pipeline - MLOps Pipeline to run an end-to-end automated workflow with all the design decisions made during manual/exploratory steps in previous notebooks.
+ +![Pipelines Solution Architecture](images/e2e-5-pipeline-v3b.png) + + + + + +## Code Resources + +### Stages + +Our solution is split into the following stages of the [ML Lifecycle](#nb0-ml-lifecycle), and each stage has its own notebook: + +* [Notebook 1: Data Exploration](./0-AutoClaimFraudDetection.ipynb): We first explore the data. +* [Notebook 2: Data Prep and Store](./1-data-prep-e2e.ipynb): We prepare a dataset for machine learning using SageMaker Data Wrangler, then create and deposit the datasets in a SageMaker Feature Store. +* [Notebook 3: Train, Assess Bias, Establish Lineage, Register Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb): We detect possible pre-training and post-training bias, train and tune an XGBoost model using Amazon SageMaker, and record lineage in the Model Registry so we can later deploy it. +* [Notebook 4: Mitigate Bias, Re-train, Register, Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb): We mitigate bias, retrain a less biased model, and store it in a Model Registry. We then deploy the model to an Amazon SageMaker Hosted Endpoint and run real-time inference via the SageMaker Online Feature Store. +* [Pipeline Notebook: Create and Run an MLOps Pipeline](./pipeline-e2e.ipynb): We then create a SageMaker Pipeline that ties together everything we have done so far, from outputs from Data Wrangler, Feature Store, Clarify, Model Registry and finally deployment to a SageMaker Hosted Endpoint. [--> Architecture](#nb0-pipeline) + + + + + +## The Exploratory Data Science and ML Ops Workflows + +### Exploratory Data Science and Scalable MLOps + +Note that there are typically two workflows: a manual exploratory workflow and an automated workflow. + +The *exploratory, manual data science workflow* is where experiments are conducted and various techniques and strategies are tested.
+ +After you have established your data prep, transformations, featurizations, and training algorithms, and tested various hyperparameters for model tuning, you can move to the automated workflow, where you *rely on MLOps or the ML Engineering part of your team* to streamline the process and make it more repeatable and scalable by putting it into an automated pipeline. + +![the 2 flows](images/2-flows.png) + + + + + + +## The ML Life Cycle: Detailed View + +![title](images/ML-Lifecycle-v5.png) + +The Red Boxes and Icons represent comparatively newer concepts and tasks that are now deemed important to include and execute in a production-oriented (versus research-oriented) and scalable ML lifecycle. + + These newer lifecycle tasks and their corresponding, supporting AWS Services and features include: + +1. *Data Wrangling*: SageMaker Data Wrangler for cleaning, normalizing, transforming and encoding data, as well as joining datasets. Data Wrangler outputs generated code that works with SageMaker Processing, SageMaker Pipelines, SageMaker Feature Store, or a plain Python script with pandas. + 1. Feature Engineering has always been done, but now with SageMaker Data Wrangler we can use a GUI-based tool to do so and generate code for the next phases of the lifecycle. +2. *Detect Bias*: Using SageMaker Clarify, in Data Prep or in Training we can detect pre-training and post-training bias, and eventually at Inference time provide Interpretability / Explainability of the inferences (e.g., which factors were most influential in coming up with the prediction). +3. *Feature Store (Offline)*: Once we have done all of our feature engineering, the encoding and transformations, we can then standardize features offline in SageMaker Feature Store, to be used as input features for training models. +4. *Artifact Lineage*: Using Amazon SageMaker’s Artifact Lineage features we can associate all the artifacts (data, models, parameters, etc.)
with a trained model to produce metadata that can be stored in a Model Registry. +5. *Model Registry*: SageMaker Model Registry stores the metadata around all artifacts that you have chosen to include in the process of creating your models, along with the models themselves. Later, a human approval can be used to mark the model as ready for production. This feeds into the next phase of deploy and monitor. +6. *Inference and the Online Feature Store*: For real-time inference, we can leverage the online SageMaker Feature Store we have created to get single-digit millisecond latency and high throughput for serving our model with new incoming data. +7. *Pipelines*: Once we have experimented and decided on the various options in the lifecycle (which transforms to apply to our features, imbalance or bias in the data, which algorithms to choose to train with, which hyperparameters are giving us the best performance metrics, etc.) we can now automate the various tasks across the lifecycle using SageMaker Pipelines. + 1. In this notebook, we will show a pipeline that starts with the outputs of SageMaker Data Wrangler and ends with storing trained models in the Model Registry. + 2. Typically, you could have a pipeline for data prep, one for training until model registry (which we are showing in the code associated with this blog), one for inference, and one for re-training using SageMaker Model Monitor to detect model drift and data drift and trigger a re-training using an AWS Lambda function.
+ + -1[end to end](./images/ML-Lifecycle-v5.png) \ No newline at end of file diff --git a/end_to_end/fraud_detection/create_dataset.py b/end_to_end/fraud_detection/create_dataset.py index a7732a7ea4..d338b7a860 100644 --- a/end_to_end/fraud_detection/create_dataset.py +++ b/end_to_end/fraud_detection/create_dataset.py @@ -1,9 +1,15 @@ +import sys +import subprocess +subprocess.check_call([sys.executable, "-m", "pip", "install", "sagemaker"]) + import argparse import pathlib import time import boto3 import pandas as pd +import sagemaker +from sagemaker.feature_store.feature_group import FeatureGroup # Parse argument variables passed via the CreateDataset processing step parser = argparse.ArgumentParser() @@ -23,8 +29,25 @@ account_id = boto3.client("sts").get_caller_identity()["Account"] now = pd.to_datetime("now") -claims_feature_group_s3_prefix = f'{args.bucket_prefix}/{account_id}/sagemaker/{region}/offline-store/{args.claims_table_name}/data/year={now.year}/month={now.strftime("%m")}/day={now.strftime("%d")}' -customers_feature_group_s3_prefix = f'{args.bucket_prefix}/{account_id}/sagemaker/{region}/offline-store/{args.customers_table_name}/data/year={now.year}/month={now.strftime("%m")}/day={now.strftime("%d")}' +feature_store_session = sagemaker.Session() +claims_feature_group = FeatureGroup(name=args.claims_feature_group_name, sagemaker_session=feature_store_session) +customers_feature_group = FeatureGroup( + name=args.customers_feature_group_name, sagemaker_session=feature_store_session +) + +claims_table_name = ( + claims_feature_group.describe()["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"] +) +customers_table_name = ( + customers_feature_group.describe()["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"] +) +athena_database_name = customers_feature_group.describe()["OfflineStoreConfig"]["DataCatalogConfig"]["Database"] + +print(f'claims_table_name: {claims_table_name}') +print(f'customers_table_name: {customers_table_name}') + 
+claims_feature_group_s3_prefix = f'{args.bucket_prefix}/{account_id}/sagemaker/{region}/offline-store/{claims_table_name}/data/year={now.year}/month={now.strftime("%m")}/day={now.strftime("%d")}' +customers_feature_group_s3_prefix = f'{args.bucket_prefix}/{account_id}/sagemaker/{region}/offline-store/{customers_table_name}/data/year={now.year}/month={now.strftime("%m")}/day={now.strftime("%d")}' print(f'claims_feature_group_s3_prefix: {claims_feature_group_s3_prefix}') print(f'customers_feature_group_s3_prefix: {customers_feature_group_s3_prefix}') @@ -110,7 +133,7 @@ query_string = f""" SELECT DISTINCT {training_columns_string} -FROM "{args.claims_table_name}" claims LEFT JOIN "{args.customers_table_name}" customers +FROM "{claims_table_name}" claims LEFT JOIN "{customers_table_name}" customers ON claims.policy_id = customers.policy_id """ @@ -118,7 +141,7 @@ query_execution = athena.start_query_execution( QueryString=query_string, - QueryExecutionContext={"Database": args.athena_database_name}, + QueryExecutionContext={"Database": athena_database_name}, ResultConfiguration={"OutputLocation": f"s3://{args.bucket_name}/query_results/"}, ) diff --git a/end_to_end/fraud_detection/index.rst b/end_to_end/fraud_detection/index.rst index c4f8d93740..f84b4ac680 100644 --- a/end_to_end/fraud_detection/index.rst +++ b/end_to_end/fraud_detection/index.rst @@ -6,7 +6,7 @@ Build end-to-end Examples with SageMaker Services and Features. Fraud Detection System for Auto Claims --------------------------------------------------------------------- Architect and build an end to end auto claims fraud detection example. 
-This section consists of 4 notebooks and the next one consists of 1 notebook that ties all the steps together in an automated Pipeline +This section consists of 3 notebooks and the next one consists of 1 notebook that ties all the steps together in an automated Pipeline Fraud Detection System for Auto Claims using Exploratory Data Science @@ -19,7 +19,6 @@ Fraud Detection System for Auto Claims using Exploratory Data Science 1-data-prep-e2e 2-lineage-train-assess-bias-tune-registry-e2e 3-mitigate-bias-train-model2-registry-e2e - 4-deploy-run-inference-e2e Fraud Detection System for Auto Claims using an automated Pipeline @@ -28,4 +27,4 @@ Fraud Detection System for Auto Claims using an automated Pipeline .. toctree:: :maxdepth: 1 - 5-pipeline-e2e + pipeline-e2e diff --git a/end_to_end/fraud_detection/5-pipeline-e2e.ipynb b/end_to_end/fraud_detection/pipeline-e2e.ipynb similarity index 94% rename from end_to_end/fraud_detection/5-pipeline-e2e.ipynb rename to end_to_end/fraud_detection/pipeline-e2e.ipynb index 82fe6d70ae..e7893b81b6 100644 --- a/end_to_end/fraud_detection/5-pipeline-e2e.ipynb +++ b/end_to_end/fraud_detection/pipeline-e2e.ipynb @@ -4,32 +4,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Part 5 : Create an End to End Pipeline" + "# Fraud Detection for Automobile Claims: Create an End to End Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", + "## Background\n", "\n", - "## [Overview](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 0 : Overview, Architecture and Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", - "* [Notebook 1: Data Prep, Process, Store Features](./1-data-prep-e2e.ipynb)\n", - "* [Notebook 2: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", - "* [Notebook 3: Mitigate Bias, Train New Model, Store in Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", - "* [Notebook 4: Deploy Model, Run 
Predictions](./4-deploy-run-inference-e2e.ipynb)\n", - "* **[Notebook 5 : Create and Run an End-to-End Pipeline to Deploy the Model](./5-pipeline-e2e.ipynb)**\n", - " * **[Architecture](#arch-5)**\n", - " * **[Create an Automated Pipeline](#pipelines)**\n", - " * **[Clean up](#cleanup)**" + "In this notebook, we will build a SageMaker Pipeline that automates the entire end-to-end process of preparing, training, and deploying a model that detects automobile claim fraud. For a more detailed explanation of each step of the pipeline, you can look at the series of notebooks (listed below) that implements this same process using a manual approach. Please see the [README.md](README.md) for more information about this use case implemented by this series of notebooks. \n", + "\n", + "\n", + "1. [Fraud Detection for Automobile Claims: Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", + "1. [Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)\n", + "1. [Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", + "1. [Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", + "\n", + "\n", + "## Contents\n", + "1. [Prerequisites](#Prerequisites)\n", + "1. [Architecture: Create a SageMaker Pipeline to Automate All the Steps from Data Prep to Model Deployment](#Architecture:-Create-a-SageMaker-Pipeline-to-Automate-All-the-Steps-from-Data-Prep-to-Model-Deployment)\n", + "1. [Creating an Automated Pipeline using SageMaker Pipeline](#Creating-an-Automated-Pipeline-using-SageMaker-Pipeline)\n", + "1. [Clean-Up](#Clean-Up)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In this notebook, we will build a SageMaker Pipeline that automates the entire end to end process.
Recall that we initially did all the steps in a manual way, and experimented as a data scientist: testing each segment, hands on, and determine for example, which transformations should be applied to the features, which algorithm should be selected, which hyperparamneters, etc. Now we will automate these steps, and perhaps pass on the responsibility to an ML Engineer or MLOps role." + "## Prerequisites\n", + "----" ] }, { @@ -49,31 +54,6 @@ "!python -m pip install -q awswrangler==2.2.0 imbalanced-learn==0.7.0 sagemaker==2.41.0 boto3==1.17.70" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Load stored variables\n", - "Run the cell below to load any prevously created variables. You should see a print-out of the existing variables. If you don't see anything you may need to create them again or it may be your first time running this notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r\n", - "%store" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Important: You must have run the previous sequential notebooks to retrieve variables using the StoreMagic command.**" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -94,6 +74,7 @@ "import numpy as np\n", "import pandas as pd\n", "import awswrangler as wr\n", + "import string\n", "\n", "import demo_helpers\n", "\n", @@ -136,7 +117,13 @@ ")\n", "sagemaker_role = sagemaker.get_execution_role()\n", "\n", - "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]" + "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", + "\n", + "bucket = sagemaker_session.default_bucket()\n", + "prefix = \"fraud-detect-demo\"\n", + "\n", + "claims_fg_name = f\"{prefix}-claims\"\n", + "customers_fg_name = f\"{prefix}-customers\"" ] }, { @@ -173,10 +160,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - " \n", - "\n", - "### Architecture : Create a SageMaker 
Pipeline to Automate All the Steps from Data Prep to Model Deployment\n", - "[overview](#overview-5)\n", + "## Architecture: Create a SageMaker Pipeline to Automate All the Steps from Data Prep to Model Deployment\n", + "----\n", "\n", "![End to end pipeline architecture](./images/e2e-5-pipeline-v3b.png)" ] @@ -185,21 +170,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", + "## Creating an Automated Pipeline using SageMaker Pipeline\n", "\n", - "## SageMaker Pipeline\n", - "\n", - "- [Step 1: Claims Data Wrangler Preprocessing Step](#claims-data-wrangler)\n", - "- [Step 2: Customers Data Wrangler Preprocessing step](#data-wrangler)\n", - "- [Step 3: Dataset and train test split](#dataset-train-test)\n", - "- [Step 4: Train XGboost Model](#pipe-train-xgb)\n", - "- [Step 5: Model Pre-deployment](#pipe-pre-deploy)\n", - "- [Step 6: Use Clarify to Detect Bias](#pipe-detect-bias)\n", - "- [Step 7: Register Model](#pipe-Register-Model)\n", - "- [Step 8: Combine the Pipeline Steps and Run](#define-pipeline)\n", - "\n", - "\n", - "[back to overview](#overview-5)\n", + "- [Step 1: Claims Data Wrangler Preprocessing Step](#Step-1:-Claims-Data-Wrangler-Preprocessing-Step)\n", + "- [Step 2: Customers Data Wrangler Preprocessing Step](#Step-2:-Customers-Data-Wrangler-Preprocessing-Step)\n", + "- [Step 3: Create Dataset and Train/Test Split](#Step-3:-Create-Dataset-and-Train/Test-Split)\n", + "- [Step 4: Train XGBoost Model](#Step-4:-Train-XGBoost-Model)\n", + "- [Step 5: Model Pre-Deployment Step](#Step-5:-Model-Pre-Deployment-Step)\n", + "- [Step 6: Run Bias Metrics with Clarify](#Step-6:-Run-Bias-Metrics-with-Clarify)\n", + "- [Step 7: Register Model](#Step-7:-Register-Model)\n", + "- [Step 8: Deploy Model](#Step-8:-Deploy-Model)\n", + "- [Step 9: Combine and Run the Pipeline Steps](#Step-9:-Combine-and-Run-the-Pipeline-Steps)\n", "\n" ] }, @@ -240,10 +221,68 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", + "### Step 1: Claims Data Wrangler 
Preprocessing Step" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Upload raw data to S3\n", + "Before you can preprocess the raw data with Data Wrangler, it must exist in S3." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s3_client.upload_file(\n", + " Filename=\"data/claims.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/claims.csv\"\n", + ")\n", + "s3_client.upload_file(\n", + " Filename=\"data/customers.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/customers.csv\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Update attributes within the `.flow` file \n", + "Data Wrangler generates a `.flow` file that contains a reference to the S3 bucket used during wrangling. That bucket may differ from this notebook's default bucket. For example, if someone else did the wrangling, you probably do not have access to their bucket, so you need to point the flow at your own S3 bucket before you can load the `.flow` file into Data Wrangler or access the data.\n", "\n", - "### Step 1: Claims Data Wranger Preprocessing Step\n", - "[pipeline](#pipelines)" + "After running the cell below, you can open the `claims.flow` and `customers.flow` files and export the data to S3, or you can continue the guide using the provided `data/claims_preprocessed.csv` and `data/customers_preprocessed.csv` files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "claims_flow_template_file = \"claims_flow_template\"\n", + "\n", + "with open(claims_flow_template_file, \"r\") as f:\n", + " variables = {\"bucket\": bucket, \"prefix\": prefix}\n", + " template = string.Template(f.read())\n", + " claims_flow = template.substitute(variables)\n", + " claims_flow = json.loads(claims_flow)\n", + "\n", + "with open(\"claims.flow\", \"w\") as f:\n", + " json.dump(claims_flow, f)\n", + "\n", + "customers_flow_template_file = \"customers_flow_template\"\n", + "\n", + "with open(customers_flow_template_file, \"r\") as f:\n", + " variables = {\"bucket\": bucket, \"prefix\": prefix}\n", + " template = string.Template(f.read())\n", + " customers_flow = template.substitute(variables)\n", + " customers_flow = json.loads(customers_flow)\n", + "\n", + "with open(\"customers.flow\", \"w\") as f:\n", + " json.dump(customers_flow, f)" ] }, { @@ -352,7 +391,7 @@ "\n", "# Pulls the latest data-wrangler container tag, i.e. 
\"1.x\"\n", "# The latest tested container version was \"1.11.0\"\n", - "image_uri = image_uris.retrieve(framework='data-wrangler',region=region)\n", + "image_uri = image_uris.retrieve(framework=\"data-wrangler\", region=region)\n", "\n", "print(\"image_uri: {}\".format(image_uri))\n", "\n", @@ -366,12 +405,8 @@ "\n", "output_content_type = \"CSV\"\n", "\n", - "# Output configuration used as processing job container arguments \n", - "claims_output_config = {\n", - " claims_output_name: {\n", - " \"content_type\": output_content_type\n", - " }\n", - "}\n", + "# Output configuration used as processing job container arguments\n", + "claims_output_config = {claims_output_name: {\"content_type\": output_content_type}}\n", "\n", "claims_flow_step = ProcessingStep(\n", " name=\"ClaimsDataWranglerProcessingStep\",\n", @@ -386,11 +421,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "### Step 2: Customers Data Wrangler preprocessing step\n", - "\n", - "[pipeline](#pipelines)" + "### Step 2: Customers Data Wrangler Preprocessing Step" ] }, { @@ -461,12 +492,8 @@ "\n", "output_content_type = \"CSV\"\n", "\n", - "# Output configuration used as processing job container arguments \n", - "customers_output_config = {\n", - " customers_output_name: {\n", - " \"content_type\": output_content_type\n", - " }\n", - "}\n", + "# Output configuration used as processing job container arguments\n", + "customers_output_config = {customers_output_name: {\"content_type\": output_content_type}}\n", "\n", "customers_flow_step = ProcessingStep(\n", " name=\"CustomersDataWranglerProcessingStep\",\n", @@ -481,11 +508,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "### Step 3: Create Dataset and Train/Test Split\n", - "\n", - "[pipeline](#pipelines)" + "### Step 3: Create Dataset and Train/Test Split" ] }, { @@ -527,12 +550,6 @@ " bucket,\n", " \"--bucket-prefix\",\n", " prefix,\n", - " \"--athena-database-name\",\n", - " database_name,\n", - " 
\"--claims-table-name\",\n", - " claims_table,\n", - " \"--customers-table-name\",\n", - " customers_table,\n", " \"--region\",\n", " region,\n", " ],\n", @@ -545,12 +562,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "### Step 4: Train XGBoost Model\n", - "In this step we use the ParameterString `train_instance_param` defined at the beginning of the pipeline.\n", - "\n", - "[pipeline](#pipelines)" + "In this step we use the ParameterString `train_instance_param` defined at the beginning of the pipeline.\n" ] }, { @@ -594,11 +607,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "### Step 5: Model Pre-Deployment Step\n", - "\n", - "[pipeline](#pipelines)" + "### Step 5: Model Pre-Deployment Step\n" ] }, { @@ -624,10 +633,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "### Step 6: Run Bias Metrics with Clarify\n", - "[pipeline](#pipelines)" + "### Step 6: Run Bias Metrics with Clarify\n" ] }, { @@ -726,12 +732,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", "### Step 7: Register Model\n", - "In this step you will use the ParameterString `model_approval_status` defined at the outset of the pipeline code.\n", - "\n", - "\n", - "[pipeline](#pipelines)" + "In this step you will use the ParameterString `model_approval_status` defined at the outset of the pipeline code.\n" ] }, { @@ -740,6 +742,8 @@ "metadata": {}, "outputs": [], "source": [ + "mpg_name = prefix\n", + "\n", "model_metrics = demo_helpers.ModelMetrics(\n", " bias=sagemaker.model_metrics.MetricsSource(\n", " s3_uri=clarify_step.properties.ProcessingOutputConfig.Outputs[\n", @@ -767,11 +771,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "### Step 8: Deploy Model\n", - "\n", - "\n", - "[pipeline](#pipelines)" + "### Step 8: Deploy Model" ] }, { @@ -814,10 +814,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "### Combine the Pipeline Steps and Run\n", - 
"[pipeline](#overview-5)\n", + "### Step 9: Combine and Run the Pipeline Steps\n", "\n", "Though easier to reason with, the parameters and steps don't need to be in order. The pipeline DAG will parse it out properly." ] @@ -880,7 +877,7 @@ }, "outputs": [], "source": [ - "json.loads(pipeline.describe()['PipelineDefinition'])" + "json.loads(pipeline.describe()[\"PipelineDefinition\"])" ] }, { @@ -928,7 +925,7 @@ "metadata": {}, "outputs": [], "source": [ - "start_response.wait()\n", + "start_response.wait(delay=60, max_attempts=500)\n", "start_response.describe()" ] }, @@ -964,10 +961,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "## Clean up\n", - "\n", - "[overview](#overview-5)\n", + "## Clean Up\n", "----\n", "After running the demo, you should remove the resources which were created. You can also delete all the objects in the project's S3 directory by passing the keyword argument `delete_s3_objects=True`." ] @@ -987,25 +981,22 @@ "metadata": {}, "outputs": [], "source": [ - "\"\"\"\n", "delete_project_resources(\n", " sagemaker_boto_client=sagemaker_boto_client,\n", - " endpoint_name=endpoint_name, \n", - " pipeline_name=pipeline_name, \n", - " mpg_name=mpg_name, \n", + " pipeline_name=pipeline_name,\n", + " mpg_name=mpg_name,\n", " prefix=prefix,\n", " delete_s3_objects=False,\n", - " bucket_name=bucket)\n", - "\"\"\"" + " bucket_name=bucket,\n", + ")" ] } ], "metadata": { - "instance_type": "ml.t3.medium", "kernelspec": { - "display_name": "Python 3 (Data Science)", + "display_name": "conda_python3", "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" + "name": "conda_python3" }, "language_info": { "codemirror_mode": { @@ -1017,7 +1008,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.10" + "version": "3.6.13" } }, "nbformat": 4,