1,556 changes: 40 additions & 1,516 deletions end_to_end/fraud_detection/0-AutoClaimFraudDetection.ipynb

Large diffs are not rendered by default.

600 changes: 53 additions & 547 deletions end_to_end/fraud_detection/1-data-prep-e2e.ipynb

Large diffs are not rendered by default.

@@ -4,46 +4,37 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Part 2: Train, Check Bias, Tune, Record Lineage, and Register a Model"
"# Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='aud-overview'> </a>\n",
"## Background\n",
"\n",
"This notebook is the third part of a series of notebooks that will demonstrate how to prepare, train, and deploy a model that detects fradulent auto claims. In this notebook, we will show how you can assess pre-training and post-training bias with SageMaker Clarify, Train the Model using XGBoost on SageMaker, and then finally deposit it in the Model Registry, along with the Lineage of Artifacts that were created along the way: data, code and model metadata. You can choose to run this notebook by itself or in sequence with the other notebooks listed below. Please see the [README.md](README.md) for more information about this use case implemented by this series of notebooks. \n",
"\n",
"## [Overview](./0-AutoClaimFraudDetection.ipynb)\n",
"* [Notebook 0 : Overview, Architecture and Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n",
"* [Notebook 1: Data Prep, Process, Store Features](./1-data-prep-e2e.ipynb)\n",
"* **[Notebook 2: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)**\n",
" * **[Architecture](#train)**\n",
" * **[Train a model using XGBoost](#aud-train-model)**\n",
" * **[Model lineage with artifacts and associations](#model-lineage)**\n",
" * **[Evaluate the model for bias with Clarify](#check-bias)**\n",
" * **[Deposit Model and Lineage in SageMaker Model Registry](#model-registry)**\n",
"* [Notebook 3: Mitigate Bias, Train New Model, Store in Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n",
"* [Notebook 4: Deploy Model, Run Predictions](./4-deploy-run-inference-e2e.ipynb)\n",
"* [Notebook 5 : Create and Run an End-to-End Pipeline to Deploy the Model](./5-pipeline-e2e.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section we will show how you can assess pre-training and post-training bias with SageMaker Clarify, Train the Model using XGBoost on SageMaker, and then finally deposit it in the Model Registry, along with the Lineage of Artifacts that were created along the way: data, code and model metadata.\n",
"\n",
"In this second model, you will fix the gender imbalance in the dataset using SMOTE and train another model using XGBoost. This model will also be saved to our registry and eventually approved for deployment."
"1. [Fraud Detection for Automobile Claims: Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n",
"1. [Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)\n",
"1. **[Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)**\n",
"1. [Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n",
"\n",
"## Contents\n",
"\n",
"1. [Architecture for the ML Lifecycle Stage: Train, Check Bias, Tune, Record Lineage, Register Model](#Architecture-for-the-ML-Lifecycle-Stage:-Train,-Check-Bias,-Tune,-Record-Lineage,-Register-Model)\n",
"1. [Train a Model using XGBoost](#Train-a-Model-using-XGBoost)\n",
"1. [Model Lineage with Artifacts and Associations](#Model-Lineage-with-Artifacts-and-Associations)\n",
"1. [Evaluate Model for Bias with Clarify](#Evaluate-Model-for-Bias-with-Clarify)\n",
"1. [Deposit Model and Lineage in SageMaker Model Registry](#Deposit-Model-and-Lineage-in-SageMaker-Model-Registry)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id ='train'> </a>\n",
"\n",
"## Architecture for the ML Lifecycle Stage: Train, Check Bias, Tune, Record Lineage, Register Model\n",
"[overview](#overview)\n",
"----\n",
"\n",
"![train-assess-tune-register](./images/e2e-2-pipeline-v3b.png)"
@@ -66,49 +57,6 @@
"!python -m pip install -q awswrangler==2.2.0 imbalanced-learn==0.7.0 sagemaker==2.41.0 boto3==1.17.70"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To apply the update to the current kernel, run the following code to refresh the kernel."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import IPython\n",
"\n",
"IPython.Application.instance().kernel.do_shutdown(True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load stored variables\n",
"Run the cell below to load any prevously created variables. You should see a print-out of the existing variables. If you don't see anything you may need to create them again or it may be your first time running this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%store -r\n",
"%store"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**<font color='red'>Important</font>: You must have run the previous sequential notebooks to retrieve variables using the StoreMagic command.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -184,6 +132,9 @@
"outputs": [],
"source": [
"# variables used for parameterizing the notebook run\n",
"bucket = sagemaker_session.default_bucket()\n",
"prefix = \"fraud-detect-demo\"\n",
"\n",
"estimator_output_path = f\"s3://{bucket}/{prefix}/training_jobs\"\n",
"train_instance_count = 1\n",
"train_instance_type = \"ml.m4.xlarge\"\n",
@@ -206,12 +157,32 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='aud-train-model'></a>\n",
"### Store Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_data_uri = f\"s3://{bucket}/{prefix}/data/train/train.csv\"\n",
"test_data_uri = f\"s3://{bucket}/{prefix}/data/test/test.csv\"\n",
"\n",
"## Train a model using XGBoost\n",
"\n",
"[overview](#overview)\n",
"s3_client.upload_file(\n",
" Filename=\"data/train.csv\", Bucket=bucket, Key=f\"{prefix}/data/train/train.csv\"\n",
")\n",
"s3_client.upload_file(Filename=\"data/test.csv\", Bucket=bucket, Key=f\"{prefix}/data/test/test.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train a Model using XGBoost\n",
"----\n",
"\n",
"Once the training and test datasets have been persisted in S3, you can start training a model by defining which SageMaker Estimator you'd like to use. For this guide, you will use the [XGBoost Open Source Framework](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/xgboost.html) to train your model. This estimator is accessed via the SageMaker SDK, but mirrors the open source version of the [XGBoost Python package](https://xgboost.readthedocs.io/en/latest/python/index.html). Any functioanlity provided by the XGBoost Python package can be implemented in your training script."
]
},
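The estimator setup described above can be sketched as follows. This is a minimal sketch, not the notebook's exact code: the entry-point script name, framework version, and output path are assumptions, and the SageMaker calls themselves are left commented out because they require an AWS session and an IAM role.

```python
# Hedged sketch of the XGBoost framework-estimator configuration.
# The entry-point script name and S3 output path are hypothetical.
hyperparameters = {
    "max_depth": "3",
    "eta": "0.2",
    "objective": "binary:logistic",
    "num_round": "100",
}

estimator_config = {
    "entry_point": "xgboost_starter_script.py",  # assumed training script name
    "hyperparameters": hyperparameters,
    "instance_count": 1,
    "instance_type": "ml.m4.xlarge",
    "framework_version": "1.2-1",  # assumed XGBoost framework version
    "output_path": "s3://<bucket>/fraud-detect-demo/training_jobs",  # placeholder
}

# The actual training run needs an AWS session and an execution role:
# from sagemaker.xgboost.estimator import XGBoost
# xgb_estimator = XGBoost(role=role, **estimator_config)
# xgb_estimator.fit(inputs={"train": train_data_uri})
print(estimator_config["instance_type"])  # → ml.m4.xlarge
```

Keeping the configuration in a plain dict like this also makes it easy to `%store` or parameterize across notebook runs.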
@@ -234,8 +205,7 @@
" \"eta\": \"0.2\",\n",
" \"objective\": \"binary:logistic\",\n",
" \"num_round\": \"100\",\n",
"}\n",
"%store hyperparameters"
"}"
]
},
{
@@ -275,26 +245,22 @@
},
"outputs": [],
"source": [
"if 'training_job_1_name' not in locals():\n",
" \n",
" xgb_estimator.fit(inputs = {'train': train_data_uri})\n",
"if \"training_job_1_name\" not in locals():\n",
"\n",
" xgb_estimator.fit(inputs={\"train\": train_data_uri})\n",
" training_job_1_name = xgb_estimator.latest_training_job.job_name\n",
" %store training_job_1_name\n",
" \n",
"\n",
"else:\n",
" print(f'Using previous training job: {training_job_1_name}')"
" print(f\"Using previous training job: {training_job_1_name}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='model-lineage'></a>\n",
"\n",
"## Model lineage with artifacts and associations\n",
"\n",
"[Overview](#aud-overview)\n",
"## Model Lineage with Artifacts and Associations\n",
"----\n",
"\n",
"Amazon SageMaker ML Lineage Tracking creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. With the tracking information you can reproduce the workflow steps, track model and dataset lineage, and establish model governance and audit standards. With SageMaker Lineage Tracking data scientists and model builders can do the following:\n",
"* Keep a running history of model discovery experiments.\n",
"* Establish model governance by tracking model lineage artifacts for auditing and compliance verification.\n",
@@ -308,8 +274,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='register-artifacts'></a>\n",
"\n",
"### Register artifacts"
]
},
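As a sketch of what "register artifacts" involves: the SageMaker Lineage APIs create an `Artifact` per input (code, data, model) and then associate it with the training job's trial component. The names, URIs, and ARNs below are placeholders, and the AWS calls are commented out because they require an active session.

```python
# Hypothetical artifact spec for the training script; the URI is a placeholder,
# not a path taken from the notebook.
code_artifact_spec = {
    "artifact_name": "TrainingScript",
    "source_uri": "s3://<bucket>/fraud-detect-demo/code/train.py",
    "artifact_type": "Code",
}

# from sagemaker.lineage import artifact, association
# code_artifact = artifact.Artifact.create(
#     sagemaker_session=sagemaker_session, **code_artifact_spec)
# association.Association.create(
#     source_arn=code_artifact.artifact_arn,
#     destination_arn=trial_component_arn,  # the training job's trial component
#     association_type="ContributedTo",
#     sagemaker_session=sagemaker_session)
print(code_artifact_spec["artifact_type"])  # → Code
```

The same create-then-associate pattern repeats for the training data and the model artifact, so the full lineage graph is just several of these specs wired to the same trial component.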
@@ -444,8 +408,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='Set-artifact-associations'></a>\n",
"\n",
"### Set artifact associations"
]
},
@@ -521,12 +483,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='check-bias'></a>\n",
"\n",
"## Evaluate model for bias with Clarify\n",
"\n",
"[overview](#aud-overview)\n",
"## Evaluate Model for Bias with Clarify\n",
"----\n",
"\n",
"Amazon SageMaker Clarify helps improve your machine learning (ML) models by detecting potential bias and helping explain the predictions that models make. It helps you identify various types of bias in pretraining data and in posttraining that can emerge during model training or when the model is in production. SageMaker Clarify helps explain how these models make predictions using a feature attribution approach. It also monitors inferences models make in production for bias or feature attribution drift. The fairness and explainability functionality provided by SageMaker Clarify provides components that help AWS customers build less biased and more understandable machine learning models. It also provides tools to help you generate model governance reports which you can use to inform risk and compliance teams, and external regulators. \n",
"\n",
"You can reference the [SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html) for more information about SageMaker Clarify."
@@ -546,7 +505,6 @@
"outputs": [],
"source": [
"model_1_name = f\"{prefix}-xgboost-pre-smote\"\n",
"%store model_1_name\n",
"model_matches = sagemaker_boto_client.list_models(NameContains=model_1_name)[\"Models\"]\n",
"\n",
"if not model_matches:\n",
@@ -566,8 +524,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='bias-v1'></a>\n",
"\n",
"### Check for data set bias and model bias\n",
"\n",
"With SageMaker, we can check for pre-training and post-training bias. Pre-training metrics show pre-existing bias in that data, while post-training metrics show bias in the predictions from the model. Using the SageMaker SDK, we can specify which groups we want to check bias across and which metrics we'd like to show. \n",
@@ -705,12 +661,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='model-registry'></a>\n",
"\n",
"## Deposit Model and Lineage in SageMaker Model Registry\n",
"\n",
"[overview](#aud-overview)\n",
"----\n",
"\n",
"Once a useful model has been trained and its artifacts properly associated, the next step is to save the model in a registry for future reference and possible deployment.\n"
]
},
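Depositing the model in the registry boils down to one `create_model_package` call against a model package group. The sketch below shows the shape of that request with the boto3 SageMaker client; the image URI and model-data URL are placeholders, and the calls are commented out because they require AWS access.

```python
# Hypothetical model-package request; container image and model data URL
# are placeholders, not values from this notebook.
model_package_input = {
    "ModelPackageGroupName": "fraud-detect-demo",
    "ModelApprovalStatus": "PendingManualApproval",
    "InferenceSpecification": {
        "Containers": [{
            "Image": "<xgboost-inference-image-uri>",
            "ModelDataUrl": "s3://<bucket>/fraud-detect-demo/training_jobs/<job>/output/model.tar.gz",
        }],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    },
}

# sagemaker_boto_client.create_model_package_group(
#     ModelPackageGroupName=model_package_input["ModelPackageGroupName"])
# response = sagemaker_boto_client.create_model_package(**model_package_input)
print(model_package_input["ModelApprovalStatus"])
```

Registering with `PendingManualApproval` is what lets a later notebook (or a pipeline step) flip the package to `Approved` before deployment.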
@@ -728,10 +681,9 @@
"metadata": {},
"outputs": [],
"source": [
"if 'mpg_name' not in locals():\n",
"if \"mpg_name\" not in locals():\n",
" mpg_name = prefix\n",
" %store mpg_name\n",
" print(f'Model Package Group name: {mpg_name}')"
" print(f\"Model Package Group name: {mpg_name}\")"
]
},
{
@@ -911,37 +863,13 @@
"source": [
"sagemaker_boto_client.list_model_packages(ModelPackageGroupName=mpg_name)[\"ModelPackageSummaryList\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----\n",
"\n",
"### Next Notebook: [Mitigate Bias, Train New Model, Store in Registry](./3-mitigate-bias-train-model2-registry-e2e.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To handle the imbalance, in the next notebook, we over-sample (i.e. upsample) the minority class using [SMOTE (Synthetic Minority Over-sampling Technique)](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html)."
]
},
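Since the next notebook applies SMOTE via imbalanced-learn, here is a pure-Python toy sketch of the idea behind it: synthesizing minority samples by interpolating between a minority point and one of its nearest minority-class neighbours. This stand-in is only illustrative; the real notebooks use `imblearn.over_sampling.SMOTE`.

```python
# Toy SMOTE-like oversampler: each synthetic point lies on the segment
# between a random minority sample and one of its k nearest minority
# neighbours. Data here is invented for illustration.
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
new_points = smote_like(minority, n_new=4)
print(len(new_points))  # → 4
```

Because the synthetic points interpolate between real minority samples rather than duplicating them, the oversampled class gains variety without exact repeats, which is the property that makes SMOTE preferable to naive duplication.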
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (Data Science)",
"display_name": "conda_python3",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0"
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
Expand All @@ -953,7 +881,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
"version": "3.6.13"
}
},
"nbformat": 4,