diff --git a/use-cases/customer_churn/0_cust_churn_overview_dw.ipynb b/use-cases/customer_churn/0_cust_churn_overview_dw.ipynb deleted file mode 100644 index 2f2a2f6e51..0000000000 --- a/use-cases/customer_churn/0_cust_churn_overview_dw.ipynb +++ /dev/null @@ -1,1012 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Build a Customer Churn Model for Music Streaming App Users: Overview and Data Preparation\n", - "\n", - "In this demo, you are going to learn how to use various SageMaker functionalities to build, train, and deploy the model from end to end, including data pre-processing steps like ingestion, cleaning and processing, feature engineering, training and hyperparameter tuning, model explainability, and eventually deploy the model. There are two parts of the demo: in part 1: Prepare Data, you will process the data with the help of Data Wrangler, then create features from the cleaned data. By the end of part 1, you will have a complete feature data set that contains all attributes built for each user, and it is ready for modeling. Then in part 2: Modeling and Reference, you will use the data set built from part 1 to find an optimal model for the use case, then test the model predictability with the test data. To start with Part 2, you can either read in data from the output of your Part 1 results, or use the provided 'data/full_feature_data.csv' as the input for the next steps.\n", - "\n", - "\n", - "For how to set up the SageMaker Studio Notebook environment, please check the [onboarding video]( https://www.youtube.com/watch?v=wiDHCWVrjCU&feature=youtu.be). And for a list of services covered in the use case demo, please check the documentation linked in each section.\n", - "\n", - "\n", - "## Content\n", - "* [Overview](#Overview)\n", - "* [Data Selection](#2)\n", - "* [Ingest Data](#4)\n", - "* [Data Cleaning and Data Exploration](#5)\n", - "* [Pre-processing with SageMaker Data Wrangler](#7)\n", - "* [Feature Engineering with SageMaker Processing](#6)\n", - "* [Data Splitting](#8)\n", - "* [Model Selection](#9)\n", - "* [Training with SageMaker Estimator and Experiment](#10)\n", - "* [Hyperparameter Tuning with SageMaker Hyperparameter Tuning Job](#11)\n", - "* [Deploy the model with SageMaker Batch-transform](#12)\n", - "* [Model Explainability with SageMaker Clarify](#15)\n", - "* [Optional: Automate your training and model selection with SageMaker Autopilot (Console)](#13)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "### What is Customer Churn and why is it important for businesses?\n", - "Customer churn, or customer retention/attrition, means a customer has the tendency to leave and stop paying for a business. It is one of the primary metrics companies want to track to get a sense of their customer satisfaction, especially for a subscription-based business model. The company can track churn rate (defined as the percentage of customers churned during a period) as a health indicator for the business, but we would love to identify the at-risk customers before they churn and offer appropriate treatment to keep them with the business, and this is where machine learning comes into play.\n", - "\n", - "### Use Cases for Customer Churn\n", - "\n", - "Any subscription-based business would track customer churn as one of the most critical Key Performance Indicators (KPIs). 
Such companies and industries include Telecom companies (cable, cell phone, internet, etc.), digital subscriptions of media (news, forums, blogposts platforms, etc.), music and video streaming services, and other Software as a Service (SaaS) providers (e-commerce, CRM, Mar-Tech, cloud computing, video conference provider, and visualization and data science tools, etc.)\n", - "\n", - "### Define Business problem\n", - "\n", - "To start with, here are some common business problems to consider depending on your specific use cases and your focus:\n", - "\n", - " * Will this customer churn (cancel the plan, cancel the subscription)?\n", - " * Will this customer downgrade a pricing plan?\n", - " * For a subscription business model, will a customer renew his/her subscription?\n", - "\n", - "### Machine learning problem formulation\n", - "\n", - "#### Classification: will this customer churn?\n", - "\n", - "To goal of classification is to identify the at-risk customers and sometimes their unusual behavior, such as: will this customer churn or downgrade their plan? Is there any unusual behavior for a customer? The latter question can be formulated as an anomaly detection problem.\n", - "\n", - "#### Time Series: will this customer churn in the next X months? When will this customer churn?\n", - "\n", - "You can further explore your users by formulating the problem as a time series one and detect when will the customer churn.\n", - "\n", - "### Data Requirements\n", - "\n", - "#### Data collection Sources\n", - "\n", - "Some most common data sources used to construct a data set for churn analysis are:\n", - "\n", - "* Customer Relationship Management platform (CRM), \n", - "* engagement and usage data (analytics services), \n", - "* passive feedback (ratings based on your request), and active feedback (customer support request, feedback on social media and review platforms).\n", - "\n", - "#### Construct a Data Set for Churn Analysis\n", - "\n", - "Most raw data collected from the sources mentioned above are huge and often needs a lot of cleaning and pre-processing. For example, usage data is usually event-based log data and can be more than a few gigabytes every day; you can aggregate the data to user-level daily for further analysis. Feedback and review data are mostly text data, so you would need to clean and pre-process the natural language data to be normalized, machine-readable data. If you are joining multiple data sources (especially from different platforms) together, you would want to make sure all data points are consistent, and the user identity can be matched across different platforms.\n", - " \n", - "#### Challenges with Customer Churn\n", - "\n", - "* Business related\n", - " * Importance of domain knowledge: this is critical when you start building features for the machine learning model. It is important to understand the business enough to decide which features would trigger retention.\n", - "* Data issues\n", - " * fewer churn data available (imbalanced classes): data for churn analysis is often very imbalanced as most of the customers of a business are happy customers (usually).\n", - " * User identity mapping problem: if you are joining data from different platforms (CRM, email, feedback, mobile app, and website usage data), you would want to make sure user A is recognized as the same user across multiple platforms. 
There are third-party solutions that help you tackle this problem.\n", - " * Not collecting the right data for the use case or Lacking enough data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Data Selection\n", - "\n", - "You will use generated music streaming data that is simulated to imitate music streaming user behaviors. The data simulated contains 1100 users and their user behavior for one year (2019/10/28 - 2020/10/28). Data is simulated using the [EventSim](https://github.com/Interana/eventsim) and does not contain any real user data.\n", - "\n", - "* Observation window: you will use 1 year of data to generate predictions.\n", - "* Explanation of fields:\n", - " * `ts`: event UNIX timestamp\n", - " * `userId`: a randomly assigned unique user id\n", - " * `sessionId`: a randomly assigned session id unique to each user\n", - " * `page`: event taken by the user, e.g. \"next song\", \"upgrade\", \"cancel\"\n", - " * `auth`: whether the user is a logged-in user\n", - " * `method`: request method, GET or PUT\n", - " * `status`: request status\n", - " * `level`: if the user is a free or paid user\n", - " * `itemInSession`: event happened in the session\n", - " * `location`: location of the user's IP address\n", - " * `userAgent`: agent of the user's device\n", - " * `lastName`: user's last name\n", - " * `firstName`: user's first name\n", - " * `registration`: user's time of registration\n", - " * `gender`: gender of the user\n", - " * `artist`: artist of the song the user is playing at the event\n", - " * `song`: song title the user is playing at the event\n", - " * `length`: length of the session\n", - " \n", - " \n", - " * the data will be downloaded from Github and contained in an [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) bucket." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For this specific use case, you will focus on a solution to predict whether a customer will cancel the subscription. Some possible expansion of the work includes:\n", - "\n", - "* predict plan downgrading\n", - "* when a user will churn\n", - "* add song attributes (genre, playlist, charts) and user attributes (demographics) to the data\n", - "* add user feedback and customer service requests to the data\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Architecture Diagram\n", - "\n", - "The services covered in the use case and an architecture diagram is shown below.\n", - "\n", - "
\n", - " \n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## PART 1: Prepare Data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Set Up Notebook" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install -q 'sagemaker==2.19.0' 'botocore == 1.19.4' 's3fs==0.4.2' 'sagemaker-experiments' 'boto3 == 1.16.4'\n", - "# s3fs is needed for pandas to read files from S3" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import sagemaker\n", - "import json\n", - "import pandas as pd\n", - "import glob\n", - "import s3fs\n", - "import boto3" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Parameters \n", - "The following lists configurable parameters that are used throughout the whole notebook." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sagemaker_session = sagemaker.Session()\n", - "bucket = sagemaker_session.default_bucket() # replace with your own bucket name if you have one\n", - "s3 = sagemaker_session.boto_session.resource(\"s3\")\n", - "\n", - "region = boto3.Session().region_name\n", - "role = sagemaker.get_execution_role()\n", - "smclient = boto3.Session().client(\"sagemaker\")\n", - "\n", - "prefix = \"music-streaming\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r\n", - "%store" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store bucket\n", - "%store prefix" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Ingest Data\n", - "\n", - "We ingest the simulated data from the public SageMaker S3 training database." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "##### Alternative: copy data from a public S3 bucket to your own bucket\n", - "##### data file should include full_data.csv and sample.json\n", - "#### cell 5 - 7 is not needed; the processing job before data wrangler screenshots is not needed\n", - "!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/customer-churn/customer-churn-data.zip ./data/raw/customer-churn-data.zip" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!unzip -o ./data/raw/customer-churn-data.zip -d ./data" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# unzip the partitioned data files into the same folder\n", - "!unzip -o data/simu-1.zip -d data/raw\n", - "!unzip -o data/simu-2.zip -d data/raw\n", - "!unzip -o data/simu-3.zip -d data/raw\n", - "!unzip -o data/simu-4.zip -d data/raw" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!rm ./data/raw/*.zip" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!unzip -o data/sample.zip -d data/raw" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!aws s3 cp ./data/raw s3://$bucket/$prefix/data/json/ --recursive" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Data Cleaning\n", - "\n", - "Due to the size of the data (~2GB), you will start exploring our data starting with a smaller sample, decide which pre-processing steps are necessary, and apply them to the whole dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "\n", - "# if your SageMaker Studio notebook's memory is getting full, you can run the following command to remove the raw data files from the instance and free up some memory.\n", - "# You will read data from your S3 bucket onwards and will not need the raw data stored in the instance.\n", - "os.remove(\"data/simu-1.zip\")\n", - "os.remove(\"data/simu-2.zip\")\n", - "os.remove(\"data/simu-3.zip\")\n", - "os.remove(\"data/simu-4.zip\")\n", - "os.remove(\"data/sample.zip\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample_file_name = \"./data/raw/sample.json\"\n", - "# s3_sample_file_name = \"data/json/sample.json\"\n", - "# sample_path = \"s3://{}/{}/{}\".format(bucket, prefix, s3_sample_file_name)\n", - "sample = pd.read_json(sample_file_name, lines=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample.head(2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Remove irrelevant columns\n", - "\n", - "From the first look of data, you can notice that columns `lastName`, `firstName`, `method` and `status` are not relevant features. These will be dropped from the data." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "columns_to_remove = [\"method\", \"status\", \"lastName\", \"firstName\"]\n", - "sample = sample.drop(columns=columns_to_remove)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Check for null values\n", - "\n", - "You are going to remove all events without an `userId` assigned since you are predicting which recognized user will churn from our service. In this case, all the rows(events) have a `userId` and `sessionId` assigned, but you will still run this step for the full dataset. For other columns, there are ~3% of data that are missing some demographic information of the users, and ~20% missing the song attributes, which is because the events contain not only playing a song, but also other actions including login and log out, downgrade, cancellation, etc. There are ~3% of users that do not have a registration time, so you will remove these anonymous users from the record." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"percentage of the value missing in each column is: \")\n", - "sample.isnull().sum() / len(sample)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample = sample[~sample[\"userId\"].isnull()]\n", - "sample = sample[~sample[\"registration\"].isnull()]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Data Exploration\n", - "\n", - "Let's take a look at our categorical columns first: `page`, `auth`, `level`, `location`, `userAgent`, `gender`, `artist`, and `song`, and start with looking at unique values for `page`, `auth`, `level`, and `gender` since the other three have many unique values and you will take a different approach." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "cat_columns = [\"page\", \"auth\", \"level\", \"gender\"]\n", - "cat_columns_long = [\"location\", \"userAgent\", \"artist\", \"song\", \"userId\"]\n", - "for col in cat_columns:\n", - " print(\"The unique values in column {} are: {}\".format(col, sample[col].unique()))\n", - "for col in cat_columns_long:\n", - " print(\"There are {} unique values in column {}\".format(sample[col].nunique(), col))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Key observations from the above information\n", - "\n", - "* There are 101 unique users with 72 unique locations, this information may not be useful as a categorical feature. You can parse this field and only keep State information, but even that will give us 50 unique values in this category, so you can either remove this column or bucket it to a higher level (NY --> Northeast).\n", - "* Artist and song details might not be helpful as categorical features as there are too many categories; you can quantify these to a user level, i.e. how many artists this user has listened to in total, how many songs this user has played in the last week, last month, in 180 days, in 365 days. You can also bring in external data to get song genres and other artist attributes to enrich this feature.\n", - "* In the column `page`, 'Thumbs Down', 'Thumbs Up', 'Add to Playlist', 'Roll Advert','Help', 'Add Friend', 'Downgrade', 'Upgrade', and 'Error' can all be great features to churn analysis. You will aggregate them to user-level later. 
There is a \"cancellation confirmation\" value that can be used for the churn indicator.\n", - "\n", - "* Let's take a look at the column `userAgent`:\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "UserAgent contains little useful information, but if you care about the browser type and mac/windows difference, you can parse the text and extract the information. Sometimes businesses would love to analyze user behavior based on their App version and device type (iOS v.s. Android), so these could be useful information. In this use case, for modeling purpose, we will remove this column. but you can keep it as a filter for data visualization." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "columns_to_remove = [\"location\", \"userAgent\"]\n", - "sample = sample.drop(columns=columns_to_remove)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's take a closer look at the timestamp columns `ts` and `registration`. We can convert the event timestamp `ts` to year, month, week, day, day of the week, and hour of the day. The registration time should be the same for the same user, so we can aggregate this value to user-level and create a time delta column to calculate the time between registration and the newest event." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample[\"date\"] = pd.to_datetime(sample[\"ts\"], unit=\"ms\")\n", - "sample[\"ts_year\"] = sample[\"date\"].dt.year\n", - "sample[\"ts_month\"] = sample[\"date\"].dt.month\n", - "sample[\"ts_week\"] = sample[\"date\"].dt.week\n", - "sample[\"ts_day\"] = sample[\"date\"].dt.day\n", - "sample[\"ts_dow\"] = sample[\"date\"].dt.weekday\n", - "sample[\"ts_hour\"] = sample[\"date\"].dt.hour\n", - "sample[\"ts_date_day\"] = sample[\"date\"].dt.date\n", - "sample[\"ts_is_weekday\"] = [1 if x in [0, 1, 2, 3, 4] else 0 for x in sample[\"ts_dow\"]]\n", - "sample[\"registration_ts\"] = pd.to_datetime(sample[\"registration\"], unit=\"ms\").dt.date" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Define Churn\n", - "\n", - "In this use case, you will use `page == \"Cancellation Confirmation\"` as the indicator of a user churn. You can also use `page == 'downgrade` if you are interested in users downgrading their payment plan. There are ~13% users churned, so you will need to up-sample or down-sample the full dataset to deal with the imbalanced class, or carefully choose your algorithms." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\n", - " \"There are {:.2f}% of users churned in this dataset\".format(\n", - " (\n", - " (sample[sample[\"page\"] == \"Cancellation Confirmation\"][\"userId\"].nunique())\n", - " / sample[\"userId\"].nunique()\n", - " )\n", - " * 100\n", - " )\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can label a user by adding a churn label at a event level then aggregate this value to user level. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sample[\"churned_event\"] = [1 if x == \"Cancellation Confirmation\" else 0 for x in sample[\"page\"]]\n", - "sample[\"user_churned\"] = sample.groupby(\"userId\")[\"churned_event\"].transform(\"max\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Imbalanced Class\n", - "\n", - "Imbalanced class (much more positive cases than negative cases) is very common in churn analysis. It can be misleading for some machine learning model as the accuracy will be biased towards the majority class. Some useful tactics to deal with imbalanced class are [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html), use algorithms that are less sensitive to imbalanced class like a tree-based algorithm or use a cost-sensitive algorithm that penalizes wrongly classified minority class." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "To Summarize every pre-processing steps you have covered:\n", - "* null removals\n", - "* drop irrelevant columns\n", - "* convert event timestamps to features used for analysis and modeling: year, month, week, day, day of week, hour, date, if the day is weekday or weekend, and convert registration timestamp to UTC.\n", - "* create labels (whether the user churned eventually), which is calculated by if one churn event happened in the user's history, you can label the user as a churned user (1). " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Exploring Data\n", - "\n", - "Based on the available data, look at every column, and decide if you can create a feature from it. For all the columns, here are some directions to explore:\n", - "\n", - " * `ts`: distribution of activity time: time of the day, day of the week\n", - " * `sessionId`: average number of sessions per user\n", - " * `page`: number of thumbs up/thumbs down, added to the playlist, ads, add friend, if the user has downgrade or upgrade the plan, how many errors the user has encountered.\n", - " * `level`: if the user is a free or paid user\n", - " * `registration`: days the user being active, time the user joined the service\n", - " * `gender`: gender of the user\n", - " * `artist`: average number of artists the user listened to\n", - " * `song`: average number of songs listened per user\n", - " * `length`: average time spent per day per user\n", - " \n", - "**Activity Time**\n", - "\n", - "1. Weekday v.s. weekend trends for churned users and active users. It seems like churned users are more active on weekdays than weekends whereas active users do not show a strong difference between weekday v.s. weekends. You can create some features from here: for each user, average events per day -- weekends, average events per day -- weekdays. You can also create features - average events per day of the week, but that will be converted to 7 features after one-hot-encoding, which may be less informative than the previous method.\n", - "2. In terms of hours active during a day, our simulated data did not show a significant difference between day and night for both sets of users. You can have it on your checklist for your analysis, and similarly for the day of the month, the month of the year when you have more than a year of data." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns\n", - "import matplotlib.pyplot as plt\n", - "\n", - "events_per_day_per_user = (\n", - " sample.groupby([\"userId\", \"ts_date_day\", \"ts_is_weekday\", \"user_churned\"])\n", - " .agg({\"page\": \"count\"})\n", - " .reset_index()\n", - ")\n", - "events_dist = (\n", - " events_per_day_per_user.groupby([\"userId\", \"ts_is_weekday\", \"user_churned\"])\n", - " .agg({\"page\": \"mean\"})\n", - " .reset_index()\n", - ")\n", - "\n", - "\n", - "def trend_plot(\n", - " df, plot_type, x, y, hue=None, title=None, x_axis=None, y_axis=None, xticks=None, yticks=None\n", - "):\n", - " if plot_type == \"box\":\n", - " fig = sns.boxplot(x=\"page\", y=y, data=df, hue=hue, orient=\"h\")\n", - " elif plot_type == \"bar\":\n", - " fig = sns.barplot(x=x, y=y, data=df, hue=hue)\n", - "\n", - " sns.set(rc={\"figure.figsize\": (12, 3)})\n", - " sns.set_palette(\"Set2\")\n", - " sns.set_style(\"darkgrid\")\n", - " plt.title(title)\n", - " plt.xlabel(x_axis)\n", - " plt.ylabel(y_axis)\n", - " plt.yticks([0, 1], yticks)\n", - " return plt.show(fig)\n", - "\n", - "\n", - "trend_plot(\n", - " events_dist,\n", - " \"box\",\n", - " \"page\",\n", - " \"user_churned\",\n", - " \"ts_is_weekday\",\n", - " \"Weekday V.S. Weekends - Average events per day per user\",\n", - " \"average events per user per day\",\n", - " yticks=[\"active users\", \"churned users\"],\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "events_per_hour_per_user = (\n", - " sample.groupby([\"userId\", \"ts_date_day\", \"ts_hour\", \"user_churned\"])\n", - " .agg({\"page\": \"count\"})\n", - " .reset_index()\n", - ")\n", - "events_dist = (\n", - " events_per_hour_per_user.groupby([\"userId\", \"ts_hour\", \"user_churned\"])\n", - " .agg({\"page\": \"mean\"})\n", - " .reset_index()\n", - " .groupby([\"ts_hour\", \"user_churned\"])\n", - " .agg({\"page\": \"mean\"})\n", - " .reset_index()\n", - ")\n", - "trend_plot(\n", - " events_dist,\n", - " \"bar\",\n", - " \"ts_hour\",\n", - " \"page\",\n", - " \"user_churned\",\n", - " \"Hourly activity - Average events per hour of day per user\",\n", - " \"hour of the day\",\n", - " \"average events per user per hour\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Listening Behavior**\n", - "\n", - "You can look at some basic stats for a user's listening habits. 
Churned users generally listen to a wider variety of songs and artists and spend more time on the App/be with the App longer.\n", - "* Average total: number of sessions, App usage length, number of songs listened, number of artists listened per user, number of ad days active\n", - "* Average daily: number of sessions, App usage length, number of songs listened, number of artists listened per user\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "stats_per_user = (\n", - " sample.groupby([\"userId\", \"user_churned\"])\n", - " .agg(\n", - " {\n", - " \"sessionId\": \"count\",\n", - " \"song\": \"nunique\",\n", - " \"artist\": \"nunique\",\n", - " \"length\": \"sum\",\n", - " \"ts_date_day\": \"count\",\n", - " }\n", - " )\n", - " .reset_index()\n", - ")\n", - "avg_stats_group = (\n", - " stats_per_user.groupby([\"user_churned\"])\n", - " .agg(\n", - " {\n", - " \"sessionId\": \"mean\",\n", - " \"song\": \"mean\",\n", - " \"artist\": \"mean\",\n", - " \"length\": \"mean\",\n", - " \"ts_date_day\": \"mean\",\n", - " }\n", - " )\n", - " .reset_index()\n", - ")\n", - "\n", - "print(\n", - " \"Average total: number of sessions, App usage length, number of songs listened, number of artists listened per user, days active: \"\n", - ")\n", - "avg_stats_group" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "stats_per_user = (\n", - " sample.groupby([\"userId\", \"ts_date_day\", \"user_churned\"])\n", - " .agg({\"sessionId\": \"count\", \"song\": \"nunique\", \"artist\": \"nunique\", \"length\": \"sum\"})\n", - " .reset_index()\n", - ")\n", - "avg_stats_group = (\n", - " stats_per_user.groupby([\"user_churned\"])\n", - " .agg({\"sessionId\": \"mean\", \"song\": \"mean\", \"artist\": \"mean\", \"length\": \"mean\"})\n", - " .reset_index()\n", - ")\n", - "print(\n", - " \"Average daily: number of sessions, App usage length, number of songs listened, number of artists listened per user: \"\n", - ")\n", - "avg_stats_group" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**App Usage Behavior**\n", - "\n", - "You can further explore how the users are using the App besides just listening: number of thumbs up/thumbs down, added to playlist, ads, add friend, if the user has downgrade or upgrade the plan, how many errors the user has encountered. Churned users are slightly more active than other users, and also encounter more errors, listened to more ads, and more downgrade and upgrade. These can be numerical features (number of total events per type per user), or more advanced time series numerical features (errors in last 7 days, errors in last month, etc.)." 
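As a complement, here is a hedged sketch (not in the original notebook) of the time-windowed variant mentioned above; the helper name and window size are illustrative, and it assumes the `date` and `page` columns created earlier on `sample`.

```python
import pandas as pd

def windowed_event_count(df, event, days):
    """Per-user count of `event` pages within the last `days` of the observation window."""
    cutoff = df["date"].max() - pd.Timedelta(days=days)
    recent = df[(df["date"] >= cutoff) & (df["page"] == event)]
    return recent.groupby("userId").size().rename(
        f"{event.lower().replace(' ', '_')}_last_{days}d"
    )

# for example, errors in the last 7 days per user:
# errors_last_7d = windowed_event_count(sample, "Error", 7)
```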
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "events_list = [\n", - " \"NextSong\",\n", - " \"Thumbs Down\",\n", - " \"Thumbs Up\",\n", - " \"Add to Playlist\",\n", - " \"Roll Advert\",\n", - " \"Add Friend\",\n", - " \"Downgrade\",\n", - " \"Upgrade\",\n", - " \"Error\",\n", - "]\n", - "usage_column_name = []\n", - "for event in events_list:\n", - " event_name = \"_\".join(event.split()).lower()\n", - " usage_column_name.append(event_name)\n", - " sample[event_name] = [1 if x == event else 0 for x in sample[\"page\"]]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "app_use_per_user = sample.groupby([\"userId\", \"user_churned\"])[usage_column_name].sum().reset_index()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "app_use_group = app_use_per_user.groupby([\"user_churned\"])[usage_column_name].mean().reset_index()\n", - "app_use_group" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Pre-processing with SageMaker Data Wrangler\n", - "\n", - "Now that you have a good understanding of your data and decided which steps are needed to pre-process your data, you can utilize the new Amazon SageMaker GUI tool **Data Wrangler**, without writing all the code for the SageMaker Processing Job.\n", - "\n", - "* Here we used a Processing Job to convert the raw streaming data files downloaded from the github repo (`simu-*.zip` files) to a full, CSV formatted file for Data Wrangler Ingestion purpose.\n", - "you are importing the raw streaming data files downloaded from the github repo (`simu-*.zip` files). The raw JSON files were converted to CSV format and combined to one file for Data Wrangler Ingestion purpose." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%writefile preprocessing_predw.py\n", - "\n", - "import argparse\n", - "import os\n", - "import warnings\n", - "import glob\n", - "import time\n", - "import pandas as pd\n", - "import json\n", - "import argparse\n", - "\n", - "from sklearn.exceptions import DataConversionWarning\n", - "\n", - "warnings.filterwarnings(action=\"ignore\", category=DataConversionWarning)\n", - "start_time = time.time()\n", - "\n", - "if __name__ == \"__main__\":\n", - " parser = argparse.ArgumentParser()\n", - " parser.add_argument(\"--processing-output-filename\")\n", - "\n", - " args, _ = parser.parse_known_args()\n", - " print(\"Received arguments {}\".format(args))\n", - "\n", - " input_jsons = glob.glob(\"/opt/ml/processing/input/data/**/*.json\", recursive=True)\n", - "\n", - " df_all = pd.DataFrame()\n", - " for name in input_jsons:\n", - " print(\"\\nStarting file: {}\".format(name))\n", - " df = pd.read_json(name, lines=True)\n", - " df_all = df_all.append(df)\n", - "\n", - " output_filename = args.processing_output_filename\n", - " final_features_output_path = os.path.join(\"/opt/ml/processing/output\", output_filename)\n", - " print(\"Saving processed data to {}\".format(final_features_output_path))\n", - " df_all.to_csv(final_features_output_path, header=True, index=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.sklearn.processing import SKLearnProcessor\n", - "\n", - "sklearn_processor = SKLearnProcessor(\n", - " framework_version=\"0.23-1\", role=role, instance_type=\"ml.m5.xlarge\", instance_count=1\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "s3_client = boto3.client(\"s3\")\n", - "list_response = s3_client.list_objects_v2(Bucket=bucket, Prefix=f\"{prefix}/data/json\")\n", - "s3_input_uris = [f\"s3://{bucket}/{i['Key']}\" for i in list_response[\"Contents\"]]\n", - "s3_input_uris" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", - "\n", - "processing_inputs = []\n", - "for i in s3_input_uris:\n", - " name = i.split(\"/\")[-1].split(\".\")[0]\n", - " processing_input = ProcessingInput(\n", - " source=i, input_name=name, destination=f\"/opt/ml/processing/input/data/{name}\"\n", - " )\n", - " processing_inputs.append(processing_input)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "processing_output_path = f\"s3://{bucket}/{prefix}/data/processing\"\n", - "final_features_filename = \"full_data.csv\"\n", - "\n", - "sklearn_processor.run(\n", - " code=\"preprocessing_predw.py\",\n", - " inputs=processing_inputs,\n", - " outputs=[\n", - " ProcessingOutput(\n", - " output_name=\"processed_data\",\n", - " source=\"/opt/ml/processing/output\",\n", - " destination=processing_output_path,\n", - " )\n", - " ],\n", - " arguments=[\"--processing-output-filename\", final_features_filename],\n", - ")\n", - "\n", - "preprocessing_job_description = sklearn_processor.jobs[-1].describe()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now you can initiate a Data Wrangler flow. An example flow (`dw_example.flow`) is provided in the github repo. 
\n", - "\n", - "From the SageMaker Studio launcher page, choose **New data flow**, then choose **import from S3** and select processing_output_filename. \n", - "\n", - "
\n", - "\n", - "
\n", - " \n", - "You can import any .csv format file with SageMaker Data Wrangler, preview your data, and decide what pre-processing steps are needed.\n", - "
\n", - "\n", - "
\n", - "You can choose your pre-processing steps, including drop columns and rename columns from the pre-built solutions, also customize processing and feature engineering code in the custom Pandas code block.\n", - "
\n", - "\n", - "\n", - "
\n", - "After everything run through, it will create a Processing job notebook for you. You can run through the notebook to kick off the Processing Job and check the status in the console.\n", - "\n", - "
\n", - "\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Find the data path of the SageMaker Data Wrangler Job\n", - "\n", - "You can get the results from your Data Wrangler Job, check the results, and use it as input for your feature engineering processing job." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "processing_output_filename = f\"{processing_output_path}/{final_features_filename}\"\n", - "%store processing_output_filename\n", - "%store -r" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "flow_file = \"dw_example.flow\"\n", - "\n", - "# read flow file and change the s3 location to our `processing_output_filename`\n", - "with open(flow_file, \"r\") as f:\n", - " flow = f.read()\n", - "\n", - " flow = json.loads(flow)\n", - " flow[\"nodes\"][0][\"parameters\"][\"dataset_definition\"][\"s3ExecutionContext\"][\n", - " \"s3Uri\"\n", - " ] = processing_output_filename\n", - "\n", - "with open(\"dw_example.flow\", \"w\") as f:\n", - " json.dump(flow, f)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "flow" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Citation\n", - "The data used in this notebook is simulated using the [EventSim](https://github.com/Interana/eventsim)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "instance_type": "ml.t3.medium", - "kernelspec": { - "display_name": "Python 3 (Data Science)", - "language": "python", - "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/use-cases/customer_churn/1_cust_churn_dataprep.ipynb b/use-cases/customer_churn/1_cust_churn_dataprep.ipynb index 7221e8c63e..169b3f1168 100644 --- a/use-cases/customer_churn/1_cust_churn_dataprep.ipynb +++ b/use-cases/customer_churn/1_cust_churn_dataprep.ipynb @@ -4,29 +4,26 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Build a Customer Churn Model for Music Streaming App Users: Date Pre-processing with SageMaker Data Wrangler and Processing Job\n", + "# Build a Customer Churn Model for Music Streaming App Users: Overview and Data Preparation\n", "\n", - "In this demo, you are going to learn how to use various SageMaker functionalities to build, train, and deploy the model from end to end, including data pre-processing steps like ingestion, cleaning and processing, feature engineering, training and hyperparameter tuning, model explainability, and eventually deploy the model. There are two parts of the demo: in part 1: Prepare Data, you will process the data with the help of Data Wrangler, then create features from the cleaned data. By the end of part 1, you will have a complete feature data set that contains all attributes built for each user, and it is ready for modeling. Then in part 2: Modeling and Reference, you will use the data set built from part 1 to find an optimal model for the use case, then test the model predictability with the test data. 
To start with Part 2, you can either read in data from the output of your Part 1 results, or use the provided 'data/full_feature_data.csv' as the input for the next steps.\n", + "## Background\n", "\n", + "This notebook is one of a sequence of notebooks that show you how to use various SageMaker functionalities to build, train, and deploy the model from end to end, including data pre-processing steps like ingestion, cleaning and processing, feature engineering, training and hyperparameter tuning, model explainability, and eventually deploy the model. There are two parts of the demo: \n", + "\n", + "1. Build a Customer Churn Model for Music Streaming App Users: Overview and Data Preparation (current notebook) - you will process the data with the help of Data Wrangler, then create features from the cleaned data. By the end of part 1, you will have a complete feature data set that contains all attributes built for each user, and it is ready for modeling.\n", + "1. Build a Customer Churn Model for Music Streaming App Users: Model Selection and Model Explainability - you will use the data set built from part 1 to find an optimal model for the use case, then test the model predictability with the test data. \n", "\n", "For how to set up the SageMaker Studio Notebook environment, please check the [onboarding video]( https://www.youtube.com/watch?v=wiDHCWVrjCU&feature=youtu.be). And for a list of services covered in the use case demo, please check the documentation linked in each section.\n", "\n", "\n", "## Content\n", - "\n", "* [Overview](#Overview)\n", - "* [Data Selection](#2)\n", - "* [Ingest Data](#4)\n", - "* [Data Cleaning and Data Exploration](#5)\n", - "* [Pre-processing with SageMaker Data Wrangler](#7)\n", - "* [Feature Engineering with SageMaker Processing](#6)\n", - "* [Data Splitting](#8)\n", - "* [Model Selection](#9)\n", - "* [Training with SageMaker Estimator and Experiment](#10)\n", - "* [Hyperparameter Tuning with SageMaker Hyperparameter Tuning Job](#11)\n", - "* [Deploy the model with SageMaker Batch-transform](#12)\n", - "* [Model Explainability with SageMaker Clarify](#15)\n", - "* [Optional: Automate your training and model selection with SageMaker Autopilot (Console)](#13)" + "* [Data Selection](#Data-Selection)\n", + "* [Ingest Data](#Ingest-Data)\n", + "* [Data Cleaning and Data Exploration](#Data-Cleaning)\n", + "* [Pre-processing with SageMaker Data Wrangler](#Pre-processing-with-SageMaker-Data-Wrangler)\n", + "* [Feature Engineering with SageMaker Processing](#Feature-Engineering-with-SageMaker-Processing)\n", + "* [Data Splitting](#Data-Splitting)" ] }, { @@ -36,132 +33,898 @@ "## Overview\n", "\n", "### What is Customer Churn and why is it important for businesses?\n", - "\n", "Customer churn, or customer retention/attrition, means a customer has the tendency to leave and stop paying for a business. It is one of the primary metrics companies want to track to get a sense of their customer satisfaction, especially for a subscription-based business model. The company can track churn rate (defined as the percentage of customers churned during a period) as a health indicator for the business, but we would love to identify the at-risk customers before they churn and offer appropriate treatment to keep them with the business, and this is where machine learning comes into play.\n", "\n", "### Use Cases for Customer Churn\n", "\n", - "Any subscription-based business would track customer churn as one of the most critical Key Performance Indicators (KPIs). 
Such companies and industries include Telecom companies (cable, cell phone, internet, etc.), digital subscriptions of media (news, forums, blogposts platforms, etc.), music and video streaming services, and other Software as a Service (SaaS) providers (e-commerce, CRM, Mar-Tech, cloud computing, video conference provider, and visualization and data science tools, etc.)\n", + "Any subscription-based business would track customer churn as one of the most critical Key Performance Indicators (KPIs). Such companies and industries include Telecom companies (cable, cell phone, internet, etc.), digital subscriptions of media (news, forums, blogposts platforms, etc.), music and video streaming services, and other Software as a Service (SaaS) providers (e-commerce, CRM, Mar-Tech, cloud computing, video conference provider, and visualization and data science tools, etc.)\n", + "\n", + "### Define Business problem\n", + "\n", + "To start with, here are some common business problems to consider depending on your specific use cases and your focus:\n", + "\n", + " * Will this customer churn (cancel the plan, cancel the subscription)?\n", + " * Will this customer downgrade a pricing plan?\n", + " * For a subscription business model, will a customer renew his/her subscription?\n", + "\n", + "### Machine learning problem formulation\n", + "\n", + "#### Classification: will this customer churn?\n", + "\n", + "To goal of classification is to identify the at-risk customers and sometimes their unusual behavior, such as: will this customer churn or downgrade their plan? Is there any unusual behavior for a customer? The latter question can be formulated as an anomaly detection problem.\n", + "\n", + "#### Time Series: will this customer churn in the next X months? When will this customer churn?\n", + "\n", + "You can further explore your users by formulating the problem as a time series one and detect when will the customer churn.\n", + "\n", + "### Data Requirements\n", + "\n", + "#### Data collection Sources\n", + "\n", + "Some most common data sources used to construct a data set for churn analysis are:\n", + "\n", + "* Customer Relationship Management platform (CRM), \n", + "* engagement and usage data (analytics services), \n", + "* passive feedback (ratings based on your request), and active feedback (customer support request, feedback on social media and review platforms).\n", + "\n", + "#### Construct a Data Set for Churn Analysis\n", + "\n", + "Most raw data collected from the sources mentioned above are huge and often needs a lot of cleaning and pre-processing. For example, usage data is usually event-based log data and can be more than a few gigabytes every day; you can aggregate the data to user-level daily for further analysis. Feedback and review data are mostly text data, so you would need to clean and pre-process the natural language data to be normalized, machine-readable data. If you are joining multiple data sources (especially from different platforms) together, you would want to make sure all data points are consistent, and the user identity can be matched across different platforms.\n", + " \n", + "#### Challenges with Customer Churn\n", + "\n", + "* Business related\n", + " * Importance of domain knowledge: this is critical when you start building features for the machine learning model. 
It is important to understand the business enough to decide which features would trigger retention.\n", + "* Data issues\n", + " * fewer churn data available (imbalanced classes): data for churn analysis is often very imbalanced as most of the customers of a business are happy customers (usually).\n", + " * User identity mapping problem: if you are joining data from different platforms (CRM, email, feedback, mobile app, and website usage data), you would want to make sure user A is recognized as the same user across multiple platforms. There are third-party solutions that help you tackle this problem.\n", + " * Not collecting the right data for the use case or Lacking enough data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Selection\n", + "\n", + "You will use generated music streaming data that is simulated to imitate music streaming user behaviors. The data simulated contains 1100 users and their user behavior for one year (2019/10/28 - 2020/10/28). Data is simulated using the [EventSim](https://github.com/Interana/eventsim) and does not contain any real user data.\n", + "\n", + "* Observation window: you will use 1 year of data to generate predictions.\n", + "* Explanation of fields:\n", + " * `ts`: event UNIX timestamp\n", + " * `userId`: a randomly assigned unique user id\n", + " * `sessionId`: a randomly assigned session id unique to each user\n", + " * `page`: event taken by the user, e.g. \"next song\", \"upgrade\", \"cancel\"\n", + " * `auth`: whether the user is a logged-in user\n", + " * `method`: request method, GET or PUT\n", + " * `status`: request status\n", + " * `level`: if the user is a free or paid user\n", + " * `itemInSession`: event happened in the session\n", + " * `location`: location of the user's IP address\n", + " * `userAgent`: agent of the user's device\n", + " * `lastName`: user's last name\n", + " * `firstName`: user's first name\n", + " * `registration`: user's time of registration\n", + " * `gender`: gender of the user\n", + " * `artist`: artist of the song the user is playing at the event\n", + " * `song`: song title the user is playing at the event\n", + " * `length`: length of the session\n", + " \n", + " \n", + " * the data will be downloaded from Github and contained in an [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3) bucket." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For this specific use case, you will focus on a solution to predict whether a customer will cancel the subscription. 
Some possible expansion of the work includes:\n", + "\n", + "* predict plan downgrading\n", + "* when a user will churn\n", + "* add song attributes (genre, playlist, charts) and user attributes (demographics) to the data\n", + "* add user feedback and customer service requests to the data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## PART 1: Prepare Data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set Up Notebook" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -q 's3fs==0.4.2' 'sagemaker-experiments'\n", + "!pip install --upgrade sagemaker boto3\n", + "# s3fs is needed for pandas to read files from S3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker\n", + "import json\n", + "import pandas as pd\n", + "import glob\n", + "import s3fs\n", + "import boto3\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Parameters \n", + "The following lists configurable parameters that are used throughout the whole notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sagemaker_session = sagemaker.Session()\n", + "bucket = sagemaker_session.default_bucket() # replace with your own bucket name if you have one\n", + "s3 = sagemaker_session.boto_session.resource(\"s3\")\n", + "\n", + "region = boto3.Session().region_name\n", + "role = sagemaker.get_execution_role()\n", + "smclient = boto3.Session().client(\"sagemaker\")\n", + "\n", + "prefix = \"music-streaming\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Ingest Data\n", + "\n", + "We ingest the simulated data from the public SageMaker S3 training database." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "##### Alternative: copy data from a public S3 bucket to your own bucket\n", + "##### data file should include full_data.csv and sample.json\n", + "#### cell 5 - 7 is not needed; the processing job before data wrangler screenshots is not needed\n", + "!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/customer-churn/customer-churn-data-v2.zip ./data/raw/customer-churn-data.zip" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!unzip -o ./data/raw/customer-churn-data.zip -d ./data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# unzip the partitioned data files into the same folder\n", + "!unzip -o data/simu-1.zip -d data/raw\n", + "!unzip -o data/simu-2.zip -d data/raw\n", + "!unzip -o data/simu-3.zip -d data/raw\n", + "!unzip -o data/simu-4.zip -d data/raw" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!rm ./data/raw/*.zip" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!unzip -o data/sample.zip -d data/raw" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!aws s3 cp ./data/raw s3://$bucket/$prefix/data/json/ --recursive" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Cleaning\n", + "\n", + "Due to the size of the data (~2GB), you will start exploring our data starting with a smaller sample, decide which pre-processing steps are necessary, and apply them to the whole dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# if your SageMaker Studio notebook's memory is getting full, you can run the following command to remove the raw data files from the instance and free up some memory.\n", + "# You will read data from your S3 bucket onwards and will not need the raw data stored in the instance.\n", + "os.remove(\"data/simu-1.zip\")\n", + "os.remove(\"data/simu-2.zip\")\n", + "os.remove(\"data/simu-3.zip\")\n", + "os.remove(\"data/simu-4.zip\")\n", + "os.remove(\"data/sample.zip\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample_file_name = \"./data/raw/sample.json\"\n", + "# s3_sample_file_name = \"data/json/sample.json\"\n", + "# sample_path = \"s3://{}/{}/{}\".format(bucket, prefix, s3_sample_file_name)\n", + "sample = pd.read_json(sample_file_name, lines=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample.head(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Remove irrelevant columns\n", + "\n", + "From the first look of data, you can notice that columns `lastName`, `firstName`, `method` and `status` are not relevant features. These will be dropped from the data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "columns_to_remove = [\"method\", \"status\", \"lastName\", \"firstName\"]\n", + "sample = sample.drop(columns=columns_to_remove)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Check for null values\n", + "\n", + "You are going to remove all events without an `userId` assigned since you are predicting which recognized user will churn from our service. In this case, all the rows(events) have a `userId` and `sessionId` assigned, but you will still run this step for the full dataset. For other columns, there are ~3% of data that are missing some demographic information of the users, and ~20% missing the song attributes, which is because the events contain not only playing a song, but also other actions including login and log out, downgrade, cancellation, etc. There are ~3% of users that do not have a registration time, so you will remove these anonymous users from the record." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\"percentage of the value missing in each column is: \")\n", + "sample.isnull().sum() / len(sample)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample = sample[~sample[\"userId\"].isnull()]\n", + "sample = sample[~sample[\"registration\"].isnull()]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Exploration\n", + "\n", + "Let's take a look at our categorical columns first: `page`, `auth`, `level`, `location`, `userAgent`, `gender`, `artist`, and `song`, and start with looking at unique values for `page`, `auth`, `level`, and `gender` since the other three have many unique values and you will take a different approach." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "cat_columns = [\"page\", \"auth\", \"level\", \"gender\"]\n", + "cat_columns_long = [\"location\", \"userAgent\", \"artist\", \"song\", \"userId\"]\n", + "for col in cat_columns:\n", + " print(\"The unique values in column {} are: {}\".format(col, sample[col].unique()))\n", + "for col in cat_columns_long:\n", + " print(\"There are {} unique values in column {}\".format(sample[col].nunique(), col))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Key observations from the above information\n", + "\n", + "* There are 101 unique users with 72 unique locations, this information may not be useful as a categorical feature. You can parse this field and only keep State information, but even that will give us 50 unique values in this category, so you can either remove this column or bucket it to a higher level (NY --> Northeast).\n", + "* Artist and song details might not be helpful as categorical features as there are too many categories; you can quantify these to a user level, i.e. how many artists this user has listened to in total, how many songs this user has played in the last week, last month, in 180 days, in 365 days. You can also bring in external data to get song genres and other artist attributes to enrich this feature.\n", + "* In the column `page`, 'Thumbs Down', 'Thumbs Up', 'Add to Playlist', 'Roll Advert','Help', 'Add Friend', 'Downgrade', 'Upgrade', and 'Error' can all be great features to churn analysis. You will aggregate them to user-level later. 
There is a \"cancellation confirmation\" value that can be used for the churn indicator.\n", + "\n", + "* Let's take a look at the column `userAgent`:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "UserAgent contains little useful information, but if you care about the browser type and mac/windows difference, you can parse the text and extract the information. Sometimes businesses would love to analyze user behavior based on their App version and device type (iOS v.s. Android), so these could be useful information. In this use case, for modeling purpose, we will remove this column. but you can keep it as a filter for data visualization." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "columns_to_remove = [\"location\", \"userAgent\"]\n", + "sample = sample.drop(columns=columns_to_remove)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's take a closer look at the timestamp columns `ts` and `registration`. We can convert the event timestamp `ts` to year, month, week, day, day of the week, and hour of the day. The registration time should be the same for the same user, so we can aggregate this value to user-level and create a time delta column to calculate the time between registration and the newest event." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample[\"date\"] = pd.to_datetime(sample[\"ts\"], unit=\"ms\")\n", + "sample[\"ts_year\"] = sample[\"date\"].dt.year\n", + "sample[\"ts_month\"] = sample[\"date\"].dt.month\n", + "sample[\"ts_week\"] = sample[\"date\"].dt.week\n", + "sample[\"ts_day\"] = sample[\"date\"].dt.day\n", + "sample[\"ts_dow\"] = sample[\"date\"].dt.weekday\n", + "sample[\"ts_hour\"] = sample[\"date\"].dt.hour\n", + "sample[\"ts_date_day\"] = sample[\"date\"].dt.date\n", + "sample[\"ts_is_weekday\"] = [1 if x in [0, 1, 2, 3, 4] else 0 for x in sample[\"ts_dow\"]]\n", + "sample[\"registration_ts\"] = pd.to_datetime(sample[\"registration\"], unit=\"ms\").dt.date" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Define Churn\n", + "\n", + "In this use case, you will use `page == \"Cancellation Confirmation\"` as the indicator of a user churn. You can also use `page == 'downgrade` if you are interested in users downgrading their payment plan. There are ~13% users churned, so you will need to up-sample or down-sample the full dataset to deal with the imbalanced class, or carefully choose your algorithms." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\n", + " \"There are {:.2f}% of users churned in this dataset\".format(\n", + " (\n", + " (sample[sample[\"page\"] == \"Cancellation Confirmation\"][\"userId\"].nunique())\n", + " / sample[\"userId\"].nunique()\n", + " )\n", + " * 100\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can label a user by adding a churn label at a event level then aggregate this value to user level. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sample[\"churned_event\"] = [1 if x == \"Cancellation Confirmation\" else 0 for x in sample[\"page\"]]\n", + "sample[\"user_churned\"] = sample.groupby(\"userId\")[\"churned_event\"].transform(\"max\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Imbalanced Class\n", + "\n", + "Imbalanced class (much more positive cases than negative cases) is very common in churn analysis. It can be misleading for some machine learning model as the accuracy will be biased towards the majority class. Some useful tactics to deal with imbalanced class are [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html), use algorithms that are less sensitive to imbalanced class like a tree-based algorithm or use a cost-sensitive algorithm that penalizes wrongly classified minority class." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To Summarize every pre-processing steps you have covered:\n", + "* null removals\n", + "* drop irrelevant columns\n", + "* convert event timestamps to features used for analysis and modeling: year, month, week, day, day of week, hour, date, if the day is weekday or weekend, and convert registration timestamp to UTC.\n", + "* create labels (whether the user churned eventually), which is calculated by if one churn event happened in the user's history, you can label the user as a churned user (1). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exploring Data\n", + "\n", + "Based on the available data, look at every column, and decide if you can create a feature from it. For all the columns, here are some directions to explore:\n", + "\n", + " * `ts`: distribution of activity time: time of the day, day of the week\n", + " * `sessionId`: average number of sessions per user\n", + " * `page`: number of thumbs up/thumbs down, added to the playlist, ads, add friend, if the user has downgrade or upgrade the plan, how many errors the user has encountered.\n", + " * `level`: if the user is a free or paid user\n", + " * `registration`: days the user being active, time the user joined the service\n", + " * `gender`: gender of the user\n", + " * `artist`: average number of artists the user listened to\n", + " * `song`: average number of songs listened per user\n", + " * `length`: average time spent per day per user\n", + " \n", + "**Activity Time**\n", + "\n", + "1. Weekday v.s. weekend trends for churned users and active users. It seems like churned users are more active on weekdays than weekends whereas active users do not show a strong difference between weekday v.s. weekends. You can create some features from here: for each user, average events per day -- weekends, average events per day -- weekdays. You can also create features - average events per day of the week, but that will be converted to 7 features after one-hot-encoding, which may be less informative than the previous method.\n", + "2. In terms of hours active during a day, our simulated data did not show a significant difference between day and night for both sets of users. You can have it on your checklist for your analysis, and similarly for the day of the month, the month of the year when you have more than a year of data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "events_per_day_per_user = (\n", + " sample.groupby([\"userId\", \"ts_date_day\", \"ts_is_weekday\", \"user_churned\"])\n", + " .agg({\"page\": \"count\"})\n", + " .reset_index()\n", + ")\n", + "events_dist = (\n", + " events_per_day_per_user.groupby([\"userId\", \"ts_is_weekday\", \"user_churned\"])\n", + " .agg({\"page\": \"mean\"})\n", + " .reset_index()\n", + ")\n", + "\n", + "\n", + "def trend_plot(\n", + " df, plot_type, x, y, hue=None, title=None, x_axis=None, y_axis=None, xticks=None, yticks=None\n", + "):\n", + " if plot_type == \"box\":\n", + " fig = sns.boxplot(x=\"page\", y=y, data=df, hue=hue, orient=\"h\")\n", + " elif plot_type == \"bar\":\n", + " fig = sns.barplot(x=x, y=y, data=df, hue=hue)\n", + "\n", + " sns.set(rc={\"figure.figsize\": (12, 3)})\n", + " sns.set_palette(\"Set2\")\n", + " sns.set_style(\"darkgrid\")\n", + " plt.title(title)\n", + " plt.xlabel(x_axis)\n", + " plt.ylabel(y_axis)\n", + " plt.yticks([0, 1], yticks)\n", + " return plt.show(fig)\n", + "\n", + "\n", + "trend_plot(\n", + " events_dist,\n", + " \"box\",\n", + " \"page\",\n", + " \"user_churned\",\n", + " \"ts_is_weekday\",\n", + " \"Weekday V.S. Weekends - Average events per day per user\",\n", + " \"average events per user per day\",\n", + " yticks=[\"active users\", \"churned users\"],\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "events_per_hour_per_user = (\n", + " sample.groupby([\"userId\", \"ts_date_day\", \"ts_hour\", \"user_churned\"])\n", + " .agg({\"page\": \"count\"})\n", + " .reset_index()\n", + ")\n", + "events_dist = (\n", + " events_per_hour_per_user.groupby([\"userId\", \"ts_hour\", \"user_churned\"])\n", + " .agg({\"page\": \"mean\"})\n", + " .reset_index()\n", + " .groupby([\"ts_hour\", \"user_churned\"])\n", + " .agg({\"page\": \"mean\"})\n", + " .reset_index()\n", + ")\n", + "trend_plot(\n", + " events_dist,\n", + " \"bar\",\n", + " \"ts_hour\",\n", + " \"page\",\n", + " \"user_churned\",\n", + " \"Hourly activity - Average events per hour of day per user\",\n", + " \"hour of the day\",\n", + " \"average events per user per hour\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Listening Behavior**\n", + "\n", + "You can look at some basic stats for a user's listening habits. 
Churned users generally listen to a wider variety of songs and artists and spend more time on the App/be with the App longer.\n", + "* Average total: number of sessions, App usage length, number of songs listened, number of artists listened per user, number of ad days active\n", + "* Average daily: number of sessions, App usage length, number of songs listened, number of artists listened per user\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "stats_per_user = (\n", + " sample.groupby([\"userId\", \"user_churned\"])\n", + " .agg(\n", + " {\n", + " \"sessionId\": \"count\",\n", + " \"song\": \"nunique\",\n", + " \"artist\": \"nunique\",\n", + " \"length\": \"sum\",\n", + " \"ts_date_day\": \"count\",\n", + " }\n", + " )\n", + " .reset_index()\n", + ")\n", + "avg_stats_group = (\n", + " stats_per_user.groupby([\"user_churned\"])\n", + " .agg(\n", + " {\n", + " \"sessionId\": \"mean\",\n", + " \"song\": \"mean\",\n", + " \"artist\": \"mean\",\n", + " \"length\": \"mean\",\n", + " \"ts_date_day\": \"mean\",\n", + " }\n", + " )\n", + " .reset_index()\n", + ")\n", + "\n", + "print(\n", + " \"Average total: number of sessions, App usage length, number of songs listened, number of artists listened per user, days active: \"\n", + ")\n", + "avg_stats_group" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "stats_per_user = (\n", + " sample.groupby([\"userId\", \"ts_date_day\", \"user_churned\"])\n", + " .agg({\"sessionId\": \"count\", \"song\": \"nunique\", \"artist\": \"nunique\", \"length\": \"sum\"})\n", + " .reset_index()\n", + ")\n", + "avg_stats_group = (\n", + " stats_per_user.groupby([\"user_churned\"])\n", + " .agg({\"sessionId\": \"mean\", \"song\": \"mean\", \"artist\": \"mean\", \"length\": \"mean\"})\n", + " .reset_index()\n", + ")\n", + "print(\n", + " \"Average daily: number of sessions, App usage length, number of songs listened, number of artists listened per user: \"\n", + ")\n", + "avg_stats_group" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**App Usage Behavior**\n", + "\n", + "You can further explore how the users are using the App besides just listening: number of thumbs up/thumbs down, added to playlist, ads, add friend, if the user has downgrade or upgrade the plan, how many errors the user has encountered. Churned users are slightly more active than other users, and also encounter more errors, listened to more ads, and more downgrade and upgrade. These can be numerical features (number of total events per type per user), or more advanced time series numerical features (errors in last 7 days, errors in last month, etc.)." 
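The next cells count total events per page type per user. For the time-windowed variant mentioned above (for example, errors in the last 7 days), a minimal sketch (assuming the `sample` dataframe with the `date` and `page` columns created earlier; the window is measured from the newest event in the data) could be:

```python
import pandas as pd

# Minimal sketch: count events of a given page type per user within the last n_days
# of the observation window (relative to the newest event in the dataset).
def events_last_n_days(df, page_value, n_days):
    cutoff = df["date"].max() - pd.Timedelta(days=n_days)
    recent = df[(df["date"] >= cutoff) & (df["page"] == page_value)]
    return recent.groupby("userId").size()

# For example, a num_error_7d-style feature:
num_error_7d = events_last_n_days(sample, "Error", 7)
```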
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "events_list = [\n", + " \"NextSong\",\n", + " \"Thumbs Down\",\n", + " \"Thumbs Up\",\n", + " \"Add to Playlist\",\n", + " \"Roll Advert\",\n", + " \"Add Friend\",\n", + " \"Downgrade\",\n", + " \"Upgrade\",\n", + " \"Error\",\n", + "]\n", + "usage_column_name = []\n", + "for event in events_list:\n", + " event_name = \"_\".join(event.split()).lower()\n", + " usage_column_name.append(event_name)\n", + " sample[event_name] = [1 if x == event else 0 for x in sample[\"page\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "app_use_per_user = sample.groupby([\"userId\", \"user_churned\"])[usage_column_name].sum().reset_index()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "app_use_group = app_use_per_user.groupby([\"user_churned\"])[usage_column_name].mean().reset_index()\n", + "app_use_group" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Pre-processing with SageMaker Data Wrangler\n", + "\n", + "Now that you have a good understanding of your data and decided which steps are needed to pre-process your data, you can utilize the new Amazon SageMaker GUI tool **Data Wrangler**, without writing all the code for the SageMaker Processing Job.\n", "\n", - "### Define Business problem\n", + "* Here we used a Processing Job to convert the raw streaming data files downloaded from the github repo (`simu-*.zip` files) to a full, CSV formatted file for Data Wrangler Ingestion purpose.\n", + "you are importing the raw streaming data files downloaded from the github repo (`simu-*.zip` files). The raw JSON files were converted to CSV format and combined to one file for Data Wrangler Ingestion purpose." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile preprocessing_predw.py\n", "\n", - "To start with, here are some common business problems to consider depending on your specific use cases and your focus:\n", - " * Will this customer churn (cancel the plan, cancel the subscription)?\n", - " * Will this customer downgrade a pricing plan?\n", - " * For a subscription business model, will a customer renew his/her subscription?\n", + "import argparse\n", + "import os\n", + "import warnings\n", + "import glob\n", + "import time\n", + "import pandas as pd\n", + "import json\n", + "import argparse\n", "\n", - "### Machine learning problem formulation\n", + "from sklearn.exceptions import DataConversionWarning\n", "\n", - "#### Classification: will this customer churn?\n", + "warnings.filterwarnings(action=\"ignore\", category=DataConversionWarning)\n", + "start_time = time.time()\n", "\n", - "To goal of classification is to identify the at-risk customers and sometimes their unusual behavior, such as: will this customer churn or downgrade their plan? Is there any unusual behavior for a customer? The latter question can be formulated as an anomaly detection problem.\n", + "if __name__ == \"__main__\":\n", + " parser = argparse.ArgumentParser()\n", + " parser.add_argument(\"--processing-output-filename\")\n", "\n", - "#### Time Series: will this customer churn in the next X months? 
When will this customer churn?\n", + " args, _ = parser.parse_known_args()\n", + " print(\"Received arguments {}\".format(args))\n", "\n", - "You can further explore your users by formulating the problem as a time series one and detect when will the customer churn.\n", + " input_jsons = glob.glob(\"/opt/ml/processing/input/data/**/*.json\", recursive=True)\n", "\n", - "### Data Requirements\n", + " df_all = pd.DataFrame()\n", + " for name in input_jsons:\n", + " print(\"\\nStarting file: {}\".format(name))\n", + " df = pd.read_json(name, lines=True)\n", + " df_all = df_all.append(df)\n", "\n", - "#### Data collection Sources\n", + " output_filename = args.processing_output_filename\n", + " final_features_output_path = os.path.join(\"/opt/ml/processing/output\", output_filename)\n", + " print(\"Saving processed data to {}\".format(final_features_output_path))\n", + " df_all.to_csv(final_features_output_path, header=True, index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.sklearn.processing import SKLearnProcessor\n", "\n", - "Some most common data sources used to construct a data set for churn analysis are:\n", - "* Customer Relationship Management platform (CRM), \n", - "* engagement and usage data (analytics services), \n", - "* passive feedback (ratings based on your request), and active feedback (customer support request, feedback on social media and review platforms).\n", + "sklearn_processor = SKLearnProcessor(\n", + " framework_version=\"0.23-1\", role=role, instance_type=\"ml.m5.xlarge\", instance_count=1\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s3_client = boto3.client(\"s3\")\n", + "list_response = s3_client.list_objects_v2(Bucket=bucket, Prefix=f\"{prefix}/data/json\")\n", + "s3_input_uris = [f\"s3://{bucket}/{i['Key']}\" for i in list_response[\"Contents\"]]\n", + "s3_input_uris" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "\n", - "#### Construct a Data Set for Churn Analysis\n", + "processing_inputs = []\n", + "for i in s3_input_uris:\n", + " name = i.split(\"/\")[-1].split(\".\")[0]\n", + " processing_input = ProcessingInput(\n", + " source=i, input_name=name, destination=f\"/opt/ml/processing/input/data/{name}\"\n", + " )\n", + " processing_inputs.append(processing_input)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "processing_output_path = f\"s3://{bucket}/{prefix}/data/processing\"\n", + "final_features_filename = \"full_data.csv\"\n", "\n", - "Most raw data collected from the sources mentioned above are huge and often needs a lot of cleaning and pre-processing. For example, usage data is usually event-based log data and can be more than a few gigabytes every day; you can aggregate the data to user-level daily for further analysis. Feedback and review data are mostly text data, so you would need to clean and pre-process the natural language data to be normalized, machine-readable data. 
If you are joining multiple data sources (especially from different platforms) together, you would want to make sure all data points are consistent, and the user identity can be matched across different platforms.\n", - " \n", - "#### Challenges with Customer Churn\n", + "sklearn_processor.run(\n", + " code=\"preprocessing_predw.py\",\n", + " inputs=processing_inputs,\n", + " outputs=[\n", + " ProcessingOutput(\n", + " output_name=\"processed_data\",\n", + " source=\"/opt/ml/processing/output\",\n", + " destination=processing_output_path,\n", + " )\n", + " ],\n", + " arguments=[\"--processing-output-filename\", final_features_filename],\n", + ")\n", "\n", - "* Business related\n", - " * Importance of domain knowledge: this is critical when you start building features for the machine learning model. It is important to understand the business enough to decide which features would trigger retention.\n", - "* Data issues\n", - " * fewer churn data available (imbalanced classes): data for churn analysis is often very imbalanced as most of the customers of a business are happy customers (usually).\n", - " * User identity mapping problem: if you are joining data from different platforms (CRM, email, feedback, mobile app, and website usage data), you would want to make sure user A is recognized as the same user across multiple platforms. There are third-party solutions that help you tackle this problem.\n", - " * Not collecting the right data for the use case or Lacking enough data" + "preprocessing_job_description = sklearn_processor.jobs[-1].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Use Case Study - Music Streaming User Churn Prediction\n", + "Now you can initiate a Data Wrangler flow. An example flow (`dw_example.flow`) is provided in the github repo. \n", "\n", - "\n", + "From the SageMaker Studio launcher page, choose **New data flow**, then choose **import from S3** and select processing_output_filename. \n", "\n", - "## Data Selection\n", + "
\n", + "\n", + "
\n", + " \n", + "You can import any .csv format file with SageMaker Data Wrangler, preview your data, and decide what pre-processing steps are needed.\n", + "
\n", + "\n", + "
\n", + "You can choose your pre-processing steps, including drop columns and rename columns from the pre-built solutions, also customize processing and feature engineering code in the custom Pandas code block.\n", + "
\n", + "\n", "\n", - "You will use generated music streaming data that is simulated to imitate music streaming user behaviors. The data simulated contains 1100 users and their user behavior for one year (2019/10/28 - 2020/10/28). Data is simulated using the [EventSim](https://github.com/Interana/eventsim) and does not contain any real user data.\n", + "
\n", + "After everything run through, it will create a Processing job notebook for you. You can run through the notebook to kick off the Processing Job and check the status in the console.\n", "\n", - "* Observation window: you will use 1 year of data to generate predictions.\n", - "* Explanation of fields:\n", - " * `ts`: event UNIX timestamp\n", - " * `userId`: a randomly assigned unique user id\n", - " * `sessionId`: a randomly assigned session id unique to each user\n", - " * `page`: event taken by the user, e.g. \"next song\", \"upgrade\", \"cancel\"\n", - " * `auth`: whether the user is a logged-in user\n", - " * `method`: request method, GET or PUT\n", - " * `status`: request status\n", - " * `level`: if the user is a free or paid user\n", - " * `itemInSession`: event happened in the session\n", - " * `location`: location of the user's IP address\n", - " * `userAgent`: agent of the user's device\n", - " * `lastName`: user's last name\n", - " * `firstName`: user's first name\n", - " * `registration`: user's time of registration\n", - " * `gender`: gender of the user\n", - " * `artist`: artist of the song the user is playing at the event\n", - " * `song`: song title the user is playing at the event\n", - " * `length`: length of the session\n", - " \n", - " \n", - " * the data will be downloaded from Github and contained in an [_Amazon Simple Storage Service_](https://aws.amazon.com/s3/) (Amazon S3) bucket." + "
\n", + "\n", + "\n", + "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "For this specific use case, you will focus on a solution to predict whether a customer will cancel the subscription. Some possible expansion of the work includes:\n", - "* predict plan downgrading\n", - "* when a user will churn\n", - "* add song attributes (genre, playlist, charts) and user attributes (demographics) to the data\n", - "* add user feedback and customer service requests to the data\n" + "#### Find the data path of the SageMaker Data Wrangler Job\n", + "\n", + "You can get the results from your Data Wrangler Job, check the results, and use it as input for your feature engineering processing job." ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "## Architecture Diagram\n", - "\n", - "The services covered in the use case and an architecture diagram is shown below.\n", - "\n", - "
\n", - " \n", - "\n", - "
" + "processing_output_filename = f\"{processing_output_path}/{final_features_filename}\"\n", + "processing_output_filename" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "## The output from Data Wrangler is also provided in the github repo (data/data_wrangler_output.csv).\n", - "## You can also read the provided csv directly." + "flow_file = \"dw_example.flow\"\n", + "\n", + "# read flow file and change the s3 location to our `processing_output_filename`\n", + "with open(flow_file, \"r\") as f:\n", + " flow = f.read()\n", + "\n", + " flow = json.loads(flow)\n", + " flow[\"nodes\"][0][\"parameters\"][\"dataset_definition\"][\"s3ExecutionContext\"][\n", + " \"s3Uri\"\n", + " ] = processing_output_filename\n", + "\n", + "with open(\"dw_example.flow\", \"w\") as f:\n", + " json.dump(flow, f)\n", + "flow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", - "## Feature engineering with SageMaker Processing Job\n", + "## Feature Engineering with SageMaker Processing\n", "\n", "\n", "For user churn analysis, usually, you can consider build features from the following aspects:\n", @@ -209,88 +972,11 @@ "You can find a complete guide to the SageMaker Processing job in [this blog](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/)." ] }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[31mERROR: pg8000 1.17.0 has requirement scramp==1.2.0, but you'll have scramp 1.2.2 which is incompatible.\u001b[0m\n" - ] - } - ], - "source": [ - "!pip install -q pandas=='1.1.5'" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# !pip -uQ install s3fs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%store -r\n", - "%store" - ] - }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "processing_output_filename" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [], - "source": [ - "import sagemaker\n", - "import json\n", - "import pandas as pd\n", - "import numpy as np\n", - "import glob\n", - "import boto3" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [], - "source": [ - "sagemaker_session = sagemaker.Session()\n", - "s3 = sagemaker_session.boto_session.resource(\"s3\")\n", - "\n", - "region = boto3.Session().region_name\n", - "role = sagemaker.get_execution_role()\n", - "smclient = boto3.Session().client(\"sagemaker\")\n", - "\n", - "output_path = f\"s3://{bucket}/{prefix}/data/processing/\"" - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [], "source": [ "from sagemaker.sklearn.processing import SKLearnProcessor\n", "\n", @@ -305,41 +991,31 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### SAVE THE OUTPUT FILE NAME FROM PROCESSING JOB\n", - "processing_job_output_name = 'processing_job_output.csv'\n", - "%store processing_job_output_name" + "processing_job_output_name = \"processing_job_output.csv\"" ] }, { "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - 
"text": [ - "Overwriting preprocessing.py\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "%%writefile preprocessing.py\n", "\n", + "import sys\n", + "import subprocess\n", + "\n", "import os\n", "import warnings\n", "import time\n", - "import pandas as pd\n", "import argparse\n", - "import subprocess\n", - "import sys\n", - "\n", - "subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"awswrangler\"])\n", - "import awswrangler as wr\n", + "import boto3\n", + "import pandas as pd\n", "\n", "start_time = time.time()\n", "\n", @@ -354,9 +1030,12 @@ " data_s3_uri = args.dw_output_path\n", " output_filename = args.processing_output_filename\n", "\n", - " # data_path = os.path.join('/opt/ml/processing/input', dw_output_name)\n", - " # df = pd.read_csv(data_path)\n", - " df = wr.s3.read_csv(path=data_s3_uri, dataset=True)\n", + " bucket = data_s3_uri.split(\"/\")[2]\n", + " key = \"/\".join(data_s3_uri.split(\"/\")[3:] + [\"full_data.csv\"])\n", + " s3_client = boto3.client(\"s3\")\n", + " s3_client.download_file(bucket, key, \"full_data.csv\")\n", + " df = pd.read_csv(\"full_data.csv\")\n", + "\n", " ## convert to time\n", " df[\"date\"] = pd.to_datetime(df[\"ts\"], unit=\"ms\")\n", " df[\"ts_dow\"] = df[\"date\"].dt.weekday\n", @@ -576,7 +1255,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -629,26 +1308,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Congratulations! You have completed Part1: Prepare the data, and now you should have created the complete feature set that is ready for modeling. You can proceed to Part2: modeling and Reference." + "Congratulations! You have preprocessed the data. You can proceed to modelling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## PART 2: Modeling and Reference\n", - "\n", - "now that you have created the complete feature set, you can start to explore and find a best-working model for your churn use case. By the end of part 2, you will select an algorithm, find the best sets of hyperparameter for the model, examine how well the model performs, and finally find the top influential features.\n", - "\n", - "To start with Part 2, you can either read in data from the output of your Part 1 results, or use the provided 'data/full_feature_data.csv' as the input (variable dataframe `processed_data`) for the next steps. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", "### Data Splitting\n", "\n", "You formulated the use case as a classification problem on user level, so you can randomly split your data from last step into train/validation/test. If you want to predict \"will user X churn in the next Y days\" on per user per day level, you should think about spliting data in chronological order instead of random. \n", @@ -684,7 +1350,7 @@ }, { "cell_type": "code", - "execution_count": 48, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -693,7 +1359,7 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -703,194 +1369,9 @@ }, { "cell_type": "code", - "execution_count": 49, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
userIduser_churnedaverage_events_weekendaverage_events_weekdaynum_songs_played_7dnum_ads_7dnum_error_7dnum_songs_played_30dnum_songs_played_90dnum_sessions...num_thumbs_upnum_add_to_playlistnum_adsnum_add_friendnum_downgradenum_upgradenum_errorpercentage_addays_since_activerepeats_ratio
0110010.0189.875152.60869682701428270827051...586280141621120.0013923590.589722
1110020.0141.000153.333333952209529527...82322281000.0016642650.526261
2110031.0197.500241.750000773424187734773437...5442062413811180.002576660.587665
3110041.0140.000240.888889216842216821687...136604181020.001546480.538284
\n", - "

4 rows × 27 columns

\n", - "
" - ], - "text/plain": [ - " userId user_churned average_events_weekend average_events_weekday \\\n", - "0 11001 0.0 189.875 152.608696 \n", - "1 11002 0.0 141.000 153.333333 \n", - "2 11003 1.0 197.500 241.750000 \n", - "3 11004 1.0 140.000 240.888889 \n", - "\n", - " num_songs_played_7d num_ads_7d num_error_7d num_songs_played_30d \\\n", - "0 8270 14 2 8270 \n", - "1 952 2 0 952 \n", - "2 7734 24 18 7734 \n", - "3 2168 4 2 2168 \n", - "\n", - " num_songs_played_90d num_sessions ... num_thumbs_up \\\n", - "0 8270 51 ... 586 \n", - "1 952 7 ... 82 \n", - "2 7734 37 ... 544 \n", - "3 2168 7 ... 136 \n", - "\n", - " num_add_to_playlist num_ads num_add_friend num_downgrade num_upgrade \\\n", - "0 280 14 162 1 1 \n", - "1 32 2 28 1 0 \n", - "2 206 24 138 1 1 \n", - "3 60 4 18 1 0 \n", - "\n", - " num_error percentage_ad days_since_active repeats_ratio \n", - "0 2 0.001392 359 0.589722 \n", - "1 0 0.001664 265 0.526261 \n", - "2 18 0.002576 66 0.587665 \n", - "3 2 0.001546 48 0.538284 \n", - "\n", - "[4 rows x 27 columns]" - ] - }, - "execution_count": 49, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "processed_data.head(4)" ] @@ -904,7 +1385,7 @@ }, { "cell_type": "code", - "execution_count": 50, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -919,7 +1400,7 @@ }, { "cell_type": "code", - "execution_count": 51, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -931,7 +1412,7 @@ }, { "cell_type": "code", - "execution_count": 52, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -954,7 +1435,7 @@ }, { "cell_type": "code", - "execution_count": 53, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -983,33 +1464,17 @@ ")" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Disclaimer\n", - "\n", - "The data used in this notebook is synthetic and does not contain real user data. The results (all the names, emails, IP addresses, and browser information) of this simulation are fake." - ] - }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Citation\n", - "\n", "The data used in this notebook is simulated using the [EventSim](https://github.com/Interana/eventsim)." ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { + "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "conda_python3", "language": "python", diff --git a/use-cases/customer_churn/2_cust_churn_train_deploy_infer.ipynb b/use-cases/customer_churn/2_cust_churn_train_deploy_infer.ipynb index b459fd1f29..4d5a420400 100644 --- a/use-cases/customer_churn/2_cust_churn_train_deploy_infer.ipynb +++ b/use-cases/customer_churn/2_cust_churn_train_deploy_infer.ipynb @@ -6,26 +6,23 @@ "source": [ "# Build a Customer Churn Model for Music Streaming App Users: Model Selection and Model Explainability\n", "\n", - "In this demo, you are going to learn how to use various SageMaker functionalities to build, train, and deploy the model from end to end, including data pre-processing steps like ingestion, cleaning and processing, feature engineering, training and hyperparameter tuning, model explainability, and eventually deploy the model. There are two parts of the demo: in part 1: Prepare Data, you will process the data with the help of Data Wrangler, then create features from the cleaned data. 
By the end of part 1, you will have a complete feature data set that contains all attributes built for each user, and it is ready for modeling. Then in part 2: Modeling and Reference, you will use the data set built from part 1 to find an optimal model for the use case, then test the model predictability with the test data. To start with Part 2, you can either read in data from the output of your Part 1 results, or use the provided 'data/full_feature_data.csv' as the input for the next steps.\n", + "## Background\n", "\n", + "This notebook is one of a sequence of notebooks that show you how to use various SageMaker functionalities to build, train, and deploy the model from end to end, including data pre-processing steps like ingestion, cleaning and processing, feature engineering, training and hyperparameter tuning, model explainability, and eventually deploy the model. There are two parts of the demo: \n", + "\n", + "1. Build a Customer Churn Model for Music Streaming App Users: Overview and Data Preparation - you will process the data with the help of Data Wrangler, then create features from the cleaned data. By the end of part 1, you will have a complete feature data set that contains all attributes built for each user, and it is ready for modeling.\n", + "1. Build a Customer Churn Model for Music Streaming App Users: Model Selection and Model Explainability (current notebook) - you will use the data set built from part 1 to find an optimal model for the use case, then test the model predictability with the test data. \n", "\n", "For how to set up the SageMaker Studio Notebook environment, please check the [onboarding video]( https://www.youtube.com/watch?v=wiDHCWVrjCU&feature=youtu.be). And for a list of services covered in the use case demo, please check the documentation linked in each section.\n", "\n", "\n", "## Content\n", - "* [Overview](#Overview)\n", - "* [Data Selection](#2)\n", - "* [Ingest Data](#4)\n", - "* [Data Cleaning and Data Exploration](#5)\n", - "* [Pre-processing with SageMaker Data Wrangler](#7)\n", - "* [Feature Engineering with SageMaker Processing](#6)\n", - "* [Data Splitting](#8)\n", - "* [Model Selection](#9)\n", - "* [Training with SageMaker Estimator and Experiment](#10)\n", - "* [Hyperparameter Tuning with SageMaker Hyperparameter Tuning Job](#11)\n", - "* [Deploy the model with SageMaker Batch-transform](#12)\n", - "* [Model Explainability with SageMaker Clarify](#15)\n", - "* [Optional: Automate your training and model selection with SageMaker Autopilot (Console)](#13)" + "* [Model Selection](#Model-Selection)\n", + "* [Training with SageMaker Estimator and Experiment](#Training-with-SageMaker-Estimator-and-Experiment)\n", + "* [Hyperparameter Tuning with SageMaker Hyperparameter Tuning Job](#Hyperparameter-Tuning-with-SageMaker-Hyperparameter-Tuning-Job)\n", + "* [Deploy the model with SageMaker Batch-transform](#Deploy-the-model-with-SageMaker-Batch-transform)\n", + "* [Model Explainability with SageMaker Clarify](#Model-Explainability-with-SageMaker-Clarify)\n", + "* [Optional: Automate your training and model selection with SageMaker Autopilot (Console)](#Optional:-Automate-your-training-and-model-selection-with-SageMaker-Autopilot-(Console))" ] }, { @@ -86,8 +83,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "## Model Selection\n", "\n", "You can experiment with all your model choices and see which one gives better results. 
A few things to note when you choose algorithms:\n", @@ -104,8 +99,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "## Training with SageMaker Estimator and Experiment\n", "\n", "Once you decide on a range of models you want to experiment with, you can start training and comparing model results to choose the best one. A few things left for you to make a decision:\n", @@ -127,13 +120,12 @@ "metadata": {}, "outputs": [], "source": [ - "%store -r\n", - "%store" + "! pip install sagemaker-experiments" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -142,12 +134,14 @@ "import pandas as pd\n", "import glob\n", "import s3fs\n", - "import boto3" + "import boto3\n", + "from datetime import datetime\n", + "import os" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -156,17 +150,109 @@ "\n", "region = boto3.Session().region_name\n", "role = sagemaker.get_execution_role()\n", - "smclient = boto3.Session().client(\"sagemaker\")" + "smclient = boto3.Session().client(\"sagemaker\")\n", + "bucket = sagemaker_session.default_bucket()\n", + "prefix = \"music-streaming\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Download Data and Upload to S3\n", + "\n", + "We ingest the simulated data from the public SageMaker S3 training database. If you want to see how the train, test, and validation datasets are created in detail, look at [Build a Customer Churn Model for Music Streaming App Users: Overview and Data Preparation](0_cust_churn_overview_dw.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "##### Alternative: copy data from a public S3 bucket to your own bucket\n", + "##### data file should include full_data.csv and sample.json\n", + "#### cell 5 - 7 is not needed; the processing job before data wrangler screenshots is not needed\n", + "!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/customer-churn/customer-churn-data-v2.zip ./data/raw/customer-churn-data.zip" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "container = sagemaker.image_uris.retrieve(\n", - " \"xgboost\", region, version=\"1.0-1\", instance_type=\"ml.m4.xlarge\"\n", + "!unzip -o ./data/raw/customer-churn-data.zip -d ./data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# unzip the partitioned data files into the same folder\n", + "!unzip -o data/simu-1.zip -d data/raw\n", + "!unzip -o data/simu-2.zip -d data/raw\n", + "!unzip -o data/simu-3.zip -d data/raw\n", + "!unzip -o data/simu-4.zip -d data/raw" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!rm ./data/raw/*.zip" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!unzip -o data/sample.zip -d data/raw" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!aws s3 cp ./data/raw s3://$bucket/$prefix/data/json/ --recursive" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s3_input_train = (\n", + " boto3.Session()\n", + " .resource(\"s3\")\n", + " .Bucket(bucket)\n", + " 
.Object(os.path.join(prefix, \"train/train.csv\"))\n", + " .upload_file(\"data/train_updated.csv\")\n", + ")\n", + "s3_input_validation = (\n", + " boto3.Session()\n", + " .resource(\"s3\")\n", + " .Bucket(bucket)\n", + " .Object(os.path.join(prefix, \"validation/validation.csv\"))\n", + " .upload_file(\"data/validation_updated.csv\")\n", + ")\n", + "s3_input_validation = (\n", + " boto3.Session()\n", + " .resource(\"s3\")\n", + " .Bucket(bucket)\n", + " .Object(os.path.join(prefix, \"test/test_labeled.csv\"))\n", + " .upload_file(\"data/test_updated.csv\")\n", ")" ] }, @@ -179,7 +265,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -205,22 +291,18 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 66 µs, sys: 0 ns, total: 66 µs\n", - "Wall time: 68.7 µs\n" - ] - } - ], + "outputs": [], "source": [ "%%time\n", "from time import gmtime, strftime\n", "\n", + "container = sagemaker.image_uris.retrieve(\n", + " \"xgboost\", region, version=\"1.0-1\", instance_type=\"ml.m4.xlarge\"\n", + ")\n", + "\n", + "\n", "xgb = sagemaker.estimator.Estimator(\n", " container,\n", " role,\n", @@ -235,7 +317,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -271,13 +353,13 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# custom trial name\n", - "experiment_name = \"music-streaming-churn-exp\"\n", - "trial_name_xgb = \"xgboost\"" + "experiment_name = \"music-streaming-churn-exp-{}\".format(datetime.now().strftime(\"%Y%m%d-%H%M%S\"))\n", + "trial_name_xgb = \"xgboost-{}\".format(datetime.now().strftime(\"%Y%m%d-%H%M%S\"))" ] }, { @@ -332,8 +414,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "## Hyperparameter Tuning with SageMaker Hyperparameter Tuning Job\n", "\n", "Now that you understand how training one model works and how to create a SageMaker experiment, and selected the XGBoost model as the final model, you will need to fine-tune the hyperparameters for the best model performances. For a xgboost model, you can start with defining ranges for the eta, alpha, min_child_weight, and max_depth. You can check the [documentation when considering what haperparameter to tune](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-considerations.html)." 
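As a reference for what such ranges can look like, here is a minimal sketch (the bounds and the `validation:auc` objective metric are illustrative assumptions, not the notebook's exact configuration; it assumes the `xgb` estimator defined in the training section):

```python
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

# Illustrative ranges for the hyperparameters mentioned above.
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "alpha": ContinuousParameter(0, 2),
    "min_child_weight": ContinuousParameter(1, 10),
    "max_depth": IntegerParameter(1, 10),
}

tuner = HyperparameterTuner(
    xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=10,
    max_parallel_jobs=2,
)
```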
@@ -350,7 +430,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -382,7 +462,7 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -442,26 +522,17 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Stored 'tuning_job_name' (str)\n" - ] - } - ], + "outputs": [], "source": [ "# custom a tuner job name\n", - "tuning_job_name = \"ChurnPrediction-Tuning-Job\"\n", - "%store tuning_job_name" + "tuning_job_name = \"ChurnPredictTune-{}\".format(datetime.now().strftime(\"%Y%m%d-%H%M%S\"))" ] }, { "cell_type": "code", - "execution_count": 14, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -481,17 +552,9 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Create tuning job ChurnPrediction-Tuning-Job: SUCCESSFUL\n" - ] - } - ], + "outputs": [], "source": [ "from sagemaker.tuner import HyperparameterTuner\n", "\n", @@ -527,7 +590,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", "## Deploy the model with SageMaker Batch-transform\n", "\n", "You can directly deploy the best model from your hyperparameter tuning job by getting the best training job from your tuner." @@ -600,7 +662,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -614,7 +676,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -663,7 +725,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -672,7 +734,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -694,129 +756,9 @@ }, { "cell_type": "code", - "execution_count": 24, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
user_churnedpredicted_resultspredicted_binary
00.00.1246090
10.00.1246090
20.00.1996270
30.00.2618250
40.00.2510630
............
971.00.8809041
981.00.8793751
991.00.1350270
1001.00.8982261
1011.00.8862311
\n", - "

102 rows × 3 columns

\n", - "
" - ], - "text/plain": [ - " user_churned predicted_results predicted_binary\n", - "0 0.0 0.124609 0\n", - "1 0.0 0.124609 0\n", - "2 0.0 0.199627 0\n", - "3 0.0 0.261825 0\n", - "4 0.0 0.251063 0\n", - ".. ... ... ...\n", - "97 1.0 0.880904 1\n", - "98 1.0 0.879375 1\n", - "99 1.0 0.135027 0\n", - "100 1.0 0.898226 1\n", - "101 1.0 0.886231 1\n", - "\n", - "[102 rows x 3 columns]" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "test_data[\"predicted_results\"] = pd.to_numeric(results)\n", "# define a threshold to convert probability to class, you can set as 0.5 by default\n", @@ -833,20 +775,9 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Test Evaluation: \n", - "Average F1 Score: 0.8736913204998312\n", - "Precision Score: 0.9285714285714286\n", - "Recall Score: 0.7428571428571429\n" - ] - } - ], + "outputs": [], "source": [ "from sklearn import metrics\n", "\n", @@ -869,8 +800,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "## Model Explainability with SageMaker Clarify\n", "\n", "You can visualize which feature contributes most to your prediction results by using the new SageMaker feature SageMaker Clarify. It will provide SHAP values which measures the importance of a feature by replacing it with a dummy and seeing how it affects the prediciton. (In reality, SHAP is smart about the choice of dummy and also takes into account feature interactions.) For a more general overview of model interpretability, see [this post](https://towardsdatascience.com/guide-to-interpretable-machine-learning-d40e8a64b6cf). For other capabilities of SageMaker Clarify, please see the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-fairness-and-explainability.html) and the [example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/fairness_and_explainability/fairness_and_explainability.ipynb)." @@ -891,7 +820,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -904,7 +833,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -915,27 +844,16 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'predicted_binary', 'predicted_results', 'user_churned'}" - ] - }, - "execution_count": 29, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "set(test_data.columns) - set(test_set.columns)" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -980,8 +898,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\n", - "\n", "## Optional: Automate your training and model selection with SageMaker Autopilot (Console)\n", "\n", "With [SageMaker Autopilot](https://aws.amazon.com/blogs/aws/amazon-sagemaker-autopilot-fully-managed-automatic-machine-learning/), you can skip all the steps above and let it automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. Go to SageMaker Experiments List on the left navigation pane, then choose **Create Experiment**. You will be directed to the experiment creating page. 
All you need to do is do give the Experiment job a name, specify your input and output data location, specify your target variable, and choose your ML problem type (classification or regression), or leave it as auto.\n", @@ -997,7 +913,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ diff --git a/use-cases/customer_churn/README.md b/use-cases/customer_churn/README.md index 7beb65418b..9ba85789dd 100644 --- a/use-cases/customer_churn/README.md +++ b/use-cases/customer_churn/README.md @@ -56,6 +56,10 @@ As part of the solution, the following services are used: * [Amazon SageMaker Studio Notebooks](https://aws.amazon.com/sagemaker/): Used to preprocess and visualize the data, and to train model. * [Amazon SageMaker Endpoint](https://aws.amazon.com/sagemaker/): Used to deploy the trained model. +The diagram below shows how each service is used in relation to other services in different stages of this use case. +
+ +
## Cleaning Up diff --git a/use-cases/index.rst b/use-cases/index.rst index 8e2cae295e..9ecbaea207 100644 --- a/use-cases/index.rst +++ b/use-cases/index.rst @@ -4,7 +4,6 @@ Music Streaming Service: Customer Churn Detection .. toctree:: :maxdepth: 1 - customer_churn/0_cust_churn_overview_dw customer_churn/1_cust_churn_dataprep customer_churn/2_cust_churn_train_deploy_infer