diff --git a/.tests/skip_files.txt b/.tests/skip_files.txt index 468e100e..6eefdc9f 100644 --- a/.tests/skip_files.txt +++ b/.tests/skip_files.txt @@ -19,3 +19,4 @@ ../ModelOps/12_ModelOps_Model_Factory_REST_Python.ipynb ../UseCases/Data_Dictionary/Data_Dictionary_Raw.ipynb ../UseCases/Augmented_call_center_AgenticAI/Augmented_call_center_AgenticAI.ipynb +../UseCases/Opensource_Data_Science_OAF/Opensource_Data_Science_OAF.ipynb diff --git a/UseCases/Opensource_Data_Science_OAF/Opensource_Data_Science_OAF.ipynb b/UseCases/Opensource_Data_Science_OAF/Opensource_Data_Science_OAF.ipynb new file mode 100755 index 00000000..b1f26487 --- /dev/null +++ b/UseCases/Opensource_Data_Science_OAF/Opensource_Data_Science_OAF.ipynb @@ -0,0 +1,1091 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "hawaiian-daniel", + "metadata": {}, + "source": [ + "\n", + "\n", + "
\n", + "

\n", + " Leveraging Open Source Machine Learning with ClearScape Analytics and Open Analytics Framework\n", + "
\n", + " \"Teradata\"\n", + "

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "a0d65f99", + "metadata": {}, + "source": [ + " \n", + " \n", + "
\n", + "
\n", + " ⚠️\n", + "
\n", + " This demo requires Open Analytics Framework
\n", + " You need to have Open Analytics Framework enabled for this environment. If you have not done it already, go back to ClearScape Analytics Experience dashboard to request access.\n", + "
\n", + "
\n", + " \n", + " Learn more\n", + " \"New\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "c5a2465a", + "metadata": {}, + "source": [ + "

Open-source Machine Learning, AI, and Advanced Analytics tools, techniques, and resources offer enterprises limitless opportunities to drive new insights and business value from their internal and external data landscape. Unfortunately, with these opportunities come significant challenges to realizing success. Some of these challenges include:

\n", + "\n", + " \n", + " \n", + " \n", + "

VantageCloud Lake Edition Open Analytics Framework is the only enterprise-class platform that addresses these challenges with a simple, powerful architecture. The following demonstration illustrates how users can take any open-source tool or package of their choice, deploy it to a custom, isolated environment, and then execute it in parallel at massive scale.

\n", + "\n", + "
\n", + "\n", + "Environment Overview\n", + "\n", + "

This demonstration uses a VantageCloud Lake Analytic Cluster architecture and the shared data sets created in the previous demonstration, specifically the \"Txn_History\" data, which represents \"CashApp\"-style transaction history stored in the Vantage Object File System (OFS).

\n", + "\n", + "

The high level process is as follows:

\n", + "\n", + "\n", + " \n", + "
\n", + "
    \n", + "
  1. The Data Scientist conducts analytics activities using his or her own python tools and packages of choice, then connects to VantageCloud Lake through teradataml client library and teradatasql python driver.
  2. \n", + "
    \n", + "
  3. Teradataml provides APIs to create and manage artifacts in User Environment Service, including custom libraries, dependencies, model artifacts, and scoring scripts. The user can leverage these APIs to create one or many custom, dedicated environments to host their code.
  4. \n", + "
    \n", + "
  5. The Data Scientist will then execute their pipeline, which will:\n", + "
    • Call ClearScape Analytics functions on Compute Clusters (data prep, transformation, etc.)
    • \n", + "
    • Prepared data is passed to the python container running in parallel on cluster nodes.
    • \n", + "
    • Results (inference/predictions) are returned as \"virtual\" DataFrames, with the data remaining in Vantage
    • \n", + "
    • Data can be persisted in the Object Filesystem, written to open object storage, or copied to the client
    • \n", + "
  6. \n", + "
\n", + "
\n", + "\n", + "Demonstration Overview\n", + "\n", + "

This notebook consists of three primary demonstrations and an appendix:

\n", + "
    \n", + "
  1. Custom Environment Management - Create a server-side, custom python container with explicit package and versions installed
  2. \n", + "
  3. File Management - Upload model files, scoring scripts, and any other asset type
  4. \n", + "
  5. Analytics - Execute powerful feature engineering and statistical functions and pass this directly to the python container running in parallel
  6. \n", + "
  7. Appendix - Model Training and Testing - The process for creating and testing the model using open-source tools is provided in the Appendix
  8. \n", + "
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "transsexual-poverty", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Python Package Imports

\n", + "\n", + "

It is standard practice to import the required packages and libraries first. Execute the cells below to install and import the packages for Teradata automation as well as the machine learning, analytics, utility, and data management packages.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "southeast-density", + "metadata": {}, + "outputs": [], + "source": [ + "# install other required packages\n", + "%pip install xgboost" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "great-shadow", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Import the Python library teradataml and the specific environment setup modules.\n", + "#\n", + "import warnings\n", + "from teradataml import *\n", + "from db_utils import *\n", + "warnings.filterwarnings('ignore')\n", + "display.suppress_vantage_runtime_warnings = True\n", + "\n", + "from IPython.display import display as ipydisplay\n", + "from IPython.display import clear_output \n", + "\n", + "from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay\n", + "import matplotlib.pyplot as plt\n", + "#\n", + "# Account for the data types to be used with the script.\n", + "#\n", + "from teradatasqlalchemy.types import BIGINT, VARCHAR, FLOAT, INTEGER\n", + "from collections import OrderedDict\n", + "#\n", + "# Other case-specific imports.\n", + "#\n", + "import json, os, sys, getpass\n", + "import pandas as pd\n", + "from time import sleep\n", + "\n", + "# container name - set here for easier notebook navigation\n", + "### User will also be asked to change it ###\n", + "oaf_name = 'OAF_demo_env'\n", + "###########################\n", + "print(f'using \"{oaf_name}\" for the OAF environment')\n", + "\n", + "# get the current python version to match deploy a custom container\n", + "python_version = str(sys.version_info[0]) + '.' + str(sys.version_info[1])\n", + "print(f'Using Python version {python_version} for user environment')" + ] + }, + { + "cell_type": "markdown", + "id": "muslim-intention", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Connect to Vantage

\n", + "\n", + "

Before performing any operations in Vantage, we need to connect to the system. The code below reads a variables file (vars.json, which was used in prior environment setup and data engineering examples) and connects to Vantage with this information. The Vantage connection is referred to as a \"Context\", a common Python-RDBMS connection pattern.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "pretty-forge", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# load vars json\n", + "with open('../../vars.json', 'r') as f:\n", + " session_vars = json.load(f)\n", + "\n", + "# Create the SQLAlchemy Context\n", + "host = session_vars['environment']['host']\n", + "username = session_vars['hierarchy']['users']['business_users'][1]['username']\n", + "password = session_vars['hierarchy']['users']['business_users'][1]['password']\n", + "\n", + "# UES Authentication information\n", + "ues_url = session_vars['environment']['UES_URI']\n", + "configure.ues_url = ues_url\n", + "pat_token = session_vars['hierarchy']['users']['business_users'][1]['pat_token']\n", + "pem_file = session_vars['hierarchy']['users']['business_users'][1]['key_file']\n", + "\n", + "compute_group = session_vars['hierarchy']['users']['business_users'][1]['compute_group']\n", + "\n", + "# check for existing connection\n", + "eng = check_and_connect(host=host, username=username, password=password, compute_group = compute_group)\n", + "print(eng)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e3507baa-0c76-488a-b6af-fb704b0c6542", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# check cluster status\n", + "res = check_cluster_start(compute_group = compute_group)" + ] + }, + { + "cell_type": "markdown", + "id": "offshore-watch", + "metadata": {}, + "source": [ + "
\n", + "

Demo 1 - Custom Container Management

\n", + "\n", + "\n", + "\n", + "

The Teradata Vantage Python Client Library provides simple, powerful methods for creating and maintaining custom Python runtime environments in the VantageCloud environment. This gives practitioners complete control over the behavior and quality of their models and the accuracy of their analytics running on the Analytic Cluster. The following demonstration shows how to create a custom xgboost-based scoring environment.

\n", + "\n", + "\n", + "\n", + "

Custom environments are persistent. Users only need to create them once; they can then be saved, updated, or modified as needed.

\n", + "\n", + "
\n", + "

Container Management Process

\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
    \n", + "
  • Set up a connection to the Environment Service
  • \n", + "
    \n", + "
  • Create a unique User Environment based on available base images
  • \n", + "
    \n", + "
  • Install custom libraries and specific versions if required
  • \n", + "
    \n", + "
  • Monitor packages installation/view installed packages
  • \n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "bridal-matrix", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Connect to the Environment Service

\n", + "\n", + "

To better support integration with Cloud Services and common automation tools, the User Environment Service is accessed via RESTful APIs. These APIs can be called directly or, as in the examples below, through the methods of the Python Package for Teradata (teradataml).

\n", + "\n", + "

To authenticate to the UES infrastructure, the user must use the same user name that is used to connect to the database. The following cell first checks for an existing valid UES token; if none is found, it authenticates using the personal access token (PAT) and key file read from the variables file.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "seasonal-jonathan", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# check to see if there is a valid UES auth\n", + "# if not, authenticate\n", + "try:\n", + " demo_env = get_env(oaf_name)\n", + " print('Existing valid UES token')\n", + "\n", + "except Exception as e:\n", + " if '''NoneType' object has no attribute 'value''' in str(e) or '''Failed to execute get_env''' in str(e):\n", + " if set_auth_token(ues_url = ues_url, username = username, pat_token = pat_token, pem_file = pem_file):\n", + " print('UES Authentication successful')\n", + " else:\n", + " print('UES Authentication failed, check URL and account info')\n", + " pass\n", + " else:\n", + " raise\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "eligible-newfoundland", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Create a Custom Container in Vantage

\n", + "\n", + "

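If desired, the user can create a new custom environment by starting with a \"base\" image and customizing it. The following cells list the available base Python environments and then create a new user environment, or obtain a reference to an existing one with the same name.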

\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "consistent-component", + "metadata": {}, + "outputs": [], + "source": [ + "# List available Base Python environments\n", + "\n", + "ipydisplay(list_base_envs())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "charming-geology", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Create a new environment, or connect to an existing one\n", + "\n", + "try:\n", + " ipydisplay(list_user_envs())\n", + "except Exception as e:\n", + " \n", + " if str(e).find('No user environments found') > 0:\n", + " print('No user environments found')\n", + " pass\n", + " else:\n", + " raise\n", + "\n", + "print('Use an existing environment, or create a new one:')\n", + "print(f'OAF Environment is set to {oaf_name}.')\n", + "print('Enter to accept, or input a new value.')\n", + "print('If the environment is not in the list, a new one will be created')\n", + "i = input()\n", + "if len(i) != 0:\n", + " oaf_name = i\n", + " print(f'OAF Environment is now {oaf_name}')\n", + "\n", + "try:\n", + " demo_env = create_env(env_name = oaf_name,\n", + " base_env = f'python_{python_version}',\n", + " desc = 'OAF Demo environment')\n", + "except Exception as e:\n", + " if str(e).find('same name already exists') > 0:\n", + " print('Environment already exists, obtaining a reference to it')\n", + " demo_env = get_env(oaf_name)\n", + " pass\n", + " elif 'Invalid value for base environment name' in str(e):\n", + " print('Unsupported base environment version, using defaults')\n", + " demo_env = create_env(env_name = oaf_name,\n", + " desc = 'OAF Demo environment')\n", + " else:\n", + " raise\n", + "\n", + "# Note create_env seems to be asynchronous - sleep a bit for it to register\n", + "sleep(5)\n", + "\n", + "try:\n", + " ipydisplay(list_user_envs())\n", + "except Exception as e:\n", + " if str(e).find('No user environments found') > 0:\n", + " print('No user environments found')\n", 
+ " pass\n", + " else:\n", + " raise" + ] + }, + { + "cell_type": "markdown", + "id": "breeding-shame", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Install Dependencies

\n", + "\n", + "

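The second step in the customization process is to install Python package dependencies. The following cells list the libraries already present in the user environment, install the required packages, and monitor the installation until it completes.\n", + "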
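Monitoring an asynchronous installation is one part of this step. It can be sketched as a polling loop with a timeout, so that the notebook does not wait forever if the stage never changes. This is a minimal sketch, not part of teradataml: `wait_for_stage` and its parameters are hypothetical names, and the only stage value taken from this notebook is 'Started'.

```python
import time
from typing import Callable, Sequence


def wait_for_stage(
    get_stage: Callable[[], str],
    pending: Sequence[str] = ("Started",),
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_stage() until it returns a value not listed in `pending`.

    Hypothetical helper: `get_stage` is any zero-argument callable that
    returns the current stage as a string. Raises TimeoutError if the
    stage is still pending once `timeout_s` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout_s
    stage = get_stage()
    while stage in pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"stage still {stage!r} after {timeout_s} s")
        sleep(interval_s)
        stage = get_stage()
    return stage
```

With the objects used in this notebook, the callable could be `lambda: demo_env.status(claim_id)['Stage'].iloc[-1]`, which is the same expression the status cell evaluates.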

\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "plain-psychology", + "metadata": {}, + "outputs": [], + "source": [ + "# View existing libraries in the user environment.\n", + "demo_env.libs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "premier-agenda", + "metadata": {}, + "outputs": [], + "source": [ + "# Install any Python add-ons needed by the script in the user environment\n", + "# Using option asynchronous=True for an asychronous execution of the statement.\n", + "# Note: Avoid asynchronous installation when batch-executing all notebook statements,\n", + "# as execution will continue even without installation being complete.\n", + "#\n", + "claim_id = demo_env.install_lib(['numpy','pandas','scikit-learn', 'xgboost==1.6.2'], asynchronous=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "blond-reliance", + "metadata": {}, + "outputs": [], + "source": [ + "# Check the status of installation using status() API.\n", + "# Create a loop here for demo purposes\n", + "\n", + "ipydisplay(demo_env.status(claim_id))\n", + "stage = demo_env.status(claim_id)['Stage'].iloc[-1]\n", + "while stage == 'Started':\n", + " stage = demo_env.status(claim_id)['Stage'].iloc[-1]\n", + " clear_output()\n", + " ipydisplay(demo_env.status(claim_id))\n", + " sleep(5)\n", + " \n", + "# Verify the Python libraries have been installed correctly.\n", + "ipydisplay(demo_env.libs)" + ] + }, + { + "cell_type": "markdown", + "id": "innovative-monster", + "metadata": {}, + "source": [ + "
\n", + "

Demo 2 - Install Custom Models and Scripts

\n", + "\n", + "

Once the custom runtime environment has been created, the user can load custom user-created assets. For the purposes of this demonstration, we will load two files:

\n", + "\n", + "
    \n", + "
  1. 'xgb_model' - This is a simple XGBoost Classifier model that was trained on the \"Financial Fraud\" data in the OFS table. It has an accuracy score of approximately 97.4%. The Appendix provides the code used to train, test, and save this model file.
  2. \n", + "
    \n", + "
  3. 'Demo_XGB_Scoring.py' - This file is a simple Python program that acts as the bridge between EDW processing on the Analytic Cluster and the XGBoost model scoring. It formats the incoming data, calls the model, and outputs the model predictions. When executed on the individual parallel Analytic Cluster nodes, it will use the XGBoost model file to score its portion of the data.
  4. \n", + "
\n", + " \n", + "

Once again, the Vantage Python Library makes this process straightforward by calling two simple methods:

\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
    \n", + "
  • \"install_file\" for each of the two assets
  • \n", + "
    \n", + "
  • Verification using the \"files\" property
  • \n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "configured-skiing", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Install User Files in the Cluster Container

\n", + "\n", + "

Users can load any asset to the environment using the install_file method. This ensures that only authenticated users can install specific files into a dedicated filesystem, and helps prevent malicious code injection. Users pass the file name, and whether to replace an existing file.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "large-luther", + "metadata": {}, + "outputs": [], + "source": [ + "# Install xgboost model file.\n", + "#\n", + "demo_env.install_file('xgb_model', replace = True)\n", + "\n", + "# Install the desired Python script into the environment.\n", + "demo_env.install_file('Demo_XGB_Scoring.py', replace = True)" + ] + }, + { + "cell_type": "markdown", + "id": "minimal-transport", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

List all installed files

\n", + "\n", + "

The files property lists each asset with its size and last-updated timestamp. As above, these methods are available to manage the container remotely, since these containers live in the Vantage environment.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "running-tribute", + "metadata": {}, + "outputs": [], + "source": [ + "# Verify the files have been installed correctly.\n", + "demo_env.files" + ] + }, + { + "cell_type": "markdown", + "id": "responsible-switzerland", + "metadata": {}, + "source": [ + "
\n", + "

Demo 3 - Model Scoring at Scale

\n", + "\n", + "

VantageCloud Lake Edition Analytic Clusters combine the power and scale of native ClearScape Analytics functions with open and flexible runtime environments, offering users the flexibility to balance built-in data prep, transformation, and feature engineering functions with custom code and models at massive scale.

\n", + "\n", + "

Enterprise-class customers report the ability to reduce data prep and model scoring times from several hours per run to seconds, effectively allowing model scoring in near-real-time.

\n", + "\n", + "

This demonstration will illustrate these key concepts:

\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
    \n", + "
  • Leverage native data preparation functions to process incoming data for the model scoring
  • \n", + "
    \n", + "
  • Execute the combined native query and the python scoring functions together, in parallel
  • \n", + "
    \n", + "
  • Analyze the results of the process to determine ongoing model accuracy and efficacy
  • \n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "involved-assist", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Data Transformation/Feature Engineering

\n", + "\n", + "

Create a reference to the data set in Vantage, and apply powerful transformation functions directly on the data. ClearScape Analytics is a suite of in-database massively-parallel-processing functions for statistical analysis, data cleaning and transformation, machine learning, text analytics, and model scoring. Practitioners can leverage these functions together with open-source modeling as illustrated here, or create powerful, native end-to-end pipelines using just these functions.

\n", + "\n", + "" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "material-personality", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Create a reference to the data set in-Vantage\n", + "# by creating a \"Teradata DataFrame\"\n", + "# which is a reference to the data.\n", + "\n", + "\n", + "tdf_test = DataFrame('\"demo_ofs\".\"txn_history\"')\n", + "\n", + "# Only retrieve a small subset of rows to verify the connection\n", + "tdf_test.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "signal-induction", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Engineer Features

\n", + "\n", + "

Call the ClearScape One Hot Encoding function to transform the categorical column into numeric features.
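To make the transformation concrete, the encoding logic can be illustrated on the client in plain Python. This is only a sketch of what one-hot encoding with an \"other\" bucket produces for a single value, not how the in-database function is implemented; `one_hot` is a hypothetical helper, while the category list and the `txn_type_*` column names follow the ones used elsewhere in this notebook.

```python
from typing import Dict, Sequence

# Category values passed to OneHotEncodingFit in this notebook.
CATEGORIES = ["CASH_OUT", "CASH_IN", "TRANSFER", "DEBIT", "PAYMENT"]


def one_hot(
    value: str,
    column: str = "txn_type",
    categories: Sequence[str] = CATEGORIES,
    other: str = "other",
) -> Dict[str, int]:
    """Return one indicator per known category plus an 'other' indicator.

    Hypothetical client-side illustration: exactly one indicator is 1.
    """
    encoded = {f"{column}_{c}": int(value == c) for c in categories}
    # Any value outside the known categories falls into the 'other' column.
    encoded[f"{column}_{other}"] = int(value not in categories)
    return encoded


row = one_hot("TRANSFER")
print(row)
```

A value that is not in the list, such as a hypothetical \"REFUND\", would set only `txn_type_other` to 1.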

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "imposed-match", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Perform native one-hot encoding on the data\n", + "# These functions use a \"fit-and-transform\" pattern\n", + "# that supports reuse and easier operationalization of the transformation process\n", + "\n", + "from teradataml import OneHotEncodingFit, OneHotEncodingTransform\n", + "\n", + "res_ohe = OneHotEncodingFit(data = tdf_test, \n", + " target_column = 'txn_type', \n", + " categorical_values = ['CASH_OUT', 'CASH_IN', 'TRANSFER', 'DEBIT', 'PAYMENT'], \n", + " other_column = 'other',\n", + " is_input_dense = True)\n", + "\n", + "res_transformed = OneHotEncodingTransform(data = tdf_test, object = res_ohe.result, is_input_dense = True)\n", + "res_transformed.result.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "collectible-gather", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Execute the Scoring function

\n", + "\n", + "

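Now that the categorical column has been encoded, the XGBoost model can be called. This is done via the Apply class, to which we pass the input data (with the unneeded columns dropped), the command to run, the names and types of the returned columns, and the target environment.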

\n", + "\n", + "\n", + " \n", + "\n", + "

Finally, the script is executed by calling the \"execute_script\" method; this \"lazy\" evaluation allows for a more modular and performant architecture.
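The scoring script named in the apply command is not shown in this notebook. As a rough sketch of the pattern such a script follows (read delimited rows, score each one, write delimited rows), the following assumes a comma delimiter, a row layout of id first, label last, and features in between, and a stand-in probability function instead of the real XGBoost model. `score_rows` and every one of those layout choices are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Sequence

DELIMITER = ","  # assumption: must match the delimiter used for the Apply call


def score_rows(
    lines: Iterable[str],
    predict_proba: Callable[[Sequence[float]], float],
    delimiter: str = DELIMITER,
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield one output row per input row.

    Assumed input layout (hypothetical): txn_id, feature values..., actual.
    Output layout mirrors the `returns` columns in this notebook:
    txn_id, Prob_0, Prob_1, Prediction, Actual.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        txn_id, actual = fields[0], fields[-1]
        features = [float(v) for v in fields[1:-1]]
        prob_1 = predict_proba(features)  # stand-in for the real model call
        prediction = int(prob_1 >= threshold)
        yield delimiter.join(
            [txn_id, f"{1.0 - prob_1:.6f}", f"{prob_1:.6f}", str(prediction), actual]
        )
```

In an actual script, the generator would be driven from standard input, for example `for row in score_rows(sys.stdin, model_fn): print(row)`, with `model_fn` wrapping a model loaded once from the installed 'xgb_model' file.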

\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "unlimited-liver", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "\n", + "apply_obj = Apply(data = res_transformed.result.drop(['step', 'nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1),\n", + " apply_command = 'python3 Demo_XGB_Scoring.py',\n", + " returns = {'txn_id': VARCHAR(20), 'Prob_0': VARCHAR(30), \n", + " 'Prob_1': VARCHAR(30), 'Prediction':VARCHAR(2),\n", + " 'Actual': VARCHAR(2)},\n", + " env_name = demo_env,\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "opening-manner", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Execute the Python script inside the remote user environment.\n", + "# The result is a teradataml DataFrame. \n", + "#\n", + "\n", + "\n", + "scored_data = apply_obj.execute_script()\n", + "\n", + "# Only return five rows - minimize network overhead\n", + "scored_data.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "chief-falls", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Analyze the Results

\n", + "\n", + "

It is common practice to measure the efficacy of a model. For this demonstration, a \"Confusion Matrix\" is generated that shows the quantity of true vs. false positives and negatives for the model.
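For readers who want to see what these metrics compute, here is a small pure-Python sketch on made-up values (the six labels and probabilities below are illustrative, not results from this demonstration). It also shows that an AUC computed from hard 0/1 predictions generally differs from one computed from the positive-class probability, which is what ROC AUC is normally defined on. `confusion_counts` and `auc` are hypothetical helpers, and the pairwise AUC is only practical for small samples.

```python
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 1 and p == 0:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def auc(actual: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: the fraction of (positive, negative) pairs in which the
    positive example has the higher score, counting ties as one half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only.
actual = [0, 0, 1, 1, 1, 0]
prob_1 = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]
predicted = [int(p >= 0.5) for p in prob_1]

print(confusion_counts(actual, predicted))
print("AUC from probabilities:", auc(actual, prob_1))
print("AUC from hard labels:  ", auc(actual, predicted))
```

With the scored data in this notebook, the probability-based variant would use the 'Prob_1' column cast to float in place of 'Prediction'.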

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "distinguished-motor", + "metadata": {}, + "outputs": [], + "source": [ + "# Copy the predictions to the client\n", + "# to generate the simple Confusion Matrix\n", + "# and print the AUC (Area Under Curve)\n", + "\n", + "df_test = scored_data.to_pandas(all_rows = True)\n", + "cm = confusion_matrix(df_test['Actual'].astype(int), df_test['Prediction'].astype(int))\n", + "disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['0', '1'])\n", + "fig, ax = plt.subplots(figsize=(10,10))\n", + "disp.plot(ax=ax)\n", + "\n", + "plt.show()\n", + "\n", + "#Get AUC score - anything over .75 is decent\n", + "AUC = roc_auc_score(df_test['Actual'].astype(int), df_test['Prediction'].astype(int))\n", + "print(f'AUC: {AUC}')" + ] + }, + { + "cell_type": "markdown", + "id": "conceptual-crash", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Disconnect from Vantage

\n", + "\n", + "

Once complete, one can remove the custom environment (if desired) and close the \"context\" to the Vantage system.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e43065f2-19c8-4815-9f3d-3e638325070d", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# check cluster status\n", + "res = check_cluster_stop(compute_group = compute_group)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "tired-purple", + "metadata": {}, + "outputs": [], + "source": [ + "# uninstall the libraries from the environment first before removing it\n", + "demo_env.uninstall_lib(libs = demo_env.libs['name'].to_list())\n", + "remove_env(demo_env.env_name)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fiscal-animal", + "metadata": {}, + "outputs": [], + "source": [ + "remove_context()" + ] + }, + { + "cell_type": "markdown", + "id": "material-groove", + "metadata": {}, + "source": [ + "
\n", + "

Appendix - Model Training and Evaluation

\n", + "\n", + "

VantageCloud Lake Edition Analytic Clusters and ClearScape Analytics functions can also be leveraged for model training. This brief addendum shows an abbreviated process for developing and testing an open-source fraud detection model with Vantage and XGBoost.

" + ] + }, + { + "cell_type": "markdown", + "id": "abroad-underground", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Connect to Vantage

\n", + "\n", + "

If necessary, connect to Vantage; if the context from above is still valid, this cell does not need to be run. The code below reads a variables file (vars.json, which was used in prior environment setup and data engineering examples) and connects to Vantage with this information. The Vantage connection is referred to as a \"Context\", a common Python-RDBMS connection pattern.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "contemporary-rouge", + "metadata": {}, + "outputs": [], + "source": [ + "# load vars json\n", + "with open('vars.json', 'r') as f:\n", + " session_vars = json.load(f)\n", + "\n", + "# Create the SQLAlchemy Context\n", + "host = session_vars['environment']['host']\n", + "username = session_vars['hierarchy']['users']['business_users'][1]['username']\n", + "password = session_vars['hierarchy']['users']['business_users'][1]['password']\n", + "\n", + "# UES Authentication information\n", + "ues_url = session_vars['environment']['UES_URI']\n", + "configure.ues_url = ues_url\n", + "pat_token = session_vars['hierarchy']['users']['business_users'][1]['pat_token']\n", + "pem_file = session_vars['hierarchy']['users']['business_users'][1]['key_file']\n", + "\n", + "compute_group = session_vars['hierarchy']['users']['business_users'][1]['compute_group']\n", + "\n", + "# check for existing connection\n", + "eng = check_and_connect(host=host, username=username, password=password, compute_group = compute_group)\n", + "print(eng)" + ] + }, + { + "cell_type": "markdown", + "id": "modified-services", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Get a reference to the data

\n", + "\n", + "

Create a Teradataml DataFrame which references the data set in Vantage. This could be a table stored in direct-attach block storage, Performance-Optimized Object Storage (OFS), or stored in an open format in any Object Store.

\n", + "\n", + "

Teradataml DataFrames do not copy data into local memory, so complex analytic and transformation operations can run against data at any scale, while leveraging the parallel processing and workload isolation of Vantage Analytic Clusters.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "american-centre", + "metadata": {}, + "outputs": [], + "source": [ + "# Updated variables to insure they are the same\n", + "tdf_test = DataFrame('\"demo_ofs\".\"txn_history\"')\n", + "tdf_test.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "terminal-network", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Engineer Features

\n", + "\n", + "

Call the ClearScape One Hot Encoding function to transform the categorical column into numeric features.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "higher-courage", + "metadata": {}, + "outputs": [], + "source": [ + "from teradataml import OneHotEncodingFit, OneHotEncodingTransform\n", + "\n", + "res_ohe = OneHotEncodingFit(data = tdf_test, \n", + " target_column = 'txn_type', \n", + " categorical_values = ['CASH_OUT', 'CASH_IN', 'TRANSFER', 'DEBIT', 'PAYMENT'], \n", + " other_column = 'other',\n", + " is_input_dense = True)\n", + "\n", + "res_transformed = OneHotEncodingTransform(data = tdf_test, object = res_ohe.result, is_input_dense = True)\n", + "res_transformed.result.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "billion-drawing", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Design for Operations

\n", + "\n", + "

Persist the \"Fit\" table to reuse it for the Operational transformation of new data

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "meaning-trading", + "metadata": {}, + "outputs": [], + "source": [ + "# copy the fit table to a permanent table for use later\n", + "res = copy_to_sql(res_ohe.result, table_name = 'OHE_FIT_TABLE', schema_name = 'demo_ofs', if_exists = 'replace')" + ] + }, + { + "cell_type": "markdown", + "id": "cognitive-dream", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Test/Train Split

\n", + "\n", + "

The in-database \"Sample\" function can split the data into multiple data sets in seconds.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ignored-scholar", + "metadata": {}, + "outputs": [], + "source": [ + "tdf_samples = res_transformed.result.sample(frac = [0.2, 0.8])\n", + "copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 2], table_name = 'txns_train', schema_name = 'demo_ofs', if_exists = 'replace')\n", + "copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 1], table_name = 'txns_test', schema_name = 'demo_ofs', if_exists = 'replace')" + ] + }, + { + "cell_type": "markdown", + "id": "major-nudist", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Train the Model

\n", + "\n", + "

Use the open-source XGBoost Classifier to train the model on the \"training\" data split above.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "demanding-bouquet", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a Pandas DataFrame\n", + "df_train = DataFrame('\"demo_ofs\".\"txns_train\"').to_pandas(all_rows = True)\n", + "\n", + "# define the input columns and target variable:\n", + "X_train = df_train[['txn_type_CASH_OUT', 'txn_type_CASH_IN', 'txn_type_TRANSFER',\n", + " 'txn_type_DEBIT', 'txn_type_PAYMENT', 'txn_type_other', 'amount','oldbalanceOrig', 'newbalanceOrig',\n", + " 'oldbalanceDest', 'newbalanceDest']]\n", + "y_train = df_train[['isFraud']]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "strong-lottery", + "metadata": {}, + "outputs": [], + "source": [ + "# Fit the Model\n", + "warnings.filterwarnings('ignore')\n", + "from xgboost import XGBClassifier\n", + "\n", + "model = XGBClassifier()\n", + "model.fit(X_train, y_train)" + ] + }, + { + "cell_type": "markdown", + "id": "atmospheric-occasions", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Test the Model

\n", + "\n", + "

It is common practice to measure how well a model performs on held-out data. For this demonstration, a \"Confusion Matrix\" shows the counts of true and false positives and negatives on the \"test\" split, and an AUC score summarizes how well the model separates fraud from non-fraud.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "australian-religion", + "metadata": {}, + "outputs": [], + "source": [ + "# Return a Pandas DataFrame from the split data above\n", + "\n", + "df_test = DataFrame('\"demo_ofs\".\"txns_test\"').to_pandas(all_rows = True)\n", + "\n", + "# Define the input columns and target\n", + "X_test = df_test[['txn_type_CASH_OUT', 'txn_type_CASH_IN', 'txn_type_TRANSFER',\n", + " 'txn_type_DEBIT', 'txn_type_PAYMENT', 'txn_type_other', 'amount','oldbalanceOrig', 'newbalanceOrig',\n", + " 'oldbalanceDest', 'newbalanceDest']]\n", + "y_test = df_test[['isFraud']]\n", + "\n", + "\n", + "# Predict the class and the probability of Fraud\n", + "y_pred = model.predict(X_test)\n", + "y_prob = model.predict_proba(X_test)\n", + "\n", + "\n", + "# Generate the Confusion Matrix\n", + "df_test[['prob_0', 'prob_1']] = y_prob\n", + "df_test['prediction'] = y_pred\n", + "\n", + "cm = confusion_matrix(df_test['isFraud'], df_test['prediction'])\n", + "disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['0', '1'])\n", + "fig, ax = plt.subplots(figsize=(10,10))\n", + "disp.plot(ax=ax)\n", + "\n", + "plt.show()\n", + "\n", + "# Get the AUC score from the predicted fraud probability - anything over .75 is decent\n", + "AUC = roc_auc_score(df_test['isFraud'], df_test['prob_1'])\n", + "print(f'AUC: {AUC}')" + ] + }, + { + "cell_type": "markdown", + "id": "proper-friendship", + "metadata": {}, + "source": [ + "
\n", + "\n", + "

Save the Model

\n", + "\n", + "

Save the model file in native XGBoost format. The main demonstration above uses this file.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "assured-progressive", + "metadata": {}, + "outputs": [], + "source": [ + "model.save_model('xgb_model')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "formed-sheet", + "metadata": {}, + "outputs": [], + "source": [ + "remove_context()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "changed-certification", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "3.10.0", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.0" + }, + "toc-autonumbering": false, + "toc-showmarkdowntxt": true + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/UseCases/Opensource_Data_Science_OAF/images/Container_Layout.png b/UseCases/Opensource_Data_Science_OAF/images/Container_Layout.png new file mode 100755 index 00000000..79fac5d8 Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/Container_Layout.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/In_DB_Functions.png b/UseCases/Opensource_Data_Science_OAF/images/In_DB_Functions.png new file mode 100755 index 00000000..7445ea5f Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/In_DB_Functions.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/ML_Step1.png b/UseCases/Opensource_Data_Science_OAF/images/ML_Step1.png new file mode 100755 index 00000000..8266119f Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/ML_Step1.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/Model.png b/UseCases/Opensource_Data_Science_OAF/images/Model.png new file mode 100755 index 00000000..228bf77b Binary files 
/dev/null and b/UseCases/Opensource_Data_Science_OAF/images/Model.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/OAF_Env.png b/UseCases/Opensource_Data_Science_OAF/images/OAF_Env.png new file mode 100755 index 00000000..1be627c3 Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/OAF_Env.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/OAF_Overview.png b/UseCases/Opensource_Data_Science_OAF/images/OAF_Overview.png new file mode 100755 index 00000000..73b29048 Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/OAF_Overview.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/OAF_Scoring.png b/UseCases/Opensource_Data_Science_OAF/images/OAF_Scoring.png new file mode 100755 index 00000000..239be028 Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/OAF_Scoring.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/Overview.png b/UseCases/Opensource_Data_Science_OAF/images/Overview.png new file mode 100755 index 00000000..0ca2cc23 Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/Overview.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/TeradataLogo.png b/UseCases/Opensource_Data_Science_OAF/images/TeradataLogo.png new file mode 100644 index 00000000..a6811164 Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/TeradataLogo.png differ diff --git a/UseCases/Opensource_Data_Science_OAF/images/new-tab-icon.png b/UseCases/Opensource_Data_Science_OAF/images/new-tab-icon.png new file mode 100644 index 00000000..34b83204 Binary files /dev/null and b/UseCases/Opensource_Data_Science_OAF/images/new-tab-icon.png differ