diff --git a/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/Employee_Feedback.ipynb b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/Employee_Feedback.ipynb new file mode 100644 index 00000000..0d16f04b --- /dev/null +++ b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/Employee_Feedback.ipynb @@ -0,0 +1,1215 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "79e59c89-bcf1-4cd4-8cbb-ef02e9a2c57e", + "metadata": {}, + "source": [ + "
\n", + "

\n", + " Employee Feedback and Insights Platform\n", + "
\n", + " \"Teradata\"\n", + "

\n", + "
\n", + "\n", + "

Introduction:

\n", + "\n", + "\n", + "

\n", + " In this notebook, we will demonstrate how HR teams can analyze employee feedback at scale using advanced text analytics with teradatagenai.\n", + "

\n", + "The goal is to build an end-to-end pipeline that:\n", + " \n", + "

\n", + " \n", + " \n", + "

\n", + " The teradatagenai Python library enables data scientists, analysts, and developers to run analytics on their unstructured data directly within Teradata VantageCloud. It's built-in support for open-source Hugging Face models through Teradata's Bring Your Own Large Language Model (BYOLLM) capability and cloud service provider or by using In-DB TextAnalytics AI functions to access models provided by AWS, Azure, and GCP.\n", + "\n", + "

\n", + " \"teradatagenai\n", + "
\n", + "\n", + "

Business Value:

\n", + "\n", + "

\n", + " Organizations handle massive volumes of unstructured text including emails, voice call transcripts, customer reviews, contracts and more. Traditional approaches to analyze this data often involve costly data transfers, building custom ML pipelines, and extended turnaround times. teradatagenai addresses these challenges by bringing domain specific language models LLMs and hosted LLMs closer to your data.\n", + "

\n", + "

\n", + " With built-in support for GPU acceleration and seamless integration with VantageCloud, the library offers simple function calls that abstract complex APIs, enabling secure, scalable, and performant text processing. Whether you're deploying open source models in-database or calling hosted LLMs like Amazon Bedrock, teradatagenai provides the flexibility to align with your organization's security, cost, and performance needs.\n", + "

\n", + "\n", + "

\n", + " The TextAnalyticsAI module within the library provides over 11 built-in generative AI functions for powerful in-database NLP capabilities:\n", + "

\n", + "\n", + "\n", + "

" + ] + }, + { + "cell_type": "markdown", + "id": "cb9774d8-b6b3-4206-b13b-5e5b41314034", + "metadata": {}, + "source": [ + "

How to Get Access to Run This Demo in VantageCloud

\n", + "\n", + "

\n", + "Gain free access to Teradata’s Open Analytics Framework, which includes support for BYO-LLM capabilities and GPU compute clusters. This enables you to run open-source Hugging Face models directly within your VantageCloud environment

\n", + "

\n", + "To request the access required for this demo, send an email to Support.ClearScapeAnalytics@Teradata.com and include the Host name of the environment you are requsting access from. This can be found on the ClearScape Analytics Dashboard in the section Connection Details for Vantage Database. Our team will provision your connection with the required permissions for BYO-LLM and GPU-accelerated demos.\n", + "

\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f289f49c-7d93-4945-ad2f-69ef7c3f5510", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "!pip install -r requirements.txt --quiet" + ] + }, + { + "cell_type": "markdown", + "id": "c559d02a-c28c-4532-92ec-98abb8c7ea7f", + "metadata": {}, + "source": [ + "
\n", + "

Please restart the kernel after executing the above cell so that the installed/updated libraries are loaded into memory for this kernel. The simplest way to restart the kernel is to press zero twice (0 0) and then click Restart.

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "e2e5bfb5-a2ee-48ab-95c4-409807388622", + "metadata": {}, + "source": [ + "
\n", + "

1. Configure the environment

\n", + "

\n", + "Before we start working with our data, we need to set up our environment. This involves importing the necessary packages and establishing a connection to Vantage.\n", + "
\n", + "Here's how we can do this:

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2cc7e00-0174-4c11-a61a-e0e644bf276d", + "metadata": {}, + "outputs": [], + "source": [ + "# Importing required packages\n", + "import sys\n", + "from teradatagenai import TeradataAI, TextAnalyticsAI, load_data\n", + "from teradataml import *\n", + "import getpass, os\n", + "from teradataml import *\n", + "import teradatagenai\n", + "import time\n", + "from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline\n", + "from sentence_transformers import SentenceTransformer\n", + "from IPython.display import display as ipydisplay\n", + "#from teradataml import create_context, set_config_params, list_base_envs, list_user_envs, create_env" + ] + }, + { + "cell_type": "markdown", + "id": "6e5186d3-d24b-4c11-9ef3-5a8152fdfcdb", + "metadata": {}, + "source": [ + "
\n", + "

2. Connect to VantageCloud Lake

\n", + "

Connect to VantageCloud using create_context from the teradataml Python library. If this environment has been prepared for connecting to a VantageCloud Lake OAF Container, all the details required will be loaded and you will see an acknowledgement after executing this cell.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe5a0b4f-c0c0-4e62-a5f5-9cc9d4f45d53", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Checking if this environment is ready to connect to VantageCloud Lake...\")\n", + "\n", + "if os.path.exists(\"/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env\"):\n", + " print(\"Your environment parameter file exist. Please proceed with this use case.\")\n", + " # Load all the variables from the .env file into a dictionary\n", + " env_vars = dotenv_values(\"/home/jovyan/JupyterLabRoot/VantageCloud_Lake/.config/.env\")\n", + " # Create the Context\n", + " eng = create_context(host=env_vars.get(\"host\"), username=env_vars.get(\"username\"), password=env_vars.get(\"my_variable\"))\n", + " execute_sql('''SET query_band='DEMO=Employee_Feedback.ipynb;' UPDATE FOR SESSION;''')\n", + " print(\"Connected to VantageCloud Lake with:\", eng)\n", + "else:\n", + " print(\"Your environment has not been prepared for connecting to VantageCloud Lake.\")\n", + " print(\"Please contact the support team.\")" + ] + }, + { + "cell_type": "markdown", + "id": "13d5d8a1-9a9a-44eb-ae8a-7fe77f0d1d19", + "metadata": {}, + "source": [ + "
\n", + "

3. Load the data

\n" + ] + }, + { + "cell_type": "markdown", + "id": "cfdd9d63-da91-4285-873d-bb2b066cdb26", + "metadata": {}, + "source": [ + "

\n", + "We will be loading the sample employee data using the 'load_data()' helper function. To utilize the TextAnalyticsAI functions effectively, we first need to organize our data appropriately. We are particularly interested in the 'articles', 'reviews', 'quotes', and 'employee_data' columns for each 'employee_id' and 'employee_name' in our dataframe.\n", + "\n", + "

\n", + "To streamline this process, we will generate individual dataframes for each of these columns:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84208668-f1e2-47f4-9c23-7eb2df2f632a", + "metadata": {}, + "outputs": [], + "source": [ + "load_data('employee', 'employee_data')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9bf4cd6b-c171-4978-b7d8-186bf8c31323", + "metadata": {}, + "outputs": [], + "source": [ + "df=DataFrame('employee_data')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fb2a89a8-0339-4c20-aec9-90dbc0d8256d", + "metadata": {}, + "outputs": [], + "source": [ + "# Create separate DataFrames for articles, reviews, quotes, and employee data.\n", + "df_articles = df.select([\"employee_id\", \"employee_name\", \"articles\"])\n", + "df_reviews = df.select([\"employee_id\", \"employee_name\", \"reviews\"])\n", + "df_quotes = df.select([\"employee_id\", \"employee_name\", \"quotes\"])\n", + "df_employeeData = df.select([\"employee_id\", \"employee_name\", \"employee_data\"])\n", + "df_classify_articles = df.select([\"employee_id\", \"articles\"])" + ] + }, + { + "cell_type": "markdown", + "id": "49ea3cfe-b271-4980-a667-1a78f770a68e", + "metadata": {}, + "source": [ + "


\n", + "

4. Authenticate and Prepare the OAF Environment

\n", + "

\n", + "The teradataml library offers simple yet powerful methods for creating and managing custom Python runtime environments within VantageCloud. This gives developers full control over model behavior, performance, and analytic accuracy when running on the Analytic Cluster.\n", + "

\n", + "\n", + "

\n", + "Custom environments are persistent—created once and reused as needed. They can be saved, updated, or modified at any time, allowing for efficient and flexible environment management.\n", + "

\n", + "\n", + "

\n", + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + "
    \n", + "
1. Create a unique user environment based on available base images
2. Install libraries
3. Install models and additional user artifacts
\n", + "
\n", + " \"Container\n", + "
\n", + "

4.1 UES Authentication

\n", + "

This security mechanism is required to create and manage the Python or R environments used in this demo. A VantageCloud Lake user can easily create the authentication objects using the Console in a VantageCloud Lake environment. For this use case, the authentication objects have already been created and copied into this JupyterLab environment for you.\n", + "

\n", + "

\n", + " \n", + "

\n", + "

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43f1a6ca-d0c1-4ca4-96ed-377904929917", + "metadata": {}, + "outputs": [], + "source": [ + "# We've already loaded all the values into our environment variables and into a dictionary, env_vars.\n", + "# username=env_vars.get(\"username\") isn't required when using base_url, pat and pem.\n", + "\n", + "if set_auth_token(base_url=env_vars.get(\"ues_uri\"),\n", + " pat_token=env_vars.get(\"access_token\"), \n", + " pem_file=env_vars.get(\"pem_file\"),\n", + " valid_from=int(time.time())\n", + " ):\n", + " print(\"UES Authentication successful\")\n", + "else:\n", + " print(\"UES Authentication failed. Check credentials.\")\n", + " sys.exit(1)" + ] + }, + { + "cell_type": "markdown", + "id": "d4793280-36a9-48ca-958f-01502200bbf2", + "metadata": {}, + "source": [ + "

4.2 Check for an existing OAF environment or create a new one

\n", + "

It's fine to reuse the same OAF environment. Our VantageCloud Lake OAF use cases and demos use a default naming convention for environment names. If you haven't already created one, we'll create it now.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d791a944-7e04-4b11-83b7-4e5fb3a0da1a", + "metadata": {}, + "outputs": [], + "source": [ + "environment_name = env_vars.get(\"username\")\n", + "print(\"\\nHere is a list of your current environments:\")\n", + "env_list = list_user_envs()\n", + "ipydisplay(env_list)\n", + "\n", + "if environment_name in env_list['env_name'].values: \n", + " demo_env = get_env(environment_name)\n", + " print(\"Your default environment already exists. You can continue with this notebook.\\n\\n\")\n", + "else:\n", + " demo_env = create_env(env_name=f'{environment_name}', base_env='python_3.10')\n", + " print(demo_env)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35416e0f-92d6-4a19-bd2f-ea2ea1ff5ac2", + "metadata": {}, + "outputs": [], + "source": [ + "lib_claim_id = demo_env.install_lib([\"transformers\", \"torch\",\"sentencepiece\",\"sentence-transformers\"])\n", + "print(\"Libraries Installed\") \n", + "#Get the status of the libraries installation\n", + "demo_env.status(str(lib_claim_id[\"Claim Id\"].iloc[0]))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bb03c237-da21-4d6c-87dd-fea4d90be82f", + "metadata": {}, + "outputs": [], + "source": [ + "gpu_compute_group = env_vars.get(\"gpu_compute_group\")\n", + "execute_sql(f\"SET SESSION COMPUTE GROUP {gpu_compute_group};\")\n", + "print(f\"Compute group set to {gpu_compute_group}\") " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3ece2e4-418d-461e-a6c4-89afa8412023", + "metadata": {}, + "outputs": [], + "source": [ + "def clean_env(llm):\n", + " ##Get LLM\n", + " llm_instance = llm.get_llm()\n", + " print(\"LLM instance:\", llm_instance)\n", + " ##Remove LLM\n", + " llm.remove()\n", + " print(\"LLM removed successfully.\")" + ] + }, + { + "cell_type": "markdown", + "id": "e8ad3c91-dd5c-4972-a513-0f04abf464c4", + "metadata": { + "tags": [] + }, + "source": [ + "
\n", + "

5. Sentiment Analysis

\n", + "\n", + "

First, we want to gauge employee morale by analyzing the emotional tone of employee reviews and quotes.\n", + "We use the Hugging Face model bhadresh-savani/distilbert-base-uncased-emotion, which detects emotions such as joy, anger, sadness, fear, love, and surprise.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "132e52ef-9cf6-471d-814f-2d6a6aa81b11", + "metadata": {}, + "outputs": [], + "source": [ + "# Acess LLM endpoint\n", + "model_name = 'bhadresh-savani/distilbert-base-uncased-emotion'\n", + "model_args = {'transformer_class': 'AutoModelForSequenceClassification',\n", + " 'task' : 'text-classification'}\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args)" + ] + }, + { + "cell_type": "markdown", + "id": "be498e09-699f-44a6-a8d4-f5ce9e348387", + "metadata": {}, + "source": [ + "

5.1 Create the TextAnalyticsAI object

\n", + "

Now we can execute the portion of this demo that will run in our GPU Analytics Cluster. We'll provide the TextAnalyticsAI object with the preferred large language model. This will enable us to execute a variety of text analytics tasks.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "de1f4d62-c9ef-47c8-b612-fc9443a73bd7", + "metadata": {}, + "outputs": [], + "source": [ + "# Create a TextAnalyticsAI object.\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d561f8bd-36d9-4647-b535-4b08edc3c826", + "metadata": {}, + "outputs": [], + "source": [ + "# Using the default script\n", + "obj.analyze_sentiment(column='reviews', data=df_reviews, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab559276-bc4a-4304-baa2-7ae4a3f6d144", + "metadata": {}, + "outputs": [], + "source": [ + "# Using sample_script with output_labels.\n", + "obj.analyze_sentiment(column='reviews', data=df_reviews,\n", + "output_labels={'label': str, 'score': float}, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ae2b70a-578c-49ca-8d26-72c6b6d110b1", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "09c61434-76f0-4aaf-b58f-c46c448ca2ee", + "metadata": {}, + "source": [ + "
\n", + "

6. Key Phrase Extraction

\n", + "

Next, we extract key phrases to identify recurring themes in employee responses, such as “work-life balance,” “salary growth,” or “team support.”\n", + "This helps HR quickly spot the main concerns and motivators.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5fb2c00c-83eb-4e61-bfb0-00a3ba4fbbf4", + "metadata": {}, + "outputs": [], + "source": [ + "# Accessing the LLM endpoint and initializing the TeradataAI and TextAnalyticsAI\n", + "model_name = 'ml6team/keyphrase-extraction-kbir-kpcrowd'\n", + "model_args = {'transformer_class': 'AutoModelForTokenClassification',\n", + " 'task' : 'text-classification'} \n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2000e6b-9026-4747-806d-8031457949cb", + "metadata": {}, + "outputs": [], + "source": [ + "# Default script is used\n", + "obj.extract_key_phrases(column=\"articles\", data=df_articles, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ed30576-90ed-4725-bbba-21499fa4aaa8", + "metadata": {}, + "outputs": [], + "source": [ + "# Using a user defined script.\n", + "base_dir = os.path.dirname(teradatagenai.__file__)\n", + "extract_key_phrases_script = os.path.join(base_dir, 'example-data', 'extract_key_phrases.py')\n", + "obj.extract_key_phrases(column=\"articles\", data=df_articles, script=extract_key_phrases_script, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5be6d656-57a2-4a77-aaa2-cc961763fbf3", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "517d4fb7-43ec-4efe-812f-807dcd8b9d55", + "metadata": {}, + "source": [ + "
\n", + "

7. Recognize Entities

\n", + "

Employees often mention departments, managers, projects, and organizations in their feedback.\n", + "By running entity recognition, we can structure unstructured text and identify these references for deeper analysis.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea80aab4-62e4-4683-8425-42ba01e3ae7f", + "metadata": {}, + "outputs": [], + "source": [ + "# # Accessing the LLM endpoint and initializing TeradataAI and TextAnalyticsAI\n", + "model_name = 'tner/roberta-large-ontonotes5'\n", + "model_args = {'transformer_class': 'AutoModelForTokenClassification',\n", + " 'task' : 'token-classification'}\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5972fc0-2088-4f16-afbe-ff114748a041", + "metadata": {}, + "outputs": [], + "source": [ + "# Default script is used\n", + "obj.recognize_entities(column='articles', data=df_articles, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "612a4728-bae4-46d2-a56a-8f07ebc32e53", + "metadata": {}, + "outputs": [], + "source": [ + "# use user_defined script for inferencing along with returns argument \n", + "base_dir = os.path.dirname(teradatagenai.__file__)\n", + "entity_recognition_script = os.path.join(base_dir, 'example-data', 'entity_recognition.py')\n", + "obj.recognize_entities(column='articles',\n", + " returns = {\"text\": VARCHAR(64000),\n", + " \"ORG\": VARCHAR(64000),\n", + " \"PERSON\": VARCHAR(64000),\n", + " \"DATE1\": VARCHAR(64000),\n", + " \"PRODUCT\": VARCHAR(64000),\n", + " \"GPE\": VARCHAR(64000),\n", + " \"EVENT\": VARCHAR(64000),\n", + " \"LOC\": VARCHAR(64000),\n", + " \"WORK_OF_ART\": VARCHAR(64000)},\n", + " data=df_articles,\n", + " script = entity_recognition_script, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0878a42e-4a84-4bf3-b353-87313db0070c", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "6f880b17-52bf-4b3f-ba81-ea10fb58a4d8", + "metadata": {}, + "source": [ + "
\n", + "

8. Language Detection

\n", + "

Since employees may respond in multiple languages, we first detect the language of the feedback.\n", + "This ensures proper routing and translation where needed.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7469bfb-8a89-45b9-9e5b-8e3fca4a5c7b", + "metadata": {}, + "outputs": [], + "source": [ + "# Accessing the LLM endpoint and initializing the TeradataAI and TextAnalyticsAI\n", + "# demo_env = create_env(env_name=f'{environment_name}', base_env='python_3.10', desc='BYOLLM demo env')\n", + "#demo_env = create_env(env_name=f'{environment_name}', base_env='python_3.10')\n", + "model_name = 'papluca/xlm-roberta-base-language-detection'\n", + "model_args = {'transformer_class': 'AutoModelForSequenceClassification', 'task' : 'text-classification'}\n", + "ues_args = {'env_name': f'{environment_name}'}\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args,\n", + " ues_args = ues_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f89731ca-122a-475a-979f-2a6f7df3b8bc", + "metadata": {}, + "outputs": [], + "source": [ + "# Default script is used\n", + "obj.detect_language(column=\"quotes\", data=df_quotes, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a199252d-575d-4c79-b076-31f97c168006", + "metadata": {}, + "outputs": [], + "source": [ + "# output_labels argument is specified along with the default script\n", + "obj.detect_language(column='quotes', data=df_quotes, output_labels={'label': str, 'score': float}, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15c6b7e0-f134-4019-a4b3-e1eb09252cd9", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "0736981c-496a-4f67-8aa0-df74d450e4bd", + "metadata": {}, + "source": [ + "
\n", + "

9. Text Summarization

\n", + "

Some employee feedback may be lengthy. Using summarization, we create concise reports that highlight the main point without losing meaning.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b6b277a2-dc06-44f8-8636-5783e46b5432", + "metadata": {}, + "outputs": [], + "source": [ + "# Accessing the LLM endpoint and initializing TeradataAI and TextAnalyticsAI\n", + "model_name = 'facebook/bart-large-cnn'\n", + "model_args = {'transformer_class': 'AutoModelForSeq2SeqLM', 'task' : 'summarization'}\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + "model_name = model_name,\n", + "model_args = model_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "948635ea-879f-46d0-a982-abf38ddbbd56", + "metadata": {}, + "outputs": [], + "source": [ + "# Using default script\n", + "obj.summarize(column='articles', data=df_articles, quotechar=\"|\", delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3c4a5ee7-33fe-45c8-8f85-034a2e6a0b3b", + "metadata": {}, + "outputs": [], + "source": [ + "# Using a user defined script.\n", + "base_dir = os.path.dirname(teradatagenai.__file__)\n", + "summarization_script = os.path.join(base_dir, 'example-data', 'summarize_text.py')\n", + "obj.summarize(column='articles',\n", + " returns = {\"text\": VARCHAR(10000),\n", + " \"summarized_text\": VARCHAR(10000)},\n", + " data=df_articles,\n", + " script = summarization_script, quotechar=\"|\", delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd9e7eeb-9bc4-4cfb-a0ec-3d9fedd4ea7d", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "d0a93f3d-e31a-4eaa-a035-e6ef5729c676", + "metadata": {}, + "source": [ + "
\n", + "

10. Text Classification

\n", + "

To make HR analysis easier, we classify feedback into categories such as:\n", + "

\n", + "\n", + "

This makes it easy to route feedback to the right HR sub-team.
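\n", + "Once the classification cells below have run, the labeled output can be routed onward. A hypothetical sketch, assuming the classify() result is a teradataml DataFrame whose predicted-label column is named label (verify the actual output schema in your version):\n", + "
```python
# Hypothetical routing step: materialize one queue table per predicted label
classified = obj.classify("articles", df_classify_articles, labels=label, delimiter="#")
tech_queue = classified[classified.label == "technology"]                 # filter one category
tech_queue.to_sql(table_name="hr_queue_technology", if_exists="replace")  # hand off to that HR sub-team
```\n", + "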

\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8dd0b460-f177-4236-8d39-86cd44796920", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Accessing the LLM endpoint and initializing TeradataAI and TextAnalyticsAI\n", + "model_name = 'facebook/bart-large-mnli'\n", + "model_args = {'transformer_class': 'AutoModelForSequenceClassification', 'task' : 'zero-shot-classification'}\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c6b20406-21a8-4517-8fd8-ee8908fa8661", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Using default script\n", + "label = [\"Medical\", \"hospital\", \"healthcare\", \"historicalNews\",\n", + " \"Environment\", \"technology\", \"Games\"]\n", + "obj.classify(\"articles\", df_classify_articles, labels=label, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0657dcd-4033-47f7-8749-2d21fa2f43e1", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Using a user defined script.\n", + "base_dir = os.path.dirname(teradatagenai.__file__)\n", + "classify_script = os.path.join(base_dir, 'example-data', 'classify_text.py')\n", + "\n", + "obj.classify(\"articles\",\n", + " df_classify_articles,\n", + " labels=[\"Medical\", \"Hospitality\", \"Healthcare\",\n", + " \"historical-news\", \"Games\",\n", + " \"Environment\", \"Technology\",\n", + " \"Games\"], script=classify_script, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "892125c8-f023-4b8b-a897-e00bbbfa2ab5", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "82b9e656-d8d6-492d-b913-8130e76689af", + "metadata": {}, + "source": [ + "
\n", + "

11. Language Translation

\n", + "

Once the language is detected, feedback can be translated into a common language so the HR team can view all responses together. As an illustration, the cells below use the Helsinki-NLP/opus-mt-en-fr model to translate English quotes into French; choose the opus-mt model that matches your own source and target languages.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1692f21b-fb7f-48ac-a6b8-019b27b51bf2", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Acessing the LLM endpoint and initializing TeradataAI and TextAnalyticsAI\n", + "model_name = 'Helsinki-NLP/opus-mt-en-fr'\n", + "model_args = {'transformer_class': 'AutoModelForSeq2SeqLM', 'task' : 'translation'}\n", + "ues_args = {'env_name': f'{environment_name}'}\n", + "\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args,\n", + " ues_args = ues_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca93cd2d-8d9e-45e6-b334-8b70547cc94f", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Default script is used\n", + "obj.translate(column=\"quotes\", data=df_quotes, target_lang=\"French\", delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "994a36ae-b1cd-4eea-a368-f77b2af764d5", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# output_labels argument is specified along with the default script\n", + "obj.translate(column=\"quotes\", data=df_quotes, target_lang=\"French\", output_labels={'translation_text': str}, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "17a1fd81-c732-461d-a13d-c4cf3797a5dc", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "9f9f4384-9568-48c2-86c9-5e0982031708", + "metadata": {}, + "source": [ + "
\n", + "

12. Recognize PII

\n", + "

In this section, we'll delve into the recognize_pii_entities() function provided by TextAnalyticsAI. This function is designed to identify Personally Identifiable Information (PII) entities within text data. PII entities can include sensitive data such as names, addresses, social security numbers, email addresses, and phone numbers.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0d7b68f2-5153-4773-a8d2-dd28aab56b33", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Acessing the LLM endpoint and initializing the TeradataAI\n", + "model_name = 'lakshyakh93/deberta_finetuned_pii'\n", + "model_args = {'transformer_class': 'AutoModelForTokenClassification', 'task' : 'token-classification'}\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "28de1107-a247-44fd-8276-2344e7e438d5", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Default script is used\n", + "obj.recognize_pii_entities(column=\"employee_data\", data=df_employeeData, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a0145cb5-45ad-459b-9d53-f19b8bb7014a", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Using a user defined script.\n", + "base_dir = os.path.dirname(teradatagenai.__file__)\n", + "recognize_script = os.path.join(base_dir, 'example-data', 'recognize_pii.py')\n", + "obj.recognize_pii_entities(column=\"employee_data\", data=df_employeeData, script=recognize_script, delimiter=\"#\")" + ] + }, + { + "cell_type": "markdown", + "id": "51033108-1692-4fd6-b709-376d428e4a6a", + "metadata": {}, + "source": [ + "
\n", + "

13. Mask PII

\n", + "

In this section, we'll delve into the mask_pii() function provided by TextAnalyticsAI. This function masks Personally Identifiable Information (PII) entities within text data, obscuring sensitive values such as names, addresses, social security numbers, email addresses, and phone numbers before the feedback is shared or analyzed further.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a49409ee-39dc-4ab7-beec-67e727b40a65", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Acessing the LLM endpoint and initializing the TeradataAI\n", + "model_name = 'lakshyakh93/deberta_finetuned_pii'\n", + "model_args = {'transformer_class': 'AutoModelForTokenClassification', 'task' : 'token-classification'}\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4b87438-f943-4818-9161-8ba9f00579ca", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Using a user defined script.\n", + "base_dir = os.path.dirname(teradatagenai.__file__)\n", + "mask_pii_script = os.path.join(base_dir, 'example-data', 'mask_pii.py')\n", + "obj.mask_pii(column=\"employee_data\", data=df_employeeData, script=mask_pii_script, delimiter=\"#\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a1bfacc-91da-4211-9254-fe34539a5310", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "dbe25953-39dd-4884-b0eb-3cd1fcf9d514", + "metadata": {}, + "source": [ + "
\n", + "

14. Sentence Similarity

\n", + "

We can check similarity between employee responses to group together feedback that talks about the same issue.\n", + "This helps HR avoid duplicate analysis and focus on unique concerns.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5b701a2b-9476-4f4f-84b8-011f35c128ed", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Acessing the LLM endpoint and initializing the TeradataAI and TextAnalyticsAI\n", + "model_name = 'sentence-transformers/all-MiniLM-L6-v2'\n", + "model_args = {'transformer_class': 'AutoModelForTokenClassification', 'task' : 'token-classification'}\n", + "ues_args = {'env_name': f'{environment_name}'}\n", + "\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args,\n", + " ues_args = ues_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4520a91e-cc01-4272-8d45-e89d6712dbfe", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Using a user-defind script\n", + "base_dir = os.path.dirname(teradatagenai.__file__)\n", + "sentence_similarity_script = os.path.join(base_dir, 'example-data', 'sentence_similarity.py')\n", + "obj.sentence_similarity(column1=\"employee_data\", column2=\"articles\", data=df, script=sentence_similarity_script, delimiter=\"#\")" + ] + }, + { + "cell_type": "markdown", + "id": "36e02264-5fb5-4e24-a80b-9a33dfd9904d", + "metadata": {}, + "source": [ + "
\n", + "

15. Embeddings

\n", + "

Finally, we generate vector embeddings for each feedback entry.\n", + "This enables:\n", + "

\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "06f205d2-7bc2-41ea-88bd-f473d44cae62", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Acessing the LLM endpoint and initializing TeradataAI and TextAnalyticsAI\n", + "model_name = 'sentence-transformers/all-MiniLM-L6-v2'\n", + "model_args = {'transformer_class': 'AutoModelForTokenClassification', 'task' : 'token-classification'}\n", + "llm = TeradataAI(api_type = \"hugging_face\",\n", + " model_name = model_name,\n", + " model_args = model_args)\n", + "obj = TextAnalyticsAI(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "af205bdc-e52a-43fd-980a-82b73e070629", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# Using a user-defined script and returns argument\n", + "embeddings_script = os.path.join(base_dir, 'example-data', 'embeddings.py')\n", + "# Construct retrun columns\n", + "returns_ = OrderedDict([('text', VARCHAR(512))])\n", + "\n", + "_ = [returns_.update({\"v{}\".format(i+1): VARCHAR(1000)}) for i in range(384)]\n", + "obj.embeddings(column=\"articles\",data=df, script=embeddings_script, returns=returns_, libs='sentence_transformers', delimiter='#', persist=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e53299b2-2bf6-4e1c-a10f-085bb230eb98", + "metadata": {}, + "outputs": [], + "source": [ + "clean_env(llm)" + ] + }, + { + "cell_type": "markdown", + "id": "1afe752d-4b87-4c8b-ae3b-5f3339af6102", + "metadata": {}, + "source": [ + "
\n", + "

16. Insights & Conclusion

\n", + "

By combining these steps, the HR team can:\n", + "

\n", + "\n", + "

This end-to-end pipeline transforms unstructured employee feedback into actionable insights for HR decision-making.\n", + "
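\n", + "As one concrete rollup, the emotion labels from Section 5 can be aggregated into a simple morale summary. A sketch, assuming analyze_sentiment() with output_labels returns a teradataml DataFrame whose label column holds the detected emotion; run it while that model is still loaded:\n", + "
```python
# Feedback volume per detected emotion (run during Section 5, before clean_env(llm))
sentiment_df = obj.analyze_sentiment(column='reviews', data=df_reviews,
                                     output_labels={'label': str, 'score': float},
                                     delimiter='#')
morale_summary = sentiment_df.groupby('label').count()  # rows per emotion label
morale_summary
```\n", + "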

" + ] + }, + { + "cell_type": "markdown", + "id": "b7e0da01-454a-4970-a651-c7f5b3553e5b", + "metadata": {}, + "source": [ + "
\n", + "

17. Cleanup

\n", + "

17.1 Delete your OAF Container

\n", + "

Executing this cell is optional. If you will be running more OAF use cases, you can leave your OAF environment in place.

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "241b7cf3-1eb5-47e7-b0c7-c2bf2bdae0e0", + "metadata": {}, + "outputs": [], + "source": [ + "#Remove your default user environment\n", + "\n", + "try:\n", + " result = remove_env(environment_name)\n", + " print(\"Environment removed!\")\n", + "except Exception as e:\n", + " print(\"Could not remove the environment!\")\n", + " print(\"Error:\", str(e))" + ] + }, + { + "cell_type": "markdown", + "id": "7e3c08dc-d6c0-45fe-bd6b-77a5f0298dbf", + "metadata": {}, + "source": [ + "

17.2 Remove your database Context

\n", + "

Please remove your context after you've completed this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dac8d030-2f89-4160-b2da-6a233c693018", + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " result = remove_context()\n", + " print(\"Context removed!\")\n", + "except Exception as e:\n", + " print(\"Could not remove the Context!\")\n", + " print(\"Error:\", str(e))" + ] + }, + { + "cell_type": "markdown", + "id": "cda83a5b-e9fe-4734-aa28-5d5332cbd873", + "metadata": {}, + "source": [ + "


\n", + "

View the full TeradataAI Help

" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "edf9295a-2598-426d-b322-bba46a50c246", + "metadata": {}, + "outputs": [], + "source": [ + "help(TeradataAI)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46975cbc-778f-40c6-9388-d2165b718cea", + "metadata": {}, + "outputs": [], + "source": [ + "help(TextAnalyticsAI)" + ] + }, + { + "cell_type": "markdown", + "id": "b8587ea3-ab30-4a63-9c86-6e11fddb79be", + "metadata": {}, + "source": [ + "" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/OAF_Env.png b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/OAF_Env.png new file mode 100644 index 00000000..1be627c3 Binary files /dev/null and b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/OAF_Env.png differ diff --git a/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/TeradataLogo.png b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/TeradataLogo.png new file mode 100644 index 00000000..a6811164 Binary files /dev/null and b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/TeradataLogo.png differ diff --git a/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/teradatagenai.png b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/teradatagenai.png new file mode 100644 index 00000000..e7212395 Binary files /dev/null and b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/images/teradatagenai.png differ diff --git a/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/requirements.txt b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/requirements.txt new file mode 100644 index 00000000..537bb425 --- /dev/null +++ b/VantageCloud_Lake/UseCases/Employee_Feedback_teradatagenai/requirements.txt @@ -0,0 +1,4 @@ +sentence_transformers +teradatagenai +transformers +python-dotenv \ No newline at end of file