diff --git a/ingest_data/012_Ingest_text_data_v2.ipynb b/ingest_data/012_Ingest_text_data_v2.ipynb
deleted file mode 100644
index a0a1a120f8..0000000000
--- a/ingest_data/012_Ingest_text_data_v2.ipynb
+++ /dev/null
@@ -1,1034 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Ingest Text Data\n",
- "Labeled text data can be in a structured data format, such as reviews for sentiment analysis, news headlines for topic modeling, or documents for text classification. In these cases, you may have one column for the label, one column for the text, and sometimes other columns for attributes. You can treat this structured data like tabular data, and ingest it in one of the ways discussed in the previous notebook [011_Ingest_tabular_data.ipynb](011_Ingest_tabular_data_v1.ipynb). Sometimes text data, especially raw text data comes as unstructured data and is often in .json or .txt format, and we will discuss how to ingest these types of data files into a SageMaker Notebook in this section.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Set Up Notebook"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[33mWARNING: You are using pip version 20.0.2; however, version 20.2.4 is available.\n",
- "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.\u001b[0m\n",
- "Note: you may need to restart the kernel to use updated packages.\n"
- ]
- }
- ],
- "source": [
- "%pip install -q 's3fs==0.4.2'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "import json\n",
- "import glob\n",
- "import s3fs\n",
- "import sagemaker"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Get SageMaker session & default S3 bucket\n",
- "sagemaker_session = sagemaker.Session()\n",
- "bucket = sagemaker_session.default_bucket() # replace with your own bucket if you have one\n",
- "s3 = sagemaker_session.boto_session.resource(\"s3\")\n",
- "\n",
- "prefix = \"text_spam/spam\"\n",
- "prefix_json = \"json_jeo\"\n",
- "filename = \"SMSSpamCollection.txt\"\n",
- "filename_json = \"JEOPARDY_QUESTIONS1.json\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Downloading data from Online Sources\n",
- "\n",
- "### Text data (in structured .csv format): Twitter -- sentiment140\n",
- " **Sentiment140** This is the sentiment140 dataset. It contains 1.6M tweets extracted using the twitter API. The tweets have been annotated with sentiment (0 = negative, 4 = positive) and topics (hashtags used to retrieve tweets). The dataset contains the following columns:\n",
- "* `target`: the polarity of the tweet (0 = negative, 4 = positive)\n",
- "* `ids`: The id of the tweet ( 2087)\n",
- "* `date`: the date of the tweet (Sat May 16 23:58:44 UTC 2009)\n",
- "* `flag`: The query (lyx). If there is no query, then this value is NO_QUERY.\n",
- "* `user`: the user that tweeted (robotickilldozr)\n",
- "* `text`: the text of the tweet (Lyx is cool\n",
- "\n",
- "[Second Twitter data](https://github.com/guyz/twitter-sentiment-dataset) is a Twitter data set collected as an extension to Sanders Analytics Twitter sentiment corpus, originally designed for training and testing Twitter sentiment analysis algorithms. We will use this data to showcase how to aggregate two data sets if you want to enhance your current data set by adding more data to it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "# helper functions to upload data to s3\n",
- "def write_to_s3(filename, bucket, prefix):\n",
- " # put one file in a separate folder. This is helpful if you read and prepare data with Athena\n",
- " filename_key = filename.split(\".\")[0]\n",
- " key = \"{}/{}/{}\".format(prefix, filename_key, filename)\n",
- " return s3.Bucket(bucket).upload_file(filename, key)\n",
- "\n",
- "\n",
- "def upload_to_s3(bucket, prefix, filename):\n",
- " url = \"s3://{}/{}/{}\".format(bucket, prefix, filename)\n",
- " print(\"Writing to {}\".format(url))\n",
- " write_to_s3(filename, bucket, prefix)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "# run this cell if you are in SageMaker Studio notebook\n",
- "#!apt-get install unzip"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "URL transformed to HTTPS due to an HSTS policy\n",
- "--2020-11-02 21:16:07-- https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip\n",
- "Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64\n",
- "Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 81363704 (78M) [application/zip]\n",
- "Saving to: ‘sentimen140.zip’\n",
- "\n",
- "sentimen140.zip 100%[===================>] 77.59M 18.9MB/s in 6.4s \n",
- "\n",
- "2020-11-02 21:16:14 (12.1 MB/s) - ‘sentimen140.zip’ saved [81363704/81363704]\n",
- "\n",
- "Archive: sentimen140.zip\n",
- " inflating: sentiment140/testdata.manual.2009.06.14.csv \n",
- " inflating: sentiment140/training.1600000.processed.noemoticon.csv \n"
- ]
- }
- ],
- "source": [
- "# download first twitter dataset\n",
- "!wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip -O sentimen140.zip\n",
- "# Uncompressing\n",
- "!unzip -o sentimen140.zip -d sentiment140"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140/sentiment140/training.1600000.processed.noemoticon.csv\n",
- "Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140/sentiment140/testdata.manual.2009.06.14.csv\n"
- ]
- }
- ],
- "source": [
- "# upload the files to the S3 bucket\n",
- "csv_files = glob.glob(\"sentiment140/*.csv\")\n",
- "for filename in csv_files:\n",
- " upload_to_s3(bucket, \"text_sentiment140\", filename)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2020-11-02 21:16:18-- https://raw.githubusercontent.com/zfz/twitter_corpus/master/full-corpus.csv\n",
- "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.200.133\n",
- "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.200.133|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 910195 (889K) [text/plain]\n",
- "Saving to: ‘full-corpus.csv.2’\n",
- "\n",
- "full-corpus.csv.2 100%[===================>] 888.86K --.-KB/s in 0.08s \n",
- "\n",
- "2020-11-02 21:16:19 (10.2 MB/s) - ‘full-corpus.csv.2’ saved [910195/910195]\n",
- "\n"
- ]
- }
- ],
- "source": [
- "# download second twitter dataset\n",
- "!wget https://raw.githubusercontent.com/zfz/twitter_corpus/master/full-corpus.csv"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing to s3://sagemaker-us-east-2-060356833389/text_twitter_sentiment_2/full-corpus.csv\n"
- ]
- }
- ],
- "source": [
- "filename = \"full-corpus.csv\"\n",
- "upload_to_s3(bucket, \"text_twitter_sentiment_2\", filename)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Text data (in .txt format): SMS Spam data \n",
- "[SMS Spam Data](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. Each line in the text file has the correct class followed by the raw message. We will use this data to showcase how to ingest text data in .txt format."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing to s3://sagemaker-us-east-2-060356833389/text_spam/spam/SMSSpamCollection.txt\n"
- ]
- }
- ],
- "source": [
- "txt_files = glob.glob(\"spam/*.txt\")\n",
- "for filename in txt_files:\n",
- " upload_to_s3(bucket, \"text_spam\", filename)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2020-11-02 21:16:19-- http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip\n",
- "Resolving www.dt.fee.unicamp.br (www.dt.fee.unicamp.br)... 143.106.12.20\n",
- "Connecting to www.dt.fee.unicamp.br (www.dt.fee.unicamp.br)|143.106.12.20|:80... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 210521 (206K) [application/zip]\n",
- "Saving to: ‘spam.zip’\n",
- "\n",
- "spam.zip 100%[===================>] 205.59K 112KB/s in 1.8s \n",
- "\n",
- "2020-11-02 21:16:21 (112 KB/s) - ‘spam.zip’ saved [210521/210521]\n",
- "\n",
- "Archive: spam.zip\n",
- " inflating: spam/readme \n",
- " inflating: spam/SMSSpamCollection.txt \n"
- ]
- }
- ],
- "source": [
- "!wget http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip -O spam.zip\n",
- "!unzip -o spam.zip -d spam"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Text Data (in .json format): Jeopardy Question data\n",
- "[Jeopardy Question](https://j-archive.com/) was obtained by crawling the Jeopardy question archive website. It is an unordered list of questions where each question has the following key-value pairs:\n",
- "\n",
- "* `category` : the question category, e.g. \"HISTORY\"\n",
- "* `value`: dollar value of the question as string, e.g. \"\\$200\"\n",
- "* `question`: text of question\n",
- "* `answer` : text of answer\n",
- "* `round`: one of \"Jeopardy!\",\"Double Jeopardy!\",\"Final Jeopardy!\" or \"Tiebreaker\"\n",
- "* `show_number` : string of show number, e.g '4680'\n",
- "* `air_date` : the show air date in format YYYY-MM-DD"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2020-11-02 21:16:22-- http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz\n",
- "Resolving skeeto.s3.amazonaws.com (skeeto.s3.amazonaws.com)... 52.216.241.76\n",
- "Connecting to skeeto.s3.amazonaws.com (skeeto.s3.amazonaws.com)|52.216.241.76|:80... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 12721082 (12M) [application/json]\n",
- "Saving to: ‘JEOPARDY_QUESTIONS1.json.gz’\n",
- "\n",
- "JEOPARDY_QUESTIONS1 100%[===================>] 12.13M 15.0MB/s in 0.8s \n",
- "\n",
- "2020-11-02 21:16:23 (15.0 MB/s) - ‘JEOPARDY_QUESTIONS1.json.gz’ saved [12721082/12721082]\n",
- "\n",
- "Writing to s3://sagemaker-us-east-2-060356833389/json_jeo/JEOPARDY_QUESTIONS1.json\n"
- ]
- }
- ],
- "source": [
- "# json file format\n",
- "!wget http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz\n",
- "# Uncompressing\n",
- "!gunzip -f JEOPARDY_QUESTIONS1.json.gz\n",
- "filename = \"JEOPARDY_QUESTIONS1.json\"\n",
- "upload_to_s3(bucket, \"json_jeo\", filename)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Ingest Data into Sagemaker Notebook\n",
- "## Method 1: Copying data to the Instance\n",
- "You can use the AWS Command Line Interface (CLI) to copy your data from s3 to your SageMaker instance. This is a quick and easy approach when you are dealing with medium sized data files, or you are experimenting and doing exploratory analysis. The documentation can be found [here](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Specify file names\n",
- "prefix = \"text_spam/spam\"\n",
- "prefix_json = \"json_jeo\"\n",
- "filename = \"SMSSpamCollection.txt\"\n",
- "filename_json = \"JEOPARDY_QUESTIONS1.json\"\n",
- "prefix_spam_2 = \"text_spam/spam_2\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "download failed: s3://sagemaker-us-east-2-060356833389/json_jeo/JEOPARDY_QUESTIONS1.json to text/json_jeo/JEOPARDY_QUESTIONS1.json [Errno 28] No space left on device\n",
- "download failed: s3://sagemaker-us-east-2-060356833389/json_jeo/JEOPARDY_QUESTIONS1/JEOPARDY_QUESTIONS1.json to text/json_jeo/JEOPARDY_QUESTIONS1/JEOPARDY_QUESTIONS1.json [Errno 28] No space left on device\n"
- ]
- }
- ],
- "source": [
- "# copy data to your sagemaker instance using AWS CLI\n",
- "!aws s3 cp s3://$bucket/$prefix_json/ text/$prefix_json/ --recursive"
- ]
- },
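- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "You can also copy a single file without the `--recursive` flag. The next cell is a quick sketch that copies the SMS spam file, assuming the `bucket`, `prefix`, and `filename` variables defined above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# copy a single object from S3 to the instance (a sketch; uses variables defined above)\n",
- "!aws s3 cp s3://$bucket/$prefix/$filename text/$prefix/$filename"
- ]
- },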
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'category': 'HISTORY', 'air_date': '2004-12-31', 'question': \"'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'\", 'value': '$200', 'answer': 'Copernicus', 'round': 'Jeopardy!', 'show_number': '4680'}\n"
- ]
- }
- ],
- "source": [
- "data_location = \"text/{}/{}\".format(prefix_json, filename_json)\n",
- "with open(data_location) as f:\n",
- " data = json.load(f)\n",
- " print(data[0])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Method 2: Use AWS compatible Python Packages\n",
- "When you are dealing with large data sets, or do not want to lose any data when you delete your Sagemaker Notebook Instance, you can use pre-built packages to access your files in S3 without copying files into your instance. These packages, such as `Pandas`, have implemented options to access data with a specified path string: while you will use `file://` on your local file system, you will use `s3://` instead to access the data through the AWS boto library. For `pandas`, any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. You can find additional documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). \n",
- "\n",
- "For text data, most of the time you can read it as line-by-line files or use Pandas to read it as a DataFrame by specifying a delimiter."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " 0 | \n",
- " 1 | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " ham | \n",
- " Go until jurong point, crazy.. Available only ... | \n",
- "
\n",
- " \n",
- " | 1 | \n",
- " ham | \n",
- " Ok lar... Joking wif u oni... | \n",
- "
\n",
- " \n",
- " | 2 | \n",
- " spam | \n",
- " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
- "
\n",
- " \n",
- " | 3 | \n",
- " ham | \n",
- " U dun say so early hor... U c already then say... | \n",
- "
\n",
- " \n",
- " | 4 | \n",
- " ham | \n",
- " Nah I don't think he goes to usf, he lives aro... | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " 0 1\n",
- "0 ham Go until jurong point, crazy.. Available only ...\n",
- "1 ham Ok lar... Joking wif u oni...\n",
- "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
- "3 ham U dun say so early hor... U c already then say...\n",
- "4 ham Nah I don't think he goes to usf, he lives aro..."
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data_s3_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n",
- "s3_tabular_data = pd.read_csv(data_s3_location, sep=\"\\t\", header=None)\n",
- "s3_tabular_data.head()"
- ]
- },
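- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As noted above, you can also read text data line by line rather than as a DataFrame. The next cell is a minimal sketch that streams the same S3 object with the `boto3` resource created earlier and splits each line on the tab delimiter; it assumes the `bucket`, `prefix`, and `filename` variables defined above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# read the object line by line instead of as a DataFrame (a sketch)\n",
- "body = s3.Object(bucket, \"{}/{}\".format(prefix, filename)).get()[\"Body\"].read().decode(\"utf-8\")\n",
- "for line in body.splitlines()[:3]:\n",
- " label, _, message = line.partition(\"\\t\")\n",
- " print(label, message[:50])"
- ]
- },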
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For JSON files, depending on the structure, you can also use `Pandas` `read_json` function to read it if it's a flat json file."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " category | \n",
- " air_date | \n",
- " question | \n",
- " value | \n",
- " answer | \n",
- " round | \n",
- " show_number | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " HISTORY | \n",
- " 2004-12-31 | \n",
- " 'For the last 8 years of his life, Galileo was... | \n",
- " $200 | \n",
- " Copernicus | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- " | 1 | \n",
- " ESPN's TOP 10 ALL-TIME ATHLETES | \n",
- " 2004-12-31 | \n",
- " 'No. 2: 1912 Olympian; football star at Carlis... | \n",
- " $200 | \n",
- " Jim Thorpe | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- " | 2 | \n",
- " EVERYBODY TALKS ABOUT IT... | \n",
- " 2004-12-31 | \n",
- " 'The city of Yuma in this state has a record a... | \n",
- " $200 | \n",
- " Arizona | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- " | 3 | \n",
- " THE COMPANY LINE | \n",
- " 2004-12-31 | \n",
- " 'In 1963, live on \"The Art Linkletter Show\", t... | \n",
- " $200 | \n",
- " McDonald\\'s | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- " | 4 | \n",
- " EPITAPHS & TRIBUTES | \n",
- " 2004-12-31 | \n",
- " 'Signer of the Dec. of Indep., framer of the C... | \n",
- " $200 | \n",
- " John Adams | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " category air_date \\\n",
- "0 HISTORY 2004-12-31 \n",
- "1 ESPN's TOP 10 ALL-TIME ATHLETES 2004-12-31 \n",
- "2 EVERYBODY TALKS ABOUT IT... 2004-12-31 \n",
- "3 THE COMPANY LINE 2004-12-31 \n",
- "4 EPITAPHS & TRIBUTES 2004-12-31 \n",
- "\n",
- " question value answer \\\n",
- "0 'For the last 8 years of his life, Galileo was... $200 Copernicus \n",
- "1 'No. 2: 1912 Olympian; football star at Carlis... $200 Jim Thorpe \n",
- "2 'The city of Yuma in this state has a record a... $200 Arizona \n",
- "3 'In 1963, live on \"The Art Linkletter Show\", t... $200 McDonald\\'s \n",
- "4 'Signer of the Dec. of Indep., framer of the C... $200 John Adams \n",
- "\n",
- " round show_number \n",
- "0 Jeopardy! 4680 \n",
- "1 Jeopardy! 4680 \n",
- "2 Jeopardy! 4680 \n",
- "3 Jeopardy! 4680 \n",
- "4 Jeopardy! 4680 "
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data_json_location = \"s3://{}/{}/{}\".format(bucket, prefix_json, filename_json)\n",
- "s3_tabular_data_json = pd.read_json(data_json_location, orient=\"records\")\n",
- "s3_tabular_data_json.head()"
- ]
- },
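- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "If your JSON is nested rather than flat, `read_json` alone may not produce a tidy table. One option -- sketched below with a hypothetical nested record that is not part of this dataset -- is to flatten the records with `pandas.json_normalize` (in older pandas versions this lives at `pandas.io.json.json_normalize`)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# flatten a hypothetical nested record with json_normalize (a sketch)\n",
- "nested_records = [{\"category\": \"HISTORY\", \"meta\": {\"round\": \"Jeopardy!\", \"value\": \"$200\"}}]\n",
- "flat = pd.json_normalize(nested_records)\n",
- "print(flat.columns.tolist()) # ['category', 'meta.round', 'meta.value']"
- ]
- },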
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Method 3: Use AWS Native methods\n",
- "#### s3fs\n",
- "[S3Fs](https://s3fs.readthedocs.io/en/latest/) is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['sagemaker-us-east-2-060356833389/text_spam/spam/SMSSpamCollection',\n",
- " 'sagemaker-us-east-2-060356833389/text_spam/spam/SMSSpamCollection.txt']"
- ]
- },
- "execution_count": 18,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "fs = s3fs.S3FileSystem()\n",
- "data_s3fs_location = \"s3://{}/{}/\".format(bucket, prefix)\n",
- "# To List all files in your accessible bucket\n",
- "fs.ls(data_s3fs_location)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- " ham \\\n",
- "0 ham \n",
- "1 spam \n",
- "\n",
- " Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... \n",
- "0 Ok lar... Joking wif u oni... \n",
- "1 Free entry in 2 a wkly comp to win FA Cup fina... \n"
- ]
- }
- ],
- "source": [
- "# open it directly with s3fs\n",
- "data_s3fs_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n",
- "with fs.open(data_s3fs_location) as f:\n",
- " print(pd.read_csv(f, sep=\"\\t\", nrows=2))"
- ]
- },
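- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Beyond `ls` and `open`, `S3FileSystem` also supports put/get of local files. The next cell is a sketch that downloads the object to a local file with `fs.get`; the local filename is an arbitrary choice for illustration, and `fs.put` does the reverse."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# download the S3 object to the local file system with s3fs (a sketch)\n",
- "fs.get(data_s3fs_location, \"SMSSpamCollection_local.txt\")\n",
- "# fs.put(\"SMSSpamCollection_local.txt\", data_s3fs_location) # would upload it back"
- ]
- },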
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Aggregating datasets\n",
- "If you would like to enhance your data with more data collected for your use cases, you can always aggregate your newly-collected data with your current dataset. We will use two datasets -- Sentiment140 and Sanders Twitter Sentiment to show how to aggregate data together."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [],
- "source": [
- "prefix_tw1 = \"text_sentiment140/sentiment140\"\n",
- "filename_tw1 = \"training.1600000.processed.noemoticon.csv\"\n",
- "prefix_added = \"text_twitter_sentiment_2\"\n",
- "filename_added = \"full-corpus.csv\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's read in our original data and take a look at its format and schema:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [],
- "source": [
- "data_s3_location_base = \"s3://{}/{}/{}\".format(bucket, prefix_tw1, filename_tw1) # S3 URL\n",
- "# we will showcase with a smaller subset of data for demonstration purpose\n",
- "text_data = pd.read_csv(\n",
- " data_s3_location_base, header=None, encoding=\"ISO-8859-1\", low_memory=False, nrows=10000\n",
- ")\n",
- "text_data.columns = [\"target\", \"tw_id\", \"date\", \"flag\", \"user\", \"text\"]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We have 6 columns, `date`, `text`, `flag` (which is the topic the twitter was queried), `tw_id` (tweet's id), `user` (user account name), and `target` (0 = neg, 4 = pos)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " target | \n",
- " tw_id | \n",
- " date | \n",
- " flag | \n",
- " user | \n",
- " text | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " 0 | \n",
- " 1467810369 | \n",
- " Mon Apr 06 22:19:45 PDT 2009 | \n",
- " NO_QUERY | \n",
- " _TheSpecialOne_ | \n",
- " @switchfoot http://twitpic.com/2y1zl - Awww, t... | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " target tw_id date flag \\\n",
- "0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY \n",
- "\n",
- " user text \n",
- "0 _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... "
- ]
- },
- "execution_count": 22,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text_data.head(1)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's read in and take a look at the data we want to add to our original data. \n",
- "\n",
- "We will start by checking for columns for both data sets. The new data set has 5 columns, `TweetDate` which maps to `date`, `TweetText` which maps to `text`, `Topic` which maps to `flag`, `TweetId` which maps to `tw_id`, and `Sentiment` mapped to `target`. In this new data set, we don't have `user account name` column, so when we aggregate two data sets we can add this column to the data set to be added and fill it with `NULL` values. You can also remove this column from the original data if it does not provide much valuable information based on your use cases. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [],
- "source": [
- "data_s3_location_added = \"s3://{}/{}/{}\".format(bucket, prefix_added, filename_added) # S3 URL\n",
- "# we will showcase with a smaller subset of data for demonstration purpose\n",
- "text_data_added = pd.read_csv(\n",
- " data_s3_location_added, encoding=\"ISO-8859-1\", low_memory=False, nrows=10000\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " Topic | \n",
- " Sentiment | \n",
- " TweetId | \n",
- " TweetDate | \n",
- " TweetText | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " apple | \n",
- " positive | \n",
- " 126415614616154112 | \n",
- " Tue Oct 18 21:53:25 +0000 2011 | \n",
- " Now all @Apple has to do is get swype on the i... | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " Topic Sentiment TweetId TweetDate \\\n",
- "0 apple positive 126415614616154112 Tue Oct 18 21:53:25 +0000 2011 \n",
- "\n",
- " TweetText \n",
- "0 Now all @Apple has to do is get swype on the i... "
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text_data_added.head(1)"
- ]
- },
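- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A quick way to compare the two schemas -- a small sketch -- is to print both column lists; at this point the new data set still has its original headers."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# compare the schemas of the two datasets before aggregating\n",
- "print(text_data.columns.tolist())\n",
- "print(text_data_added.columns.tolist())"
- ]
- },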
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Add the missing column to the new data set and fill it with `NULL`"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [],
- "source": [
- "text_data_added[\"user\"] = \"\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Renaming the new data set columns to combine two data sets"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " flag | \n",
- " target | \n",
- " tw_id | \n",
- " date | \n",
- " text | \n",
- " user | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " apple | \n",
- " positive | \n",
- " 126415614616154112 | \n",
- " Tue Oct 18 21:53:25 +0000 2011 | \n",
- " Now all @Apple has to do is get swype on the i... | \n",
- " | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " flag target tw_id date \\\n",
- "0 apple positive 126415614616154112 Tue Oct 18 21:53:25 +0000 2011 \n",
- "\n",
- " text user \n",
- "0 Now all @Apple has to do is get swype on the i... "
- ]
- },
- "execution_count": 26,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text_data_added.columns = [\"flag\", \"target\", \"tw_id\", \"date\", \"text\", \"user\"]\n",
- "text_data_added.head(1)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Change the `target` column to the same format as the `target` in the original data set\n",
- "Note that the `target` column in the new data set is marked as \"positive\", \"negative\", \"neutral\", and \"irrelevant\", whereas the `target` in the original data set is marked as \"0\" and \"4\". So let's map \"positive\" to 4, \"neutral\" to 2, and \"negative\" to 0 in our new data set so that they are consistent. For \"irrelevant\", which are either not English or Spam, you can either remove these if it is not valuable for your use case (In our use case of sentiment analysis, we will remove those since these text does not provide any value in terms of predicting sentiment) or map them to -1. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [],
- "source": [
- "# remove tweets labeled as irelevant\n",
- "text_data_added = text_data_added[text_data_added[\"target\"] != \"irelevant\"]\n",
- "# convert strings to number targets\n",
- "target_map = {\"positive\": 4, \"negative\": 0, \"neutral\": 2}\n",
- "text_data_added[\"target\"] = text_data_added[\"target\"].map(target_map)"
- ]
- },
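- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As a quick sanity check (a sketch, not part of the original flow), you can verify that the mapping left no unexpected labels behind:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# count the mapped targets; dropna=False surfaces any unmapped labels as NaN\n",
- "text_data_added[\"target\"].value_counts(dropna=False)"
- ]
- },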
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Combine the two data sets and save as one new file"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing to s3://sagemaker-us-east-2-060356833389/text_twitter_sentiment_full/sentiment_full.csv\n"
- ]
- }
- ],
- "source": [
- "text_data_new = pd.concat([text_data, text_data_added])\n",
- "filename = \"sentiment_full.csv\"\n",
- "text_data_new.to_csv(filename, index=False)\n",
- "upload_to_s3(bucket, \"text_twitter_sentiment_full\", filename)"
- ]
- },
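- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Before relying on the combined file, it is worth confirming that the row counts add up; a minimal sketch:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# the combined frame should have len(text_data) + len(text_data_added) rows\n",
- "print(text_data.shape, text_data_added.shape, text_data_new.shape)"
- ]
- },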
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Citation\n",
- "Twitter140 Data, Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.\n",
- "\n",
- "SMS Spaming data, Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.\n",
- "\n",
- "J! Archive, J! Archive is created by fans, for fans. The Jeopardy! game show and all elements thereof, including but not limited to copyright and trademark thereto, are the property of Jeopardy Productions, Inc. and are protected under law. This website is not affiliated with, sponsored by, or operated by Jeopardy Productions, Inc."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.4"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/ingest_data/index.rst b/ingest_data/index.rst
index 75335ad81f..e81eba8b63 100644
--- a/ingest_data/index.rst
+++ b/ingest_data/index.rst
@@ -20,9 +20,9 @@ SageMaker uses a `default bucket