diff --git a/prep_data/index.rst b/prep_data/index.rst index 466575faae..28a668794d 100644 --- a/prep_data/index.rst +++ b/prep_data/index.rst @@ -49,4 +49,4 @@ Text data guide .. toctree:: :maxdepth: 1 - text_data/04_preprocessing_text_data_v3 + text_data/preprocessing_text_data diff --git a/prep_data/text_data/04_preprocessing_text_data_v3.ipynb b/prep_data/text_data/preprocessing_text_data.ipynb similarity index 54% rename from prep_data/text_data/04_preprocessing_text_data_v3.ipynb rename to prep_data/text_data/preprocessing_text_data.ipynb index 65e9260e57..825bef5d81 100644 --- a/prep_data/text_data/04_preprocessing_text_data_v3.ipynb +++ b/prep_data/text_data/preprocessing_text_data.ipynb @@ -6,7 +6,7 @@ "source": [ "# Preprocessing Text Data\n", "\n", - "The purpose of this notebook is to demonstrate how to preprocessing text data for next-step feature engineering and training a machine learning model via Amazon SageMaker. In this notebook we will focus on preprocessing our text data, and we will use the text data we ingested in a [sequel notebook](https://sagemaker-examples.readthedocs.io/en/latest/data_ingestion/012_Ingest_text_data_v2.html) to showcase text data preprocessing methodologies. We are going to discuss many possible methods to clean and enrich your text, but you do not need to run through every single step below. Usually, a rule of thumb is: if you are dealing with very noisy text, like social media text data, or nurse notes, then medium to heavy preprocessing effort might be needed, and if it's domain-specific corpus, text enrichment is helpful as well; if you are dealing with long and well-written documents such as news articles and papers, very light preprocessing is needed; you can add some enrichment to the data to better capture the sentence to sentence relationship and overall meaning. \n" + "The purpose of this notebook is to demonstrate how to preprocess text data for downstream feature engineering and for training a machine learning model via Amazon SageMaker. In this notebook we will focus on preprocessing our text data. We are going to discuss many possible methods to clean and enrich your text, but you do not need to run through every single step below. A useful rule of thumb: if you are dealing with very noisy text, such as social media posts or nurse notes, medium to heavy preprocessing is usually needed, and if you are working with a domain-specific corpus, text enrichment helps as well; if you are dealing with long, well-written documents such as news articles and papers, very light preprocessing is enough, and you can add some enrichment to better capture sentence-to-sentence relationships and the overall meaning. \n" ] }, { @@ -15,7 +15,7 @@ "source": [ "## Overview\n", "### Input Format \n", - "Labeled text data sometimes are in a structured data format. You might come across this when working on reviews for sentiment analysis, news headlines for topic modeling, or documents for text classification. One column of the dataset could be dedicated for the label, one column for the text, and sometimes other columns as attributes. You can process this dataset format similar to how you would process tabular data and ingest them in the [last section](https://github.com/aws/amazon-sagemaker-examples/blob/master/preprocessing/tabular_data/preprocessing_tabular_data.ipynb). Sometimes text data, especially raw text data, comes as unstructured data and is often in .json or .txt format. 
To work with this type of formatting, you will need to first extract useful information from the original dataset. \n", + "Labeled text data sometimes comes in a structured data format. You might come across this when working on reviews for sentiment analysis, news headlines for topic modeling, or documents for text classification. One column of the dataset could be dedicated to the label, one column for the text, and sometimes other columns as attributes. You can process this dataset format similarly to how you would process tabular data (see [Preprocessing Tabular Data](https://github.com/aws/amazon-sagemaker-examples/blob/main/prep_data/tabular_data/01_preprocessing_tabular_data.ipynb) for an example). Sometimes text data, especially raw text data, comes as unstructured data and is often in .json or .txt format. To work with this type of formatting, you will need to first extract useful information from the original dataset. \n", "\n", "### Use Cases\n", "Text data contains rich information and it's everywhere. Applicable use cases include Voice of Customer (VOC), fraud detection, warranty analysis, chatbot and customer service routing, audience analysis, and much more. \n", @@ -36,7 +36,7 @@ "\n", "* [nltk (natural language toolkit)](https://www.nltk.org/), a leading platform includes multiple text processing libraries, which covers almost all aspects of preprocessing we will discuss in this section: tokenization, stemming, lemmatization, parsing, chunking, POS tagging, stop words, etc.\n", "\n", - "* [SpaCy] (https://spacy.io/), offers most functionality provided by `nltk`, and provides pre-trained word vectors and models. It is scalable and designed for production usage.\n", + "* [SpaCy](https://spacy.io/), offers most functionality provided by `nltk`, and provides pre-trained word vectors and models. It is scalable and designed for production usage.\n", "\n", "* [Gensim (Generate Similar)](https://radimrehurek.com/gensim/about.html), \"designed specifically for topic modeling, document indexing, and similarity retrieval with large corpora\". \n", "\n", @@ -45,26 +45,17 @@ }, { "cell_type": "code", - "execution_count": 69, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[33mWARNING: You are using pip version 20.0.2; however, version 20.2.4 is available.\n", - "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.\u001b[0m\n", - "Note: you may need to restart the kernel to use updated packages.\n" - ] - } - ], + "outputs": [], "source": [ - "%pip install -qU 'sagemaker>=2.15.0' spacy gensim textblob emot autocorrect" + "! python -m pip install --upgrade pip\n", "! 
pip install -U 'sagemaker>=2.15.0' spacy gensim==4.0.0 textblob emot==2.1 autocorrect" ] }, { "cell_type": "code", - "execution_count": 70, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -80,7 +71,7 @@ }, { "cell_type": "code", - "execution_count": 71, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -116,7 +107,7 @@ }, { "cell_type": "code", - "execution_count": 72, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -136,7 +127,7 @@ }, { "cell_type": "code", - "execution_count": 73, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -146,31 +137,9 @@ }, { "cell_type": "code", - "execution_count": 74, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "URL transformed to HTTPS due to an HSTS policy\n", - "--2020-11-02 21:57:53-- https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip\n", - "Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64\n", - "Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.\n", - "HTTP request sent, awaiting response... 200 OK\n", - "Length: 81363704 (78M) [application/zip]\n", - "Saving to: ‘sentimen140.zip’\n", - "\n", - "sentimen140.zip 100%[===================>] 77.59M 23.9MB/s in 3.5s \n", - "\n", - "2020-11-02 21:57:57 (22.1 MB/s) - ‘sentimen140.zip’ saved [81363704/81363704]\n", - "\n", - "Archive: sentimen140.zip\n", - " inflating: sentiment140/testdata.manual.2009.06.14.csv \n", - " inflating: sentiment140/training.1600000.processed.noemoticon.csv \n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip -O sentimen140.zip\n", "# Uncompressing\n", @@ -179,18 +148,9 @@ }, { "cell_type": "code", - "execution_count": 75, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140/sentiment140/training.1600000.processed.noemoticon.csv\n", - "Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140/sentiment140/testdata.manual.2009.06.14.csv\n" - ] - } - ], + "outputs": [], "source": [ "# upload the files to the S3 bucket\n", "csv_files = glob.glob(\"sentiment140/*.csv\")\n", @@ -210,19 +170,18 @@ }, { "cell_type": "code", - "execution_count": 76, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", + "import boto3\n", "\n", "prefix = \"text_sentiment140/sentiment140\"\n", "filename = \"training.1600000.processed.noemoticon.csv\"\n", - "data_s3_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n", + "s3.Bucket(bucket).download_file(prefix + \"/\" + filename, filename)\n", "# we will showcase with a smaller subset of data for demonstration purpose\n", - "text_data = pd.read_csv(\n", - " data_s3_location, header=None, encoding=\"ISO-8859-1\", low_memory=False, nrows=10000\n", - ")\n", + "text_data = pd.read_csv(filename, header=None, encoding=\"ISO-8859-1\", low_memory=False, nrows=10000)\n", "text_data.columns = [\"target\", \"tw_id\", \"date\", \"flag\", \"user\", \"text\"]" ] }, @@ -238,25 +197,9 @@ }, { "cell_type": "code", - "execution_count": 77, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. 
;D\n", - "1 is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!\n", - "2 @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds\n", - "3 my whole body feels itchy and like its on fire \n", - "4 @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. \n", - "Name: text, dtype: object" - ] - }, - "execution_count": 77, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "pd.set_option(\"display.max_colwidth\", None) # show full content in a column\n", "text_data[\"text\"][:5]" @@ -294,7 +237,7 @@ }, { "cell_type": "code", - "execution_count": 78, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -312,18 +255,9 @@ }, { "cell_type": "code", - "execution_count": 79, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D\n", - "Removed URL:@switchfoot - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D\n" - ] - } - ], + "outputs": [], "source": [ "print(text_data[\"text\"][0])\n", "print(\"Removed URL:\" + remove_urls(text_data[\"text\"][0]))" @@ -338,7 +272,7 @@ }, { "cell_type": "code", - "execution_count": 80, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -347,7 +281,7 @@ }, { "cell_type": "code", - "execution_count": 81, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -363,7 +297,7 @@ }, { "cell_type": "code", - "execution_count": 82, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -387,18 +321,9 @@ }, { "cell_type": "code", - "execution_count": 83, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "original text: @switchfoot http/twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. \n", - "removed emoticons: @switchfoot httpSkeptical annoyed undecided uneasy or hesitant/twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. Wink or smirk\n" - ] - } - ], + "outputs": [], "source": [ "print(\"original text: \" + remove_emoticons(text_data[\"text\"][0]))\n", "print(\"removed emoticons: \" + convert_emoticons(text_data[\"text\"][0]))" @@ -414,7 +339,7 @@ }, { "cell_type": "code", - "execution_count": 84, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -423,57 +348,9 @@ }, { "cell_type": "code", - "execution_count": 85, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
textcleaned_text
0@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D@switchfoot - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. Wink or smirk
\n", - "
" - ], - "text/plain": [ - " text \\\n", - "0 @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D \n", - "\n", - " cleaned_text \n", - "0 @switchfoot - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. Wink or smirk " - ] - }, - "execution_count": 85, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "text_data[[\"text\", \"cleaned_text\"]][:1]" ] @@ -505,57 +382,9 @@ }, { "cell_type": "code", - "execution_count": 86, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
cleaned_texttext_lower
0@switchfoot - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. Wink or smirk@switchfoot - awww, that's a bummer. you shoulda got david carr of third day to do it. wink or smirk
\n", - "
" - ], - "text/plain": [ - " cleaned_text \\\n", - "0 @switchfoot - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. Wink or smirk \n", - "\n", - " text_lower \n", - "0 @switchfoot - awww, that's a bummer. you shoulda got david carr of third day to do it. wink or smirk " - ] - }, - "execution_count": 86, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "text_data[\"text_lower\"] = text_data[\"cleaned_text\"].str.lower()\n", "text_data[[\"cleaned_text\", \"text_lower\"]][:1]" @@ -573,7 +402,7 @@ }, { "cell_type": "code", - "execution_count": 87, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -588,20 +417,9 @@ }, { "cell_type": "code", - "execution_count": 88, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'@kenichan i dived many times for the ball. managed to save % the rest go out of bounds'" - ] - }, - "execution_count": 88, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# let's check the results of our function\n", "remove_numbers(text_data[\"text_lower\"][2])" @@ -609,7 +427,7 @@ }, { "cell_type": "code", - "execution_count": 89, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -626,7 +444,7 @@ }, { "cell_type": "code", - "execution_count": 90, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -643,18 +461,9 @@ }, { "cell_type": "code", - "execution_count": 91, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "original text: @switchfoot - awww, that's a bummer. you shoulda got david carr of third day to do it. wink or smirk\n", - "removed mentions: - awww, that's a bummer. you shoulda got david carr of third day to do it. wink or smirk\n" - ] - } - ], + "outputs": [], "source": [ "print(\"original text: \" + text_data[\"text_lower\"][0])\n", "print(\"removed mentions: \" + remove_mentions(text_data[\"text_lower\"][0]))" @@ -662,7 +471,7 @@ }, { "cell_type": "code", - "execution_count": 92, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -683,7 +492,7 @@ }, { "cell_type": "code", - "execution_count": 93, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -694,62 +503,9 @@ }, { "cell_type": "code", - "execution_count": 94, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
textnormalized_textmentions
0@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D- awww, that's a bummer. you shoulda got david carr of third day to do it. wink or smirk[switchfoot]
\n", - "
" - ], - "text/plain": [ - " text \\\n", - "0 @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D \n", - "\n", - " normalized_text \\\n", - "0 - awww, that's a bummer. you shoulda got david carr of third day to do it. wink or smirk \n", - "\n", - " mentions \n", - "0 [switchfoot] " - ] - }, - "execution_count": 94, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "text_data[[\"text\", \"normalized_text\", \"mentions\"]].head(1)" ] @@ -765,7 +521,7 @@ }, { "cell_type": "code", - "execution_count": 95, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -785,27 +541,16 @@ }, { "cell_type": "code", - "execution_count": 96, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'i dived many times for the ball managed to save the rest go out of bounds'" - ] - }, - "execution_count": 96, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "remove_punctuation(text_data[\"normalized_text\"][2])" ] }, { "cell_type": "code", - "execution_count": 97, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -822,7 +567,7 @@ }, { "cell_type": "code", - "execution_count": 98, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -838,18 +583,9 @@ }, { "cell_type": "code", - "execution_count": 99, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "original text: i dived many times for the ball managed to save the rest go out of bounds\n", - "removed whitespaces: i dived many times for the ball managed to save the rest go out of bounds\n" - ] - } - ], + "outputs": [], "source": [ "print(\"original text: \" + text_data[\"normalized_text\"][2])\n", "print(\"removed whitespaces: \" + remove_whitespace(text_data[\"normalized_text\"][2]))" @@ -857,7 +593,7 @@ }, { "cell_type": "code", - "execution_count": 100, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -884,35 +620,16 @@ }, { "cell_type": "code", - "execution_count": 101, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n", - "[nltk_data] Package punkt is already up-to-date!\n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 101, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "nltk.download(\"punkt\")" ] }, { "cell_type": "code", - "execution_count": 102, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -931,7 +648,7 @@ }, { "cell_type": "code", - "execution_count": 103, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -940,57 +657,9 @@ }, { "cell_type": "code", - "execution_count": 104, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
normalized_texttokenized_text
0awww thats a bummer you shoulda got david carr of third day to do it wink or smirk[awww, thats, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, wink, or, smirk]
\n", - "
" - ], - "text/plain": [ - " normalized_text \\\n", - "0 awww thats a bummer you shoulda got david carr of third day to do it wink or smirk \n", - "\n", - " tokenized_text \n", - "0 [awww, thats, a, bummer, you, shoulda, got, david, carr, of, third, day, to, do, it, wink, or, smirk] " - ] - }, - "execution_count": 104, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "text_data[[\"normalized_text\", \"tokenized_text\"]][:1]" ] @@ -1006,19 +675,9 @@ }, { "cell_type": "code", - "execution_count": 105, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[nltk_data] Downloading package stopwords to\n", - "[nltk_data] /home/ec2-user/nltk_data...\n", - "[nltk_data] Package stopwords is already up-to-date!\n" - ] - } - ], + "outputs": [], "source": [ "nltk.download(\"stopwords\")\n", "from nltk.corpus import stopwords" @@ -1026,7 +685,7 @@ }, { "cell_type": "code", - "execution_count": 106, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1044,29 +703,9 @@ }, { "cell_type": "code", - "execution_count": 107, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[('i', 5317),\n", - " ('to', 4047),\n", - " ('the', 3264),\n", - " ('a', 2379),\n", - " ('my', 2271),\n", - " ('and', 1955),\n", - " ('is', 1819),\n", - " ('in', 1549),\n", - " ('it', 1495),\n", - " ('for', 1343)]" - ] - }, - "execution_count": 107, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "from collections import Counter\n", "\n", @@ -1085,28 +724,9 @@ }, { "cell_type": "code", - "execution_count": 108, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[('rainboot', 1),\n", - " ('colleague', 1),\n", - " ('jaws', 1),\n", - " ('windsor', 1),\n", - " ('castiel', 1),\n", - " ('georgous', 1),\n", - " ('thingsss', 1),\n", - " ('howwww', 1),\n", - " ('christopher', 1)]" - ] - }, - "execution_count": 108, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# least frequent words\n", "counter.most_common()[:-10:-1]" @@ -1114,7 +734,7 @@ }, { "cell_type": "code", - "execution_count": 109, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1138,18 +758,9 @@ }, { "cell_type": "code", - "execution_count": 110, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['i', 'dived', 'many', 'times', 'for', 'the', 'ball', 'managed', 'to', 'save', 'the', 'rest', 'go', 'out', 'of', 'bounds']\n", - "['dived', 'many', 'times', 'ball', 'managed', 'save', 'rest', 'go', 'bounds']\n" - ] - } - ], + "outputs": [], "source": [ "print(text_data[\"tokenized_text\"][2])\n", "print(remove_stopwords(text_data[\"tokenized_text\"][2]))" @@ -1157,7 +768,7 @@ }, { "cell_type": "code", - "execution_count": 111, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1174,7 +785,7 @@ }, { "cell_type": "code", - "execution_count": 112, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1190,7 +801,7 @@ }, { "cell_type": "code", - "execution_count": 113, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1213,18 +824,9 @@ }, { "cell_type": "code", - "execution_count": 114, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - 
"output_type": "stream", - "text": [ - "['counts', 'idk', 'either', 'never', 'talk', 'anymore']\n", - "['counts', 'i', 'do', 'not', 'know', 'either', 'never', 'talk', 'anymore']\n" - ] - } - ], + "outputs": [], "source": [ "print(text_data[\"tokenized_text\"][13])\n", "print(translator(text_data[\"tokenized_text\"][13]))" @@ -1232,7 +834,7 @@ }, { "cell_type": "code", - "execution_count": 115, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1249,7 +851,7 @@ }, { "cell_type": "code", - "execution_count": 116, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1258,7 +860,7 @@ }, { "cell_type": "code", - "execution_count": 117, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1279,18 +881,9 @@ }, { "cell_type": "code", - "execution_count": 118, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['awww', 'bummer', 'shoulda', 'got', 'david', 'carr', 'third', 'day', 'wink', 'smirk']\n", - "['www', 'summer', 'should', 'got', 'david', 'carr', 'third', 'day', 'wink', 'smirk']\n" - ] - } - ], + "outputs": [], "source": [ "print(text_data[\"tokenized_text\"][0])\n", "print(spelling_correct(text_data[\"tokenized_text\"][0]))" @@ -1298,7 +891,7 @@ }, { "cell_type": "code", - "execution_count": 119, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1323,7 +916,7 @@ }, { "cell_type": "code", - "execution_count": 120, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1353,35 +946,16 @@ }, { "cell_type": "code", - "execution_count": 121, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...\n", - "[nltk_data] Package wordnet is already up-to-date!\n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 121, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "nltk.download(\"wordnet\")" ] }, { "cell_type": "code", - "execution_count": 122, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1412,19 +986,9 @@ }, { "cell_type": "code", - "execution_count": 123, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['dived', 'many', 'times', 'ball', 'managed', 'save', 'rest', 'go', 'bounds']\n", - "['dive', 'mani', 'time', 'ball', 'manag', 'save', 'rest', 'go', 'bound']\n", - "['dive', 'many', 'time', 'ball', 'manage', 'save', 'rest', 'go', 'bound']\n" - ] - } - ], + "outputs": [], "source": [ "print(text_data[\"tokenized_text\"][2])\n", "print(stem_text(text_data[\"tokenized_text\"][2]))\n", @@ -1440,7 +1004,7 @@ }, { "cell_type": "code", - "execution_count": 124, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1458,98 +1022,9 @@ }, { "cell_type": "code", - "execution_count": 125, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
textstem_textlemma_text
1972feeling very poorly and sorry for myself. Can't swallow, ow Stupid glands.[feel, poor, sorri, cant, swallow, ow, stupid, gland][feel, poorly, sorry, cant, swallow, ow, stupid, glands]
5625@LorettaK @HeatherShorter Seriously though - there are 6 pairs of shoes in that fedex box, all bought recently[serious, though, pair, shoe, fedex, box, bought, recent][seriously, though, pair, shoe, fedex, box, buy, recently]
7138was late to work and hopes she is not in trouble...[late, work, hope, troubl][late, work, hop, trouble]
1326has to return the shirt she bought from Topshop bc she has $50 in her bank account that has to last her the rest of the month, life sucks[return, shirt, bought, topshop, bc, bank, account, last, rest, month, life, suck][return, shirt, buy, topshop, bc, bank, account, last, rest, month, life, suck]
324@ridley1013 I agree. The shapeshifting is a copout. I was so excited for Angela's ep, I thought it was this week. Noah was awesome tho![agre, shapeshift, copout, excit, angel, ep, thought, week, noah, awesom, tho][agree, shapeshifting, copout, excite, angels, ep, think, week, noah, awesome, tho]
\n", - "
" - ], - "text/plain": [ - " text \\\n", - "1972 feeling very poorly and sorry for myself. Can't swallow, ow Stupid glands. \n", - "5625 @LorettaK @HeatherShorter Seriously though - there are 6 pairs of shoes in that fedex box, all bought recently \n", - "7138 was late to work and hopes she is not in trouble... \n", - "1326 has to return the shirt she bought from Topshop bc she has $50 in her bank account that has to last her the rest of the month, life sucks \n", - "324 @ridley1013 I agree. The shapeshifting is a copout. I was so excited for Angela's ep, I thought it was this week. Noah was awesome tho! \n", - "\n", - " stem_text \\\n", - "1972 [feel, poor, sorri, cant, swallow, ow, stupid, gland] \n", - "5625 [serious, though, pair, shoe, fedex, box, bought, recent] \n", - "7138 [late, work, hope, troubl] \n", - "1326 [return, shirt, bought, topshop, bc, bank, account, last, rest, month, life, suck] \n", - "324 [agre, shapeshift, copout, excit, angel, ep, thought, week, noah, awesom, tho] \n", - "\n", - " lemma_text \n", - "1972 [feel, poorly, sorry, cant, swallow, ow, stupid, glands] \n", - "5625 [seriously, though, pair, shoe, fedex, box, buy, recently] \n", - "7138 [late, work, hop, trouble] \n", - "1326 [return, shirt, buy, topshop, bc, bank, account, last, rest, month, life, suck] \n", - "324 [agree, shapeshifting, copout, excite, angels, ep, think, week, noah, awesome, tho] " - ] - }, - "execution_count": 125, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "text_data.sample(5)[[\"text\", \"stem_text\", \"lemma_text\"]]" ] @@ -1574,51 +1049,18 @@ }, { "cell_type": "code", - "execution_count": 126, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[nltk_data] Downloading package averaged_perceptron_tagger to\n", - "[nltk_data] /home/ec2-user/nltk_data...\n", - "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n", - "[nltk_data] date!\n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 126, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "nltk.download(\"averaged_perceptron_tagger\")" ] }, { "cell_type": "code", - "execution_count": 127, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "5444 [workingggggggg, ughhh, phone, wont, let, twitter]\n", - "Name: lemma_text, dtype: object" - ] - }, - "execution_count": 127, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "text_example = text_data.sample()[\"lemma_text\"]\n", "text_example" @@ -1626,17 +1068,9 @@ }, { "cell_type": "code", - "execution_count": 128, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[('workingggggggg', 'NN'), ('ughhh', 'JJ'), ('phone', 'NN'), ('wont', 'NN'), ('let', 'NN'), ('twitter', 'NN')]\n" - ] - } - ], + "outputs": [], "source": [ "from textblob import TextBlob\n", "\n", @@ -1654,48 +1088,18 @@ }, { "cell_type": "code", - "execution_count": 129, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[nltk_data] Downloading package brown to /home/ec2-user/nltk_data...\n", - "[nltk_data] Package brown is already up-to-date!\n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 129, - 
"metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "nltk.download(\"brown\")" ] }, { "cell_type": "code", - "execution_count": 130, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'sad kutner kill far show house'" - ] - }, - "execution_count": 130, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "# orginal text:\n", "text_example = text_data.sample()[\"lemma_text\"]\n", @@ -1704,18 +1108,9 @@ }, { "cell_type": "code", - "execution_count": 131, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "sad kutner\n", - "show house\n" - ] - } - ], + "outputs": [], "source": [ "# noun phrases that can be extracted from this sentence\n", "result = TextBlob(\" \".join(text_example.values[0]))\n", @@ -1733,31 +1128,9 @@ }, { "cell_type": "code", - "execution_count": 132, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "[nltk_data] Downloading package maxent_ne_chunker to\n", - "[nltk_data] /home/ec2-user/nltk_data...\n", - "[nltk_data] Package maxent_ne_chunker is already up-to-date!\n", - "[nltk_data] Downloading package words to /home/ec2-user/nltk_data...\n", - "[nltk_data] Package words is already up-to-date!\n" - ] - }, - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 132, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "nltk.download(\"maxent_ne_chunker\")\n", "nltk.download(\"words\")" @@ -1765,17 +1138,9 @@ }, { "cell_type": "code", - "execution_count": 133, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "original text: uh oh think get sick\n" - ] - } - ], + "outputs": [], "source": [ "text_example_enr = text_data.sample()[\"lemma_text\"].values[0]\n", "print(\"original text: \" + \" \".join(text_example_enr))" @@ -1783,17 +1148,9 @@ }, { "cell_type": "code", - "execution_count": 134, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(S uh/JJ oh/MD think/VB get/VB sick/JJ)\n" - ] - } - ], + "outputs": [], "source": [ "from nltk import pos_tag, ne_chunk\n", "\n", @@ -1812,129 +1169,9 @@ }, { "cell_type": "code", - "execution_count": 135, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
targettw_iddateflagusertextcleaned_texttext_lowernormalized_textmentionstokenized_textstem_textlemma_text
001467810369Mon Apr 06 22:19:45 PDT 2009NO_QUERY_TheSpecialOne_@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D@switchfoot - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. Wink or smirk@switchfoot - awww, that's a bummer. you shoulda got david carr of third day to do it. wink or smirkawww thats a bummer you shoulda got david carr of third day to do it wink or smirk[switchfoot][www, summer, should, got, david, carr, third, day, wink, smirk][www, summer, should, got, david, carr, third, day, wink, smirk][www, summer, should, get, david, carr, third, day, wink, smirk]
101467810672Mon Apr 06 22:19:49 PDT 2009NO_QUERYscotthamiltonis upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!is upset that he can't update his facebook by texting it... and might cry as a result school today also. blah!is upset that he cant update his facebook by texting it and might cry as a result school today also blah[][upset, cant, update, facebook, texting, might, cry, result, school, today, also, blah][upset, cant, updat, facebook, text, might, cri, result, school, today, also, blah][upset, cant, update, facebook, texting, might, cry, result, school, today, also, blah]
\n", - "
" - ], - "text/plain": [ - " target tw_id date flag \\\n", - "0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY \n", - "1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY \n", - "\n", - " user \\\n", - "0 _TheSpecialOne_ \n", - "1 scotthamilton \n", - "\n", - " text \\\n", - "0 @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D \n", - "1 is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! \n", - "\n", - " cleaned_text \\\n", - "0 @switchfoot - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. Wink or smirk \n", - "1 is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! \n", - "\n", - " text_lower \\\n", - "0 @switchfoot - awww, that's a bummer. you shoulda got david carr of third day to do it. wink or smirk \n", - "1 is upset that he can't update his facebook by texting it... and might cry as a result school today also. blah! \n", - "\n", - " normalized_text \\\n", - "0 awww thats a bummer you shoulda got david carr of third day to do it wink or smirk \n", - "1 is upset that he cant update his facebook by texting it and might cry as a result school today also blah \n", - "\n", - " mentions \\\n", - "0 [switchfoot] \n", - "1 [] \n", - "\n", - " tokenized_text \\\n", - "0 [www, summer, should, got, david, carr, third, day, wink, smirk] \n", - "1 [upset, cant, update, facebook, texting, might, cry, result, school, today, also, blah] \n", - "\n", - " stem_text \\\n", - "0 [www, summer, should, got, david, carr, third, day, wink, smirk] \n", - "1 [upset, cant, updat, facebook, text, might, cri, result, school, today, also, blah] \n", - "\n", - " lemma_text \n", - "0 [www, summer, should, get, david, carr, third, day, wink, smirk] \n", - "1 [upset, cant, update, facebook, texting, might, cry, result, school, today, also, blah] " - ] - }, - "execution_count": 135, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "text_data.head(2)" ] @@ -1948,17 +1185,9 @@ }, { "cell_type": "code", - "execution_count": 136, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140_processed/processed_sentiment_140.csv\n" - ] - } - ], + "outputs": [], "source": [ "filename_write_to = \"processed_sentiment_140.csv\"\n", "text_data.to_csv(filename_write_to, index=False)\n", @@ -1971,8 +1200,7 @@ "source": [ "## Conclusion\n", "\n", - "Congratulations! You cleaned and prepared your text data and it is now ready to be vectorized or used for feature engineering. \n", - "Now that your data is ready to be converted into machine-readable format (numbers), we will cover extracting features and word embeddings in the next section **text data feature engineering**." + "Congratulations! You cleaned and prepared your text data and it is now ready to be vectorized or used for feature engineering. " ] }, { @@ -1985,9 +1213,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "conda_python3", "language": "python", - "name": "python3" + "name": "conda_python3" }, "language_info": { "codemirror_mode": { @@ -1999,7 +1227,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.4" + "version": "3.6.13" } }, "nbformat": 4,