diff --git a/ingest_data/012_Ingest_text_data_v2.ipynb b/ingest_data/012_Ingest_text_data_v2.ipynb
deleted file mode 100644
index a0a1a120f8..0000000000
--- a/ingest_data/012_Ingest_text_data_v2.ipynb
+++ /dev/null
@@ -1,1034 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Ingest Text Data\n",
- "Labeled text data can be in a structured data format, such as reviews for sentiment analysis, news headlines for topic modeling, or documents for text classification. In these cases, you may have one column for the label, one column for the text, and sometimes other columns for attributes. You can treat this structured data like tabular data, and ingest it in one of the ways discussed in the previous notebook [011_Ingest_tabular_data.ipynb](011_Ingest_tabular_data_v1.ipynb). Sometimes text data, especially raw text data comes as unstructured data and is often in .json or .txt format, and we will discuss how to ingest these types of data files into a SageMaker Notebook in this section.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Set Up Notebook"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[33mWARNING: You are using pip version 20.0.2; however, version 20.2.4 is available.\n",
- "You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.\u001b[0m\n",
- "Note: you may need to restart the kernel to use updated packages.\n"
- ]
- }
- ],
- "source": [
- "%pip install -q 's3fs==0.4.2'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "import json\n",
- "import glob\n",
- "import s3fs\n",
- "import sagemaker"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Get SageMaker session & default S3 bucket\n",
- "sagemaker_session = sagemaker.Session()\n",
- "bucket = sagemaker_session.default_bucket() # replace with your own bucket if you have one\n",
- "s3 = sagemaker_session.boto_session.resource(\"s3\")\n",
- "\n",
- "prefix = \"text_spam/spam\"\n",
- "prefix_json = \"json_jeo\"\n",
- "filename = \"SMSSpamCollection.txt\"\n",
- "filename_json = \"JEOPARDY_QUESTIONS1.json\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Downloading data from Online Sources\n",
- "\n",
- "### Text data (in structured .csv format): Twitter -- sentiment140\n",
- " **Sentiment140** This is the sentiment140 dataset. It contains 1.6M tweets extracted using the twitter API. The tweets have been annotated with sentiment (0 = negative, 4 = positive) and topics (hashtags used to retrieve tweets). The dataset contains the following columns:\n",
- "* `target`: the polarity of the tweet (0 = negative, 4 = positive)\n",
- "* `ids`: The id of the tweet ( 2087)\n",
- "* `date`: the date of the tweet (Sat May 16 23:58:44 UTC 2009)\n",
- "* `flag`: The query (lyx). If there is no query, then this value is NO_QUERY.\n",
- "* `user`: the user that tweeted (robotickilldozr)\n",
- "* `text`: the text of the tweet (Lyx is cool\n",
- "\n",
- "[Second Twitter data](https://github.com/guyz/twitter-sentiment-dataset) is a Twitter data set collected as an extension to Sanders Analytics Twitter sentiment corpus, originally designed for training and testing Twitter sentiment analysis algorithms. We will use this data to showcase how to aggregate two data sets if you want to enhance your current data set by adding more data to it."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
- "# helper functions to upload data to s3\n",
- "def write_to_s3(filename, bucket, prefix):\n",
- " # put one file in a separate folder. This is helpful if you read and prepare data with Athena\n",
- " filename_key = filename.split(\".\")[0]\n",
- " key = \"{}/{}/{}\".format(prefix, filename_key, filename)\n",
- " return s3.Bucket(bucket).upload_file(filename, key)\n",
- "\n",
- "\n",
- "def upload_to_s3(bucket, prefix, filename):\n",
- " url = \"s3://{}/{}/{}\".format(bucket, prefix, filename)\n",
- " print(\"Writing to {}\".format(url))\n",
- " write_to_s3(filename, bucket, prefix)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [],
- "source": [
- "# run this cell if you are in SageMaker Studio notebook\n",
- "#!apt-get install unzip"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "URL transformed to HTTPS due to an HSTS policy\n",
- "--2020-11-02 21:16:07-- https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip\n",
- "Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64\n",
- "Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 81363704 (78M) [application/zip]\n",
- "Saving to: ‘sentimen140.zip’\n",
- "\n",
- "sentimen140.zip 100%[===================>] 77.59M 18.9MB/s in 6.4s \n",
- "\n",
- "2020-11-02 21:16:14 (12.1 MB/s) - ‘sentimen140.zip’ saved [81363704/81363704]\n",
- "\n",
- "Archive: sentimen140.zip\n",
- " inflating: sentiment140/testdata.manual.2009.06.14.csv \n",
- " inflating: sentiment140/training.1600000.processed.noemoticon.csv \n"
- ]
- }
- ],
- "source": [
- "# download first twitter dataset\n",
- "!wget http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip -O sentimen140.zip\n",
- "# Uncompressing\n",
- "!unzip -o sentimen140.zip -d sentiment140"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140/sentiment140/training.1600000.processed.noemoticon.csv\n",
- "Writing to s3://sagemaker-us-east-2-060356833389/text_sentiment140/sentiment140/testdata.manual.2009.06.14.csv\n"
- ]
- }
- ],
- "source": [
- "# upload the files to the S3 bucket\n",
- "csv_files = glob.glob(\"sentiment140/*.csv\")\n",
- "for filename in csv_files:\n",
- " upload_to_s3(bucket, \"text_sentiment140\", filename)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2020-11-02 21:16:18-- https://raw.githubusercontent.com/zfz/twitter_corpus/master/full-corpus.csv\n",
- "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.200.133\n",
- "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.200.133|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 910195 (889K) [text/plain]\n",
- "Saving to: ‘full-corpus.csv.2’\n",
- "\n",
- "full-corpus.csv.2 100%[===================>] 888.86K --.-KB/s in 0.08s \n",
- "\n",
- "2020-11-02 21:16:19 (10.2 MB/s) - ‘full-corpus.csv.2’ saved [910195/910195]\n",
- "\n"
- ]
- }
- ],
- "source": [
- "# download second twitter dataset\n",
- "!wget https://raw.githubusercontent.com/zfz/twitter_corpus/master/full-corpus.csv"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing to s3://sagemaker-us-east-2-060356833389/text_twitter_sentiment_2/full-corpus.csv\n"
- ]
- }
- ],
- "source": [
- "filename = \"full-corpus.csv\"\n",
- "upload_to_s3(bucket, \"text_twitter_sentiment_2\", filename)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Text data (in .txt format): SMS Spam data \n",
- "[SMS Spam Data](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. Each line in the text file has the correct class followed by the raw message. We will use this data to showcase how to ingest text data in .txt format."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing to s3://sagemaker-us-east-2-060356833389/text_spam/spam/SMSSpamCollection.txt\n"
- ]
- }
- ],
- "source": [
- "txt_files = glob.glob(\"spam/*.txt\")\n",
- "for filename in txt_files:\n",
- " upload_to_s3(bucket, \"text_spam\", filename)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2020-11-02 21:16:19-- http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip\n",
- "Resolving www.dt.fee.unicamp.br (www.dt.fee.unicamp.br)... 143.106.12.20\n",
- "Connecting to www.dt.fee.unicamp.br (www.dt.fee.unicamp.br)|143.106.12.20|:80... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 210521 (206K) [application/zip]\n",
- "Saving to: ‘spam.zip’\n",
- "\n",
- "spam.zip 100%[===================>] 205.59K 112KB/s in 1.8s \n",
- "\n",
- "2020-11-02 21:16:21 (112 KB/s) - ‘spam.zip’ saved [210521/210521]\n",
- "\n",
- "Archive: spam.zip\n",
- " inflating: spam/readme \n",
- " inflating: spam/SMSSpamCollection.txt \n"
- ]
- }
- ],
- "source": [
- "!wget http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/smsspamcollection.zip -O spam.zip\n",
- "!unzip -o spam.zip -d spam"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Text Data (in .json format): Jeopardy Question data\n",
- "[Jeopardy Question](https://j-archive.com/) was obtained by crawling the Jeopardy question archive website. It is an unordered list of questions where each question has the following key-value pairs:\n",
- "\n",
- "* `category` : the question category, e.g. \"HISTORY\"\n",
- "* `value`: dollar value of the question as string, e.g. \"\\$200\"\n",
- "* `question`: text of question\n",
- "* `answer` : text of answer\n",
- "* `round`: one of \"Jeopardy!\",\"Double Jeopardy!\",\"Final Jeopardy!\" or \"Tiebreaker\"\n",
- "* `show_number` : string of show number, e.g '4680'\n",
- "* `air_date` : the show air date in format YYYY-MM-DD"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2020-11-02 21:16:22-- http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz\n",
- "Resolving skeeto.s3.amazonaws.com (skeeto.s3.amazonaws.com)... 52.216.241.76\n",
- "Connecting to skeeto.s3.amazonaws.com (skeeto.s3.amazonaws.com)|52.216.241.76|:80... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 12721082 (12M) [application/json]\n",
- "Saving to: ‘JEOPARDY_QUESTIONS1.json.gz’\n",
- "\n",
- "JEOPARDY_QUESTIONS1 100%[===================>] 12.13M 15.0MB/s in 0.8s \n",
- "\n",
- "2020-11-02 21:16:23 (15.0 MB/s) - ‘JEOPARDY_QUESTIONS1.json.gz’ saved [12721082/12721082]\n",
- "\n",
- "Writing to s3://sagemaker-us-east-2-060356833389/json_jeo/JEOPARDY_QUESTIONS1.json\n"
- ]
- }
- ],
- "source": [
- "# json file format\n",
- "!wget http://skeeto.s3.amazonaws.com/share/JEOPARDY_QUESTIONS1.json.gz\n",
- "# Uncompressing\n",
- "!gunzip -f JEOPARDY_QUESTIONS1.json.gz\n",
- "filename = \"JEOPARDY_QUESTIONS1.json\"\n",
- "upload_to_s3(bucket, \"json_jeo\", filename)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Ingest Data into Sagemaker Notebook\n",
- "## Method 1: Copying data to the Instance\n",
- "You can use the AWS Command Line Interface (CLI) to copy your data from s3 to your SageMaker instance. This is a quick and easy approach when you are dealing with medium sized data files, or you are experimenting and doing exploratory analysis. The documentation can be found [here](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Specify file names\n",
- "prefix = \"text_spam/spam\"\n",
- "prefix_json = \"json_jeo\"\n",
- "filename = \"SMSSpamCollection.txt\"\n",
- "filename_json = \"JEOPARDY_QUESTIONS1.json\"\n",
- "prefix_spam_2 = \"text_spam/spam_2\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "download failed: s3://sagemaker-us-east-2-060356833389/json_jeo/JEOPARDY_QUESTIONS1.json to text/json_jeo/JEOPARDY_QUESTIONS1.json [Errno 28] No space left on device\n",
- "download failed: s3://sagemaker-us-east-2-060356833389/json_jeo/JEOPARDY_QUESTIONS1/JEOPARDY_QUESTIONS1.json to text/json_jeo/JEOPARDY_QUESTIONS1/JEOPARDY_QUESTIONS1.json [Errno 28] No space left on device\n"
- ]
- }
- ],
- "source": [
- "# copy data to your sagemaker instance using AWS CLI\n",
- "!aws s3 cp s3://$bucket/$prefix_json/ text/$prefix_json/ --recursive"
- ]
- },
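- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "You can also copy a single file without the `--recursive` flag. The next cell is a quick sketch that copies the SMS spam file, assuming the `bucket`, `prefix`, and `filename` variables defined above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# copy a single object from S3 to the instance (a sketch; uses variables defined above)\n",
- "!aws s3 cp s3://$bucket/$prefix/$filename text/$prefix/$filename"
- ]
- },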
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "{'category': 'HISTORY', 'air_date': '2004-12-31', 'question': \"'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'\", 'value': '$200', 'answer': 'Copernicus', 'round': 'Jeopardy!', 'show_number': '4680'}\n"
- ]
- }
- ],
- "source": [
- "data_location = \"text/{}/{}\".format(prefix_json, filename_json)\n",
- "with open(data_location) as f:\n",
- " data = json.load(f)\n",
- " print(data[0])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Method 2: Use AWS compatible Python Packages\n",
- "When you are dealing with large data sets, or do not want to lose any data when you delete your Sagemaker Notebook Instance, you can use pre-built packages to access your files in S3 without copying files into your instance. These packages, such as `Pandas`, have implemented options to access data with a specified path string: while you will use `file://` on your local file system, you will use `s3://` instead to access the data through the AWS boto library. For `pandas`, any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. You can find additional documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). \n",
- "\n",
- "For text data, most of the time you can read it as line-by-line files or use Pandas to read it as a DataFrame by specifying a delimiter."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " 0 | \n",
- " 1 | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " ham | \n",
- " Go until jurong point, crazy.. Available only ... | \n",
- "
\n",
- " \n",
- " | 1 | \n",
- " ham | \n",
- " Ok lar... Joking wif u oni... | \n",
- "
\n",
- " \n",
- " | 2 | \n",
- " spam | \n",
- " Free entry in 2 a wkly comp to win FA Cup fina... | \n",
- "
\n",
- " \n",
- " | 3 | \n",
- " ham | \n",
- " U dun say so early hor... U c already then say... | \n",
- "
\n",
- " \n",
- " | 4 | \n",
- " ham | \n",
- " Nah I don't think he goes to usf, he lives aro... | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " 0 1\n",
- "0 ham Go until jurong point, crazy.. Available only ...\n",
- "1 ham Ok lar... Joking wif u oni...\n",
- "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
- "3 ham U dun say so early hor... U c already then say...\n",
- "4 ham Nah I don't think he goes to usf, he lives aro..."
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data_s3_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n",
- "s3_tabular_data = pd.read_csv(data_s3_location, sep=\"\\t\", header=None)\n",
- "s3_tabular_data.head()"
- ]
- },
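- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As noted above, you can also read text data line by line rather than as a DataFrame. The next cell is a minimal sketch that streams the same S3 object with the `boto3` resource created earlier and splits each line on the tab delimiter; it assumes the `bucket`, `prefix`, and `filename` variables defined above."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# read the object line by line instead of as a DataFrame (a sketch)\n",
- "body = s3.Object(bucket, \"{}/{}\".format(prefix, filename)).get()[\"Body\"].read().decode(\"utf-8\")\n",
- "for line in body.splitlines()[:3]:\n",
- " label, _, message = line.partition(\"\\t\")\n",
- " print(label, message[:50])"
- ]
- },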
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For JSON files, depending on the structure, you can also use `Pandas` `read_json` function to read it if it's a flat json file."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " category | \n",
- " air_date | \n",
- " question | \n",
- " value | \n",
- " answer | \n",
- " round | \n",
- " show_number | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " HISTORY | \n",
- " 2004-12-31 | \n",
- " 'For the last 8 years of his life, Galileo was... | \n",
- " $200 | \n",
- " Copernicus | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- " | 1 | \n",
- " ESPN's TOP 10 ALL-TIME ATHLETES | \n",
- " 2004-12-31 | \n",
- " 'No. 2: 1912 Olympian; football star at Carlis... | \n",
- " $200 | \n",
- " Jim Thorpe | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- " | 2 | \n",
- " EVERYBODY TALKS ABOUT IT... | \n",
- " 2004-12-31 | \n",
- " 'The city of Yuma in this state has a record a... | \n",
- " $200 | \n",
- " Arizona | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- " | 3 | \n",
- " THE COMPANY LINE | \n",
- " 2004-12-31 | \n",
- " 'In 1963, live on \"The Art Linkletter Show\", t... | \n",
- " $200 | \n",
- " McDonald\\'s | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- " | 4 | \n",
- " EPITAPHS & TRIBUTES | \n",
- " 2004-12-31 | \n",
- " 'Signer of the Dec. of Indep., framer of the C... | \n",
- " $200 | \n",
- " John Adams | \n",
- " Jeopardy! | \n",
- " 4680 | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " category air_date \\\n",
- "0 HISTORY 2004-12-31 \n",
- "1 ESPN's TOP 10 ALL-TIME ATHLETES 2004-12-31 \n",
- "2 EVERYBODY TALKS ABOUT IT... 2004-12-31 \n",
- "3 THE COMPANY LINE 2004-12-31 \n",
- "4 EPITAPHS & TRIBUTES 2004-12-31 \n",
- "\n",
- " question value answer \\\n",
- "0 'For the last 8 years of his life, Galileo was... $200 Copernicus \n",
- "1 'No. 2: 1912 Olympian; football star at Carlis... $200 Jim Thorpe \n",
- "2 'The city of Yuma in this state has a record a... $200 Arizona \n",
- "3 'In 1963, live on \"The Art Linkletter Show\", t... $200 McDonald\\'s \n",
- "4 'Signer of the Dec. of Indep., framer of the C... $200 John Adams \n",
- "\n",
- " round show_number \n",
- "0 Jeopardy! 4680 \n",
- "1 Jeopardy! 4680 \n",
- "2 Jeopardy! 4680 \n",
- "3 Jeopardy! 4680 \n",
- "4 Jeopardy! 4680 "
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "data_json_location = \"s3://{}/{}/{}\".format(bucket, prefix_json, filename_json)\n",
- "s3_tabular_data_json = pd.read_json(data_json_location, orient=\"records\")\n",
- "s3_tabular_data_json.head()"
- ]
- },
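- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "If your JSON is nested rather than flat, `read_json` alone may not produce a tidy table. One option -- sketched below with a hypothetical nested record that is not part of this dataset -- is to flatten the records with `pandas.json_normalize` (in older pandas versions this lives at `pandas.io.json.json_normalize`)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# flatten a hypothetical nested record with json_normalize (a sketch)\n",
- "nested_records = [{\"category\": \"HISTORY\", \"meta\": {\"round\": \"Jeopardy!\", \"value\": \"$200\"}}]\n",
- "flat = pd.json_normalize(nested_records)\n",
- "print(flat.columns.tolist()) # ['category', 'meta.round', 'meta.value']"
- ]
- },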
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Method 3: Use AWS Native methods\n",
- "#### s3fs\n",
- "[S3Fs](https://s3fs.readthedocs.io/en/latest/) is a Pythonic file interface to S3. It builds on top of botocore. The top-level class S3FileSystem holds connection information and allows typical file-system style operations like cp, mv, ls, du, glob, etc., as well as put/get of local files to/from S3. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "['sagemaker-us-east-2-060356833389/text_spam/spam/SMSSpamCollection',\n",
- " 'sagemaker-us-east-2-060356833389/text_spam/spam/SMSSpamCollection.txt']"
- ]
- },
- "execution_count": 18,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "fs = s3fs.S3FileSystem()\n",
- "data_s3fs_location = \"s3://{}/{}/\".format(bucket, prefix)\n",
- "# To List all files in your accessible bucket\n",
- "fs.ls(data_s3fs_location)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- " ham \\\n",
- "0 ham \n",
- "1 spam \n",
- "\n",
- " Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... \n",
- "0 Ok lar... Joking wif u oni... \n",
- "1 Free entry in 2 a wkly comp to win FA Cup fina... \n"
- ]
- }
- ],
- "source": [
- "# open it directly with s3fs\n",
- "data_s3fs_location = \"s3://{}/{}/{}\".format(bucket, prefix, filename) # S3 URL\n",
- "with fs.open(data_s3fs_location) as f:\n",
- " print(pd.read_csv(f, sep=\"\\t\", nrows=2))"
- ]
- },
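- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Beyond `ls` and `open`, `S3FileSystem` also supports put/get of local files. The next cell is a sketch that downloads the object to a local file with `fs.get`; the local filename is an arbitrary choice for illustration, and `fs.put` does the reverse."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# download the S3 object to the local file system with s3fs (a sketch)\n",
- "fs.get(data_s3fs_location, \"SMSSpamCollection_local.txt\")\n",
- "# fs.put(\"SMSSpamCollection_local.txt\", data_s3fs_location) # would upload it back"
- ]
- },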
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Aggregating datasets\n",
- "If you would like to enhance your data with more data collected for your use cases, you can always aggregate your newly-collected data with your current dataset. We will use two datasets -- Sentiment140 and Sanders Twitter Sentiment to show how to aggregate data together."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {},
- "outputs": [],
- "source": [
- "prefix_tw1 = \"text_sentiment140/sentiment140\"\n",
- "filename_tw1 = \"training.1600000.processed.noemoticon.csv\"\n",
- "prefix_added = \"text_twitter_sentiment_2\"\n",
- "filename_added = \"full-corpus.csv\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's read in our original data and take a look at its format and schema:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {},
- "outputs": [],
- "source": [
- "data_s3_location_base = \"s3://{}/{}/{}\".format(bucket, prefix_tw1, filename_tw1) # S3 URL\n",
- "# we will showcase with a smaller subset of data for demonstration purpose\n",
- "text_data = pd.read_csv(\n",
- " data_s3_location_base, header=None, encoding=\"ISO-8859-1\", low_memory=False, nrows=10000\n",
- ")\n",
- "text_data.columns = [\"target\", \"tw_id\", \"date\", \"flag\", \"user\", \"text\"]"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We have 6 columns, `date`, `text`, `flag` (which is the topic the twitter was queried), `tw_id` (tweet's id), `user` (user account name), and `target` (0 = neg, 4 = pos)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " target | \n",
- " tw_id | \n",
- " date | \n",
- " flag | \n",
- " user | \n",
- " text | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " 0 | \n",
- " 1467810369 | \n",
- " Mon Apr 06 22:19:45 PDT 2009 | \n",
- " NO_QUERY | \n",
- " _TheSpecialOne_ | \n",
- " @switchfoot http://twitpic.com/2y1zl - Awww, t... | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " target tw_id date flag \\\n",
- "0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY \n",
- "\n",
- " user text \n",
- "0 _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t... "
- ]
- },
- "execution_count": 22,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text_data.head(1)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Let's read in and take a look at the data we want to add to our original data. \n",
- "\n",
- "We will start by checking for columns for both data sets. The new data set has 5 columns, `TweetDate` which maps to `date`, `TweetText` which maps to `text`, `Topic` which maps to `flag`, `TweetId` which maps to `tw_id`, and `Sentiment` mapped to `target`. In this new data set, we don't have `user account name` column, so when we aggregate two data sets we can add this column to the data set to be added and fill it with `NULL` values. You can also remove this column from the original data if it does not provide much valuable information based on your use cases. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [],
- "source": [
- "data_s3_location_added = \"s3://{}/{}/{}\".format(bucket, prefix_added, filename_added) # S3 URL\n",
- "# we will showcase with a smaller subset of data for demonstration purpose\n",
- "text_data_added = pd.read_csv(\n",
- " data_s3_location_added, encoding=\"ISO-8859-1\", low_memory=False, nrows=10000\n",
- ")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " Topic | \n",
- " Sentiment | \n",
- " TweetId | \n",
- " TweetDate | \n",
- " TweetText | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " apple | \n",
- " positive | \n",
- " 126415614616154112 | \n",
- " Tue Oct 18 21:53:25 +0000 2011 | \n",
- " Now all @Apple has to do is get swype on the i... | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " Topic Sentiment TweetId TweetDate \\\n",
- "0 apple positive 126415614616154112 Tue Oct 18 21:53:25 +0000 2011 \n",
- "\n",
- " TweetText \n",
- "0 Now all @Apple has to do is get swype on the i... "
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text_data_added.head(1)"
- ]
- },
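- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A quick way to compare the two schemas -- a small sketch -- is to print both column lists; at this point the new data set still has its original headers."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# compare the schemas of the two datasets before aggregating\n",
- "print(text_data.columns.tolist())\n",
- "print(text_data_added.columns.tolist())"
- ]
- },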
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Add the missing column to the new data set and fill it with `NULL`"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [],
- "source": [
- "text_data_added[\"user\"] = \"\""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Renaming the new data set columns to combine two data sets"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " | \n",
- " flag | \n",
- " target | \n",
- " tw_id | \n",
- " date | \n",
- " text | \n",
- " user | \n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " | 0 | \n",
- " apple | \n",
- " positive | \n",
- " 126415614616154112 | \n",
- " Tue Oct 18 21:53:25 +0000 2011 | \n",
- " Now all @Apple has to do is get swype on the i... | \n",
- " | \n",
- "
\n",
- " \n",
- "
\n",
- "
"
- ],
- "text/plain": [
- " flag target tw_id date \\\n",
- "0 apple positive 126415614616154112 Tue Oct 18 21:53:25 +0000 2011 \n",
- "\n",
- " text user \n",
- "0 Now all @Apple has to do is get swype on the i... "
- ]
- },
- "execution_count": 26,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "text_data_added.columns = [\"flag\", \"target\", \"tw_id\", \"date\", \"text\", \"user\"]\n",
- "text_data_added.head(1)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Change the `target` column to the same format as the `target` in the original data set\n",
- "Note that the `target` column in the new data set is marked as \"positive\", \"negative\", \"neutral\", and \"irrelevant\", whereas the `target` in the original data set is marked as \"0\" and \"4\". So let's map \"positive\" to 4, \"neutral\" to 2, and \"negative\" to 0 in our new data set so that they are consistent. For \"irrelevant\", which are either not English or Spam, you can either remove these if it is not valuable for your use case (In our use case of sentiment analysis, we will remove those since these text does not provide any value in terms of predicting sentiment) or map them to -1. "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {},
- "outputs": [],
- "source": [
- "# remove tweets labeled as irelevant\n",
- "text_data_added = text_data_added[text_data_added[\"target\"] != \"irelevant\"]\n",
- "# convert strings to number targets\n",
- "target_map = {\"positive\": 4, \"negative\": 0, \"neutral\": 2}\n",
- "text_data_added[\"target\"] = text_data_added[\"target\"].map(target_map)"
- ]
- },
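- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "As a quick sanity check (a sketch, not part of the original flow), you can verify that the mapping left no unexpected labels behind:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# count the mapped targets; dropna=False surfaces any unmapped labels as NaN\n",
- "text_data_added[\"target\"].value_counts(dropna=False)"
- ]
- },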
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Combine the two data sets and save as one new file"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing to s3://sagemaker-us-east-2-060356833389/text_twitter_sentiment_full/sentiment_full.csv\n"
- ]
- }
- ],
- "source": [
- "text_data_new = pd.concat([text_data, text_data_added])\n",
- "filename = \"sentiment_full.csv\"\n",
- "text_data_new.to_csv(filename, index=False)\n",
- "upload_to_s3(bucket, \"text_twitter_sentiment_full\", filename)"
- ]
- },
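- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Before relying on the combined file, it is worth confirming that the row counts add up; a minimal sketch:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# the combined frame should have len(text_data) + len(text_data_added) rows\n",
- "print(text_data.shape, text_data_added.shape, text_data_new.shape)"
- ]
- },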
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Citation\n",
- "Twitter140 Data, Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.\n",
- "\n",
- "SMS Spaming data, Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.\n",
- "\n",
- "J! Archive, J! Archive is created by fans, for fans. The Jeopardy! game show and all elements thereof, including but not limited to copyright and trademark thereto, are the property of Jeopardy Productions, Inc. and are protected under law. This website is not affiliated with, sponsored by, or operated by Jeopardy Productions, Inc."
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.4"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
diff --git a/ingest_data/index.rst b/ingest_data/index.rst
index 75335ad81f..e81eba8b63 100644
--- a/ingest_data/index.rst
+++ b/ingest_data/index.rst
@@ -20,9 +20,9 @@ SageMaker uses a `default bucket