The objective is to predict which Tweets are about real public emergencies (e.g. earthquakes, floods, terrorist events) and which are not.
(The terms 'Public Emergency' and 'Disaster' are used interchangeably.)
Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.
However, it’s not always clear whether a person’s words are actually announcing a disaster.
Source: Real or Not? NLP with Disaster Tweets, A Kaggle Competition
The dataset has the contents of the Tweet (`text` variable), the location from where the Tweet was posted (`location` variable), and a keyword associated with the Tweet (`keyword` variable).
The model will be built on the contents of the Tweet alone because, from a domain standpoint, the location and keyword information may not always be available when the algorithm is actually deployed.
This is the second exploration of the dataset. In the first attempt, I used an ML-based approach (an ensemble of ensemble methods) that yielded 80% accuracy on the test data.
The goal is to get a better model (~83% accuracy) using either a single Deep Learning model or an ensemble of several.
In this notebook, I have tried out two variants of an LSTM-based approach: one with pre-trained embeddings and one without. The model that incorporates pre-trained embeddings from the GloVe model had an accuracy of 83% on the test set.
I have also created a set of helper functions for data preprocessing, vocabulary building, and creating an embedding matrix from pre-trained embeddings.
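To make the three helper steps concrete, here is a minimal sketch of what such functions could look like. The function names, the `<pad>`/`<unk>` tokens, and the cleaning rules are illustrative assumptions, not the notebook's actual helpers. The only format it relies on is the standard GloVe text layout, where each line holds a word followed by its space-separated vector values.

```python
import re
from collections import Counter

# Hypothetical special tokens (an assumption, not taken from the notebook).
PAD, UNK = "<pad>", "<unk>"


def preprocess(tweet):
    """Lowercase a tweet, drop URLs and @mentions, and split it into word tokens."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet)
    return re.findall(r"[a-z0-9']+", tweet)


def build_vocab(token_lists, min_count=1):
    """Map each sufficiently frequent token to an integer id (0 = pad, 1 = unknown)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {PAD: 0, UNK: 1}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def build_embedding_matrix(vocab, glove_lines, dim):
    """Return len(vocab) rows of length dim; words missing from GloVe stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in vocab and len(values) == dim:
            matrix[vocab[word]] = [float(v) for v in values]
    return matrix


# Toy usage with made-up tweets and made-up 3-dimensional vectors.
tweets = ["Forest fire near La Ronge http://t.co/x", "I love this fire mixtape @dj"]
vocab = build_vocab([preprocess(t) for t in tweets])
glove_lines = ["fire 0.1 0.2 0.3", "forest 0.4 0.5 0.6"]
matrix = build_embedding_matrix(vocab, glove_lines, dim=3)
```

In practice `glove_lines` would be an open file handle over a downloaded GloVe text file, and the resulting rows could be converted to an array and used to initialise an embedding layer.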
The best-performing model had an accuracy of 83% on the test data.
The next step is to improve on these results by trying out a few more approaches, listed below. These will be incorporated in the second edition (Part 2) of the notebook.
- An Ensemble model (dense - word focussed + lstm - sequence focussed)
- An N-GRAM model (especially n = 2)
- Using Attention based frameworks
- (Optional) A Functional model (using Keras functional API)
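
As an illustration of the N-GRAM idea with n = 2, the sketch below extracts bigrams from a tokenised Tweet. The function name `ngrams` and the example tokens are assumptions for illustration only; how the bigrams would then be fed to a model is left open.

```python
def ngrams(tokens, n=2):
    """Return the contiguous n-token tuples of a token list, in order."""
    # A list shorter than n yields an empty result because the range is empty.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokenised tweet.
bigrams = ngrams(["forest", "fire", "near", "la", "ronge"])
# bigrams == [("forest", "fire"), ("fire", "near"), ("near", "la"), ("la", "ronge")]
```

Bigrams such as ("forest", "fire") keep some local word order that a purely word-level model discards, which is the motivation for trying this variant.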