Reading this file is optional: you can use the data without knowing how I collected it. But if you want more information about how the data was prepared, this document walks through the whole process.
The data used in this repo contains reviews of IMDb movies posted on Twitter. It is a CSV file with the following columns:

- `movie_id`: the ID of the movie on IMDb.
- `movie_name`: the name of the movie as shown on IMDb.
- `tweet_id`: the ID of the tweet on Twitter.
- `tweet_text`: the text of the tweet.
- `tweet_tone`: the tone of the tweet.
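If you only want to consume the data, a minimal sketch of loading it with pandas (assuming the CSV sits in your working directory) is:

```python
import pandas as pd

# Load the prepared dataset; the columns are the five described above.
df = pd.read_csv("MovieTweets.csv")
print(df.columns.tolist())
# ['movie_id', 'movie_name', 'tweet_id', 'tweet_text', 'tweet_tone']
```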
I have prepared this data using the following steps:
- Download this MovieTweetings data (65MB).
- After downloading and uncompressing it, you will find a few files. We are interested in just one of them, `test.dat`, which contains around 20,000 tweets.
- Download this data containing the movie names (2MB).
- Put the two files (`movies.dat` and `test.dat`) into a new directory called `data`.
- Run the `prepare_data.py` Python script, which produces the `MovieTweets.csv` file.
This script obtains the data in four steps:
- Step 1: extracting what we want from `test.dat` and `movies.dat`.
- Step 2: getting the tweets' content.
- Step 3: getting tones for the tweets.
- Step 4: grouping the data by movie.
The function responsible for Step 1 is `extract_info()`. It uses `test.dat` to extract `user_id`, `movie_id`, and `tweet_id`, and it uses `movies.dat` to extract the `movie_name` of each movie in our data. The output of this step is a CSV file called `MovieTweets1.csv`.
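The script itself isn't reproduced here, but a minimal sketch of what `extract_info()` does, assuming `::`-separated lines (the usual MovieTweetings layout) with `user_id::movie_id::tweet_id` in `test.dat` and `movie_id::movie_name::...` in `movies.dat` — the real files may order fields differently — could look like this:

```python
import csv

def extract_info(test_path="data/test.dat", movies_path="data/movies.dat",
                 out_path="MovieTweets1.csv", limit=10000):
    # Build a movie_id -> movie_name lookup from movies.dat.
    movie_names = {}
    with open(movies_path, encoding="utf-8") as f:
        for line in f:
            movie_id, movie_name = line.rstrip("\n").split("::")[:2]
            movie_names[movie_id] = movie_name

    # Pull user_id, movie_id, tweet_id from test.dat and join the name in.
    with open(test_path, encoding="utf-8") as f, \
         open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["user_id", "movie_id", "movie_name", "tweet_id"])
        for i, line in enumerate(f):
            if i >= limit:  # only the first 10,000 tweets (see below)
                break
            user_id, movie_id, tweet_id = line.rstrip("\n").split("::")[:3]
            writer.writerow([user_id, movie_id,
                             movie_names.get(movie_id, ""), tweet_id])
```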
The resulting CSV from this step won't contain `tweet_text` or `tweet_tone`, as both will be obtained in the following two steps. Also, it won't contain all 20,000 tweets from the `test.dat` file, just the first 10,000. Why is that? Twitter imposes a rate limit on retrieving data from its API. When you exceed this rate, you get a warning that says `Rate limit reached. Sleeping for: 758`, which means I had to slow down my retrieval rate to the point that collecting the 10,000 tweets took more than 3 hours.
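That warning matches what tweepy prints when it is told to wait out rate-limit windows. A minimal sketch of such a client setup, assuming tweepy 3.x and placeholder credentials (the repo's `twitter.py` may be built differently), looks like this:

```python
import tweepy

# Hypothetical credentials; replace with your own Twitter API keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# wait_on_rate_limit makes tweepy sleep until the rate-limit window resets
# instead of failing; wait_on_rate_limit_notify prints the familiar
# "Rate limit reached. Sleeping for: ..." warning (tweepy 3.x only).
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
```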
Step 2 is responsible for getting only the `tweet_text` using the Twitter API. The function responsible for that is `add_tweet_content()`, which uses the `twitter.py` script to call the Twitter API. The resulting CSV, called `MovieTweets2.csv`, contains just one additional column (`tweet_text`).
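A minimal sketch of the work `add_tweet_content()` has to do, reusing the tweepy `api` object from the sketch above (hypothetical; the actual `twitter.py` wrapper may differ):

```python
import pandas as pd
import tweepy

def add_tweet_content(api, in_path="MovieTweets1.csv",
                      out_path="MovieTweets2.csv"):
    df = pd.read_csv(in_path)
    texts = []
    for tweet_id in df["tweet_id"]:
        try:
            # Fetch one tweet by ID; raises if it was deleted or made private.
            texts.append(api.get_status(int(tweet_id)).text)
        except tweepy.TweepError:
            texts.append("")  # keep row alignment for unavailable tweets
    df["tweet_text"] = texts
    df.to_csv(out_path, index=False)
```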
Step 3 is responsible for getting the tone of our tweets using the IBM Tone Analyzer. The function responsible for this job is `add_tweet_tone()`, which uses `tone_analyzer.py` to communicate with the IBM API. The resulting CSV, called `MovieTweets.csv`, contains just one additional column (`tweet_tone`). The tweets in this data turned out to be so generic that the Tone Analyzer found no tones in them. So, to make things interesting, I've generated random tones for each tweet in the data.
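A minimal sketch of that random-tone fallback, using the seven tone names that appear in the sample row at the end of this document (the script's actual random distribution may differ):

```python
import random

TONES = ["Anger", "Fear", "Joy", "Sadness",
         "Analytical", "Confident", "Tentative"]

def random_tone():
    # Stand-in for a Tone Analyzer response: one random score per tone name.
    return {tone: random.random() for tone in TONES}
```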
Step 4 is simple. Here, we group the data from the last step by `movie_id`. The resulting CSV from this step is the one sent to the ElasticSearch engine. It contains these fields: `movie_id`, `movie_name`, `reviews_count`, `average_tones`, and `tweets`. It will look like this:
| movie_id | movie_name | reviews_count | average_tones | tweets |
|---|---|---|---|---|
| 993846 | The Wolf of Wall Street (2013) | 535 | {'Anger': 0.2844859585329844, 'Fear': 0.2838749471226159, 'Joy': 0.27755553696029167, 'Sadness': 0.26465190869767846, 'Analytical': 0.2751570982991599, 'Confident': 0.2899761642095416, 'Tentative': 0.27979233958022803} | [{'tweet_id': 4.21051e+17, 'tweet_text': 'I rated The Wolf of Wall Street 1/10 #IMDb http://t.co/eJtRKX4VB5'}, ...] |
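A minimal pandas sketch of this grouping step (hypothetical; `prepare_data.py` may implement it differently, and the output file name here is made up since the grouped file isn't named above). `average_tones` is computed as the per-tone mean over a movie's tweets, which matches the sample row:

```python
import ast
import pandas as pd

def group_by_movie(in_path="MovieTweets.csv",
                   out_path="MovieTweetsGrouped.csv"):
    df = pd.read_csv(in_path)
    # tweet_tone was written as a dict literal, so parse it back into a dict.
    df["tweet_tone"] = df["tweet_tone"].apply(ast.literal_eval)

    rows = []
    for movie_id, group in df.groupby("movie_id"):
        tones = pd.DataFrame(list(group["tweet_tone"]))
        rows.append({
            "movie_id": movie_id,
            "movie_name": group["movie_name"].iloc[0],
            "reviews_count": len(group),
            # Mean score per tone across all of this movie's tweets.
            "average_tones": tones.mean().to_dict(),
            "tweets": group[["tweet_id", "tweet_text"]].to_dict("records"),
        })
    pd.DataFrame(rows).to_csv(out_path, index=False)
```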