YelpDatasetETL

A MongoDB to Elasticsearch ETL pipeline

This project is a mini ETL pipeline that streams documents from a MongoDB collection to an Elasticsearch index after applying two layers of transformation.

The first transformation converts the Mongo BSON document to JSON that conforms to the Elasticsearch schema defined here.
The second transformation is applied only to selected collections that have text fields; a VADER sentiment analyzer is run on those fields.
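
As a rough illustration of the first layer, the sketch below turns one MongoDB document (already loaded as a Python dict) into an action dict of the shape accepted by the bulk helpers in the `elasticsearch` Python client. The function name `to_es_action`, the default type name `doc`, and the choice to reuse the Mongo `_id` as the Elasticsearch document id are assumptions made for this example, not code taken from this repository. Only top-level fields are handled.

```python
import datetime
import json


def to_es_action(mongo_doc, index_name, doc_type="doc"):
    """Turn one MongoDB document (a dict) into a bulk-index action dict.

    Hypothetical helper: the real pipeline may map fields differently.
    """
    source = {}
    for key, value in mongo_doc.items():
        if key == "_id":
            # The Mongo id is not JSON-serializable as-is; it is carried
            # over as the Elasticsearch document id instead of a field.
            continue
        if isinstance(value, (datetime.datetime, datetime.date)):
            # Dates become ISO 8601 strings, which Elasticsearch can parse.
            value = value.isoformat()
        source[key] = value
    return {
        "_index": index_name,
        "_type": doc_type,  # Elasticsearch 5.x still requires a type
        "_id": str(mongo_doc["_id"]),
        "_source": source,
    }


if __name__ == "__main__":
    sample = {"_id": 42, "name": "Cafe", "date": datetime.date(2017, 1, 2)}
    print(json.dumps(to_es_action(sample, "business"), sort_keys=True))
```

Calling `str()` on a `bson.ObjectId` yields its hex string, so the same function would work on documents read through `pymongo` without importing `bson` here.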

VADER Sentiment Analyzer

In this project, the analyzer can be applied in two ways:

using the NLTK package tool nltk.sentiment.vader.SentimentIntensityAnalyzer
creating an ingestion plugin that does the analysis
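
The first option can be sketched as follows. `SentimentIntensityAnalyzer.polarity_scores` is the NLTK call named above; the wrapper functions `nltk_vader_scorer` and `add_sentiment`, and the `sentiment` target field, are hypothetical names chosen for this example. The NLTK import is kept inside a function so the rest of the code can be exercised without NLTK or its `vader_lexicon` resource installed.

```python
def nltk_vader_scorer():
    """Return a text -> scores function backed by NLTK's VADER analyzer.

    Requires the nltk package and its 'vader_lexicon' resource.
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()
    return analyzer.polarity_scores


def add_sentiment(doc, text_field, scorer, target_field="sentiment"):
    """Return a copy of doc with sentiment scores stored under target_field.

    Documents without a non-empty text_field are returned unchanged.
    """
    enriched = dict(doc)
    text = doc.get(text_field)
    if text:
        enriched[target_field] = scorer(text)
    return enriched
```

A review document could then be enriched with `add_sentiment(review, "text", nltk_vader_scorer())`, assuming the review text lives in a field called `text`. `polarity_scores` returns a dict with the keys `neg`, `neu`, `pos` and `compound`.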

The aim of the project was to build an ETL pipeline and to learn about the ingest pipelines introduced in Elasticsearch 5.x.

Elasticsearch Version: 5.2.1

Python: 2.7.13

VaderSentimentJava: 1.0.1

elasticsearch-sentiment-plugin: 1.0.1

Dataset used:

https://www.yelp.com/dataset

business - 77445 records
photo_business - 200000 records
checkin - 55569 records
review - 2225213 records
tip - 591864 records
users - 552339 records

Usage

$ python YelpEtlPipeline.py -c business,user,checkin,tip,photo,review -t -n 4 
Connected to MongoDB Client                                      
Connected to ElasticSearch Client                                

Indexing business...                                             
Indexed 77445 / 77445 documents with 0 failures                  
Time taken for business ingestion : 65.8003674392 seconds.       

Indexing user...                                                 
Indexed 552339 / 552339 documents with 0 failures                
Time taken for user ingestion : 433.517403755 seconds.           

Indexing checkin...                                              
Indexed 55569 / 55569 documents with 0 failures                  
Time taken for checkin ingestion : 58.5147706969 seconds.        

Indexing tip...                                                  
Indexed 591864 / 591864 documents with 0 failures                
Time taken for tip ingestion : 410.815298934 seconds.            

Indexing photo...                                                
Indexed 200000 / 200000 documents with 0 failures                
Time taken for photo ingestion : 102.609901614 seconds.          

Indexing review...                                               
Indexed 2225213 / 2225213 documents with 0 failures              
Time taken for review ingestion : 10633.6480628 seconds.
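
A per-collection summary like the one in the log above could be produced with a loop along these lines. This is a sketch under assumptions, not the repository's `load.py`: `index_collection` is a hypothetical name, and `bulk_fn` stands for any callable that takes an iterable of actions and yields one `(ok, result)` pair per document.

```python
import time


def index_collection(name, actions, bulk_fn):
    """Send actions through bulk_fn and print a per-collection summary.

    Returns a (total, failures) tuple.
    """
    start = time.time()
    total = 0
    failures = 0
    print("Indexing {0}...".format(name))
    for ok, _result in bulk_fn(actions):
        total += 1
        if not ok:
            failures += 1
    print("Indexed {0} / {1} documents with {2} failures".format(
        total - failures, total, failures))
    print("Time taken for {0} ingestion : {1} seconds.".format(
        name, time.time() - start))
    return total, failures
```

With the `elasticsearch` Python client, `functools.partial(helpers.streaming_bulk, client, raise_on_error=False)` is one callable that fits the `bulk_fn` shape, since `streaming_bulk` yields `(ok, item)` tuples.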
