Sentiment analysis with
Python*
@vumaasha
- On a Sunday afternoon, you are bored. You want to watch a movie that has mixed reviews.
- You want to know the overall feeling about the movie, based on its reviews
- Let's build a Sentiment Model with Python!!
it's a blackbox ???
How to build the Blackbox?
Sentiments from movie reviews
This movie is really not all that bad. But then again, this movie genre is right down my alley. Sure, the sets are cheap, but they really did decent with what they had. If you like cheap, futuristic, post-apocalyptic B movies, then you'll love this one!! I sure did!. It's a movie to keep you interested forever.
Sentiments from movie reviews
This must be accompanied by a special rating and warning: NOT RECOMMENDED TO NORMAL PEOPLE. The obsession of Daneliuc with the most dirty body functions becomes here a real nightmare. Also, it's evident that the man is a misanthrope, he hates everybody - his country his people, his actors, his job. And this hatred makes him blind and he forgets anymore the profession he knew long ago. This so called "film" is just a hideous string of disgusting images, with no artistic value and no professionist knowledge. It is an insult to good taste and to good sense. Shame, shame, shame
Preprocessing
Tokenization
- converting sentences into words, removing punctuation and symbols
Stemming
- reducing each word to its root, e.g. running, runs and runner all become run
Filter stop words
- removing uninformative words (as, from, on, the, to, in, ...) from the corpus
Generate TF-IDF matrix
- computing the term frequency and inverse document frequency of each word in the corpus
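The first and third steps above can be sketched in plain Python. This is only an illustration, not the code used in the talk: the `tokenize` and `remove_stop_words` helpers and the tiny `STOP_WORDS` set are assumptions made up for this example, and stemming is left out because it needs a proper stemmer implementation.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"as", "from", "on", "the", "to", "in", "a", "is", "this", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens,
    dropping punctuation, digits and symbols."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]


review = "This movie is really not all that bad."
tokens = tokenize(review)
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```

Note that a naive stop-word list can hurt sentiment analysis if it removes negations such as "not", so the list here deliberately keeps them.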
Preprocessed data set
Term frequency matrix
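To make the idea of a term frequency matrix concrete, here is a minimal pure-Python sketch on three made-up, already tokenised documents. It uses the textbook definition idf = log(N / df); scikit-learn's `TfidfTransformer` uses a smoothed variant and normalises each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three toy, already-tokenised documents (made up for illustration).
docs = [
    ["good", "movie", "good", "cast"],
    ["bad", "movie"],
    ["good", "fun"],
]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in docs for word in doc})


def term_frequency(doc):
    """Raw count of each vocabulary word in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]


def inverse_document_frequency():
    """idf(word) = log(N / df), df = number of documents containing the word."""
    n_docs = len(docs)
    return [
        math.log(n_docs / sum(1 for doc in docs if word in doc))
        for word in vocabulary
    ]


tf_matrix = [term_frequency(doc) for doc in docs]
idf = inverse_document_frequency()
# Weight every term count by the idf of its column.
tfidf_matrix = [[tf * weight for tf, weight in zip(row, idf)] for row in tf_matrix]

print(vocabulary)
for row in tf_matrix:
    print(row)
```

Each row is one review and each column one word; words that occur in many reviews (here "movie" and "good") get a lower idf weight than words that occur in only one.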
Model Learning
- Known as supervised classification (supervised learning) in the machine learning world
- Given a labelled dataset, the task is to learn a function that predicts the label for a given input
- In this case we learn a function predictReview(review) => sentiment
- Algorithms such as decision trees, Naive Bayes and Support Vector Machines can be used
- scikit-learn provides implementations of many classification algorithms out of the box
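As a rough illustration of what "learning a function" means, the sketch below implements a very small Bernoulli Naive Bayes classifier from scratch. It is an assumption-laden toy, not scikit-learn's implementation: the function names, the four training reviews and the add-one smoothing are all chosen for this example only.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(documents, labels):
    """Learn class priors and per-class word-presence probabilities
    (with add-one smoothing) from tokenised, labelled documents."""
    vocabulary = sorted({word for doc in documents for word in doc})
    docs_by_class = defaultdict(list)
    for doc, label in zip(documents, labels):
        docs_by_class[label].append(set(doc))
    model = {"vocabulary": vocabulary, "classes": {}}
    for label, class_docs in docs_by_class.items():
        prior = len(class_docs) / len(documents)
        # P(word present | class), smoothed so no probability is 0 or 1.
        present = {
            word: (sum(word in doc for doc in class_docs) + 1) / (len(class_docs) + 2)
            for word in vocabulary
        }
        model["classes"][label] = (prior, present)
    return model


def predict_review(model, tokens):
    """Return the label with the highest log-posterior for the given tokens.
    Words that were never seen during training are ignored."""
    words = set(tokens)
    best_label, best_score = None, -math.inf
    for label, (prior, present) in model["classes"].items():
        score = math.log(prior)
        for word in model["vocabulary"]:
            p = present[word]
            score += math.log(p if word in words else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Four made-up training reviews, already tokenised.
train_docs = [
    ["loved", "great", "movie"],
    ["great", "fun"],
    ["hideous", "boring", "movie"],
    ["boring", "insult"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_bernoulli_nb(train_docs, train_labels)
prediction = predict_review(model, ["great", "movie"])
print(prediction)
```

The learned `model` plays the role of the black box: once trained, `predict_review` maps any tokenised review to a sentiment label.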
Model validation
- We have to validate whether our model works
- Split the labelled dataset into two parts (60% training, 40% test)
- Learn the model on the training dataset
- Apply the model to the examples in the test set and calculate the accuracy
- This gives a decent approximation of how the model would perform on unseen reviews
- This process is known as split validation
- scikit-learn provides implementations of validation techniques out of the box
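The splitting and scoring steps can be sketched without any library. The helper names below (`split_dataset`, `accuracy`) and the ten placeholder reviews are assumptions for illustration; in practice scikit-learn's own splitting utility would normally be used instead.

```python
import random


def split_dataset(examples, labels, test_size=0.4, seed=43):
    """Shuffle the labelled examples and split them into training and test parts."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(examples, train_idx), pick(examples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Ten placeholder reviews with alternating labels.
examples = ["review %d" % i for i in range(10)]
labels = ["positive" if i % 2 == 0 else "negative" for i in range(10)]

x_train, x_test, y_train, y_test = split_dataset(examples, labels)
print(len(x_train), len(x_test))
```

The model is fitted on `x_train` and `y_train` only; its predictions for `x_test` are then compared with `y_test` using `accuracy`.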
The Blackbox is ready !!
Let's take a look at the code
preprocess
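# Load the data from the labelled CSV file and
# create a term frequency matrix.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def preprocess():
    data, target = load_file()  # load_file() is assumed to be defined elsewhere
    # binary=True records only whether a word occurs in a review, not how often
    count_vectorizer = CountVectorizer(binary=True)
    data = count_vectorizer.fit_transform(data)
    # use_idf=False skips the IDF weighting: rows are normalised term frequencies
    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
    # return the labels as well, since learn_model() needs them
    return tfidf_data, target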
learn model
# sklearn.model_selection replaces the removed sklearn.cross_validation module
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

def learn_model(data, target):
    # Prepare the data for split validation:
    # 60% training, 40% test, then learn the model.
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=0.4, random_state=43)
    classifier = BernoulliNB().fit(data_train, target_train)
    predicted = classifier.predict(data_test)
    evaluate_model(target_test, predicted)
evaluate model
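# Evaluate model performance.
from sklearn.metrics import classification_report

def evaluate_model(target_true, target_predicted):
    # print() as a function works on both Python 2.7 and Python 3
    print(classification_report(target_true, target_predicted))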
classification report
             precision    recall  f1-score   support
   negative       0.71      0.91      0.80       409
   positive       0.86      0.59      0.70       375
avg / total       0.78      0.76      0.75       784
The accuracy score is 75.89%
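The figures in the report are related by simple arithmetic, which the following sketch checks. It only re-derives numbers from the rounded precision, recall and support values shown above; the variable names are chosen for this example.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per class, copied from the report.
report = {
    "negative": (0.71, 0.91, 409),
    "positive": (0.86, 0.59, 375),
}

total_support = sum(support for _, _, support in report.values())

f1_negative = f1_score(*report["negative"][:2])
f1_positive = f1_score(*report["positive"][:2])

# The "avg / total" row weights each class by its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support
weighted_recall = sum(r * s for _, r, s in report.values()) / total_support

print(round(f1_negative, 2), round(f1_positive, 2))
print(round(weighted_precision, 2), round(weighted_recall, 2), total_support)
```

The support-weighted recall is, up to the rounding of the per-class figures, the same quantity as the overall accuracy, which is why 0.76 in the last row sits so close to the reported 75.89%.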
credit where credit's due
Xoanon Analytics - for letting us work on interesting things
Arathi Arumugam - helped to develop the sample code
Shameless plug
Talented students looking for internships are always welcome!
mail to: venkatesh.umaashankar[at]xoanonanalytics(dot)com