Sentiment analysis with Python*

* using scikit-learn
@vumaasha
  • On a Sunday afternoon, you are bored. You want to watch a movie that has mixed reviews.
  • You want to know the overall feeling about the movie, based on its reviews
  • Let's build a Sentiment Model with Python!!
it's a blackbox ???

How to build the Blackbox?

Sentiments from movie reviews

This movie is really not all that bad. But then again, this movie genre is right down my alley. Sure, the sets are cheap, but they really did decent with what they had. If you like cheap, futuristic, post-apocalyptic B movies, then you'll love this one!! I sure did!. It's a movie to keep you interested forever.

Sentiments from movie reviews

This must be accompanied by a special rating and warning: NOT RECOMMENDED TO NORMAL PEOPLE. The obsession of Daneliuc with the most dirty body functions becomes here a real nightmare. Also, it's evident that the man is a misanthrope, he hates everybody - his country his people, his actors, his job. And this hatred makes him blind and he forgets anymore the profession he knew long ago. This so called "film" is just a hideous string of disgusting images, with no artistic value and no professionist knowledge. It is an insult to good taste and to good sense. Shame, shame, shame

Preprocessing

Tokenization

converts sentences into words and removes punctuation and symbols
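
A minimal sketch of tokenization, assuming a simple regular expression is good enough. The `tokenize` helper is hypothetical and not part of the talk's source; it lowercases the text and keeps runs of letters, digits and apostrophes.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Sure, the sets are cheap, but they really did decent!")
print(tokens)
```

Punctuation such as commas and exclamation marks never matches the pattern, so it is dropped without a separate clean-up pass.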

stemming

reduces each word to its root: running, runs and runner all become run.
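
A deliberately naive sketch of stemming, only to show the idea. The `naive_stem` helper and its three-entry suffix list are made up for this illustration; a real project would more likely use an established stemmer, for example the Porter stemmer shipped with NLTK.

```python
# Hypothetical, far from complete suffix list: just enough for the slide's example.
SUFFIXES = ("ing", "er", "s")

def naive_stem(word):
    """Strip one known suffix and collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # "runn" -> "run": undo the consonant doubling left by stripping
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print([naive_stem(w) for w in ["Running", "runs", "runner", "run"]])
```

A rule set this small over-stems many ordinary words, which is why proper stemmers carry much longer rule lists.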

Filter stop words

removes very common words that carry little meaning (as, from, on, the, to, in ...) from the corpus
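
A small sketch of stop-word filtering. The `STOP_WORDS` set holds only the six words named above and the `remove_stop_words` helper is hypothetical; scikit-learn's `CountVectorizer(stop_words="english")` offers a much longer built-in list.

```python
# Only the words mentioned on the slide; real stop-word lists are much longer.
STOP_WORDS = {"as", "from", "on", "the", "to", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]

filtered = remove_stop_words(["the", "sets", "are", "cheap", "in", "the", "movie"])
print(filtered)
```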

Generate TF-IDF matrix

computes term frequency and inverse document frequency for each word in the corpus
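
To make the two quantities concrete, here is a sketch of the textbook formula, tf = count / document length and idf = log(N / document frequency). This is an assumption about which variant to show: scikit-learn's `TfidfTransformer` uses a smoothed idf by default, so its numbers differ. The `tf_idf` helper is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["cheap", "movie"], ["great", "movie"]]
weights = tf_idf(docs)
print(weights[0])
```

The word "movie" appears in every document, so its idf is log(1) = 0 and it contributes nothing; "cheap" appears in only one document and keeps a positive weight.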

Preprocessed data set

Term frequency matrix

Model Learning

  1. Known as supervised classification/learning in the machine learning world
  2. Given a labelled dataset, the task is to learn a function that will predict the label given the input
  3. In this case we will learn a function predictReview(review as input)=>sentiment
  4. Algorithms such as decision trees, Naive Bayes and Support Vector Machines can be used
  5. scikit-learn has implementations of many classification algorithms out of the box
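
The points above can be sketched on a toy dataset. The four reviews and the `predict_review` helper are made up for illustration and are not the talk's data or source; the sketch assumes binary word-presence features and a Bernoulli Naive Bayes classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy labelled dataset (made up for illustration, not the talk's data).
reviews = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "awful acting boring plot",
]
sentiments = ["positive", "positive", "negative", "negative"]

# Binary word-presence features feeding a Bernoulli Naive Bayes classifier.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
model.fit(reviews, sentiments)

def predict_review(review):
    """Return the predicted sentiment label for a single review string."""
    return model.predict([review])[0]

print(predict_review("great wonderful movie"))
print(predict_review("boring awful plot"))
```

Swapping `BernoulliNB()` for another scikit-learn classifier leaves the rest of the sketch unchanged, since they share the same `fit`/`predict` interface.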

Model validation

  1. We have to validate whether our model works
  2. Split the labelled dataset into two parts (60% training, 40% test)
  3. Learn the model on the training dataset
  4. Apply the model to the examples in the test set and calculate the accuracy
  5. Now we have a decent approximation of how our model would perform on unseen reviews
  6. This process is known as split validation
  7. scikit-learn has implementations of validation techniques out of the box
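
The steps above can be sketched as follows. The ten reviews are made up for illustration, so the resulting accuracy means nothing beyond showing the mechanics. The vectorizer is fitted before the split for brevity; a stricter setup would fit it on the training part only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Ten made-up labelled reviews (not the talk's dataset).
reviews = [
    "great movie", "loved the plot", "wonderful acting", "really enjoyable film",
    "brilliant and fun", "terrible movie", "hated the plot", "awful acting",
    "really boring film", "dull and bad",
]
labels = ["positive"] * 5 + ["negative"] * 5

features = CountVectorizer(binary=True).fit_transform(reviews)

# 60% training, 40% test; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.4, random_state=43)

classifier = BernoulliNB().fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier.predict(X_test))

print("train size:", X_train.shape[0], "test size:", X_test.shape[0])
print("accuracy: {:.2%}".format(accuracy))
```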

The Blackbox is ready !!

Let's take a look at the code

preprocess


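                # load the data from the labelled csv file and
                # create a term frequency matrix
                from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

                def preprocess():
                    # load_file() is not shown on this slide
                    data, target = load_file()
                    # binary=True records word presence rather than raw counts
                    count_vectorizer = CountVectorizer(binary=True)
                    data = count_vectorizer.fit_transform(data)
                    # use_idf=False: normalised term frequencies only, no IDF weighting
                    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)

                    return tfidf_data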
            

learn model


                from sklearn.model_selection import train_test_split
                from sklearn.naive_bayes import BernoulliNB

                def learn_model(data, target):
                    # preparing data for split validation.
                    # 60% training, 40% test and learning model
                    # (train_test_split lived in sklearn.cross_validation
                    # in older scikit-learn releases)
                    data_train, data_test, target_train, target_test = \
                        train_test_split(data, target,
                                         test_size=0.4, random_state=43)
                    classifier = BernoulliNB().fit(data_train, target_train)
                    predicted = classifier.predict(data_test)
                    evaluate_model(target_test, predicted)
            

evaluate model


                from sklearn.metrics import classification_report

                # evaluating model performance
                def evaluate_model(target_true, target_predicted):
                    print(classification_report(target_true,
                                                target_predicted))
            

classification report


                            precision    recall  f1-score   support

                   negative      0.71      0.91      0.80       409
                   positive      0.86      0.59      0.70       375

                avg / total      0.78      0.76      0.75       784

                The accuracy score is 75.89%
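
As a quick check on how the columns relate, the f1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the report, so it can only match to two decimals; the `f1_from` helper is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the report above.
negative_f1 = f1_from(0.71, 0.91)
positive_f1 = f1_from(0.86, 0.59)
print(round(negative_f1, 2), round(positive_f1, 2))  # 0.8 0.7
```

The low recall for the positive class (0.59) drags its f1-score down even though its precision is the higher of the two.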
           

want to see it work?

Get the source from GitHub and run it, Luke!

credit where credit's due

Xoanon Analytics - for letting us work on interesting things

Arathi Arumugam - helped to develop the sample code

Shameless plug

Talented students looking for internships are always welcome!!

mail to: venkatesh.umaashankar[at]xoanonanalytics(dot)com
