Text Classification on Movie Reviews

Khulood Nasher
Jan 2, 2021

Recently, following audiences’ opinions and feelings about the movies they watch has become an important source of data for the movie industry. Big companies like YouTube, Netflix, Amazon, IMDb, and others study movie reviews very seriously and use them to improve their customers’ satisfaction. Analyzing whether movie reviews are positive or negative is a form of sentiment analysis. In this blog, I’m going to provide a simple, basic sentiment analysis of movie reviews using machine learning with the Scikit-learn library.

The dataset I used here comes from Stanford University. It contains 6000 movie reviews, of which 3000 are labeled positive and 3000 are labeled negative. To get the Python code, please visit my GitHub.

In any classification analysis, two libraries are usually needed first: NumPy and pandas. The data file must also be loaded and read into a data frame.
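A minimal sketch of the loading step follows. The real file is not reproduced here, so a few made-up rows in an in-memory buffer stand in for it. The column name `review` and the filename in the comment are assumptions. Only the `label` column name comes from the article.

```python
import io

import numpy as np   # imported alongside pandas, as the text describes
import pandas as pd

# Stand-in for the real dataset: three made-up rows in tab-separated layout.
# With the actual file you would pass its path instead, for example
# pd.read_csv("moviereviews.tsv", sep="\t")  (hypothetical filename).
sample_tsv = io.StringIO(
    "label\treview\n"
    "pos\tA warm, funny film with a great cast.\n"
    "neg\tDull plot and wooden acting.\n"
    "pos\tI loved every minute of it.\n"
)

# sep="\t" tells pandas that the columns are separated by tabs.
df = pd.read_csv(sample_tsv, sep="\t")
print(df.head())
```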

The movie reviews come preprocessed as a tab-separated file (‘tsv’). The labels are given as ‘pos’ for positive reviews and ‘neg’ for negative reviews.

Now we have to perform some basic cleaning on our data. The first step is to check for missing values by counting the NaN values in each column.
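One way to count missing values with pandas is sketched here on a tiny made-up data frame, where one review is deliberately set to NaN. The `review` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with one missing review.
df = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Great film.", np.nan, "Loved it.", "Too long."],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```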

The check shows that we have 20 missing values in our reviews. To get rid of them, we will simply drop them from our data, because they are only a small fraction (20/6000 ≈ 0.3%) of the 6000 reviews, and dropping them will not hurt our analysis.
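Dropping the incomplete rows could look like the sketch below, again on made-up stand-in data. Whether the original notebook used `inplace=True` or a reassignment is not known, so reassignment is used here as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with one missing review.
df = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Great film.", np.nan, "Loved it.", "Too long."],
})

rows_before = len(df)

# dropna() removes every row that contains at least one missing value.
df = df.dropna()

print(rows_before, "->", len(df))
```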

Next, we can take a quick look at our target column, which is the ‘label’ column.
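A quick way to inspect the target column is `value_counts()`, sketched here on four made-up rows. The `review` column name is an assumption.

```python
import pandas as pd

# Made-up stand-in data with two reviews per class.
df = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["Great film.", "Boring.", "Loved it.", "Too long."],
})

# value_counts() returns how many rows carry each label.
label_counts = df["label"].value_counts()
print(label_counts)
```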

The data is balanced, with a 50–50 split between positive and negative labels, which is a good sign: we will not need to work around a class imbalance that might affect our predictions.

Now it is time to start the machine learning process to analyze our movie reviews. We first have to split our data into training and testing sets. A common choice is to use 70% of the data for training and 30% for testing. It is also better to set a random state: the split shuffles the reviews by default, so it is not focused on one particular group of reviews, and fixing the random state makes that shuffle reproducible from run to run.
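A 70/30 split with scikit-learn's `train_test_split` might look like the sketch below. The ten reviews are made up, the `review` column name is an assumption, and `random_state=42` is an arbitrary illustrative value, not necessarily the one used in the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data: ten reviews, five per class.
df = pd.DataFrame({
    "label": ["pos", "neg"] * 5,
    "review": [f"made-up review number {i}" for i in range(10)],
})

X = df["review"]   # the text we learn from
y = df["label"]    # the target we want to predict

# test_size=0.3 holds out 30% of the rows; the rows are shuffled before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Repeating the call with the same random_state selects the same rows again.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))
print(list(X_test.index) == list(X_test_again.index))
```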

After that, we can build a pipeline that includes a TF-IDF vectorizer (TfidfVectorizer) to turn our text into numeric features, and a classifier, which is the algorithm that will be fit to and trained on our data. For the classifier we can try any algorithm; here I will try a random forest classifier.
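A pipeline of this kind can be sketched as follows. The step names `tfidf` and `clf`, the variable name `text_clf`, the `random_state` value, and the made-up training reviews are all assumptions. Both estimators use their default settings, which may differ from the original notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up training reviews standing in for the real X_train / y_train.
X_train = [
    "a wonderful and moving film",
    "great acting and a great story",
    "boring and far too long",
    "terrible script and bad acting",
]
y_train = ["pos", "pos", "neg", "neg"]

# Step 1 turns raw text into TF-IDF features; step 2 trains the classifier on them.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# fit() runs both steps in order on the training data.
text_clf.fit(X_train, y_train)
print(text_clf.named_steps["clf"].classes_)
```

Wrapping both steps in one pipeline means the vectorizer learns its vocabulary from the training reviews only, and the same transformation is applied automatically at prediction time.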

Now we can check the predictions of our model by running it on X_test.
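The prediction step is sketched below on made-up data. The pipeline is rebuilt here so the example runs on its own, and the step names, `random_state`, and review texts are assumptions. With so few examples the predicted labels themselves are not meaningful; the point is only the call pattern.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up stand-ins for the real training and test reviews.
X_train = [
    "a wonderful and moving film",
    "great acting and a great story",
    "boring and far too long",
    "terrible script and bad acting",
]
y_train = ["pos", "pos", "neg", "neg"]
X_test = ["a great and moving story", "bad script, boring film"]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])
text_clf.fit(X_train, y_train)

# predict() vectorizes X_test with the fitted TF-IDF step, then classifies it.
predictions = text_clf.predict(X_test)
print(predictions)
```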

And now we can get the confusion matrix and the classification report.
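Both reports come from `sklearn.metrics`. In the sketch below, two short hand-written lists stand in for the real `y_test` and `predictions`, so the numbers are illustrative only and are not the article's results.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written stand-ins for the true labels and the model's predictions.
y_test = ["pos", "neg", "pos", "neg", "pos", "neg"]
predictions = ["pos", "neg", "neg", "neg", "pos", "pos"]

# Rows are the true labels, columns are the predicted labels,
# both in the order given by `labels`.
cm = confusion_matrix(y_test, predictions, labels=["neg", "pos"])
print(cm)

# Precision, recall, and f1-score for each label.
report = classification_report(y_test, predictions)
print(report)
```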

The recall, precision, and f1-score are close for both the positive and negative labels, which reflects the good balance between the two sentiments.

Finally, we can check the accuracy score of our random forest model.
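Accuracy can be computed with `accuracy_score`. As before, the two hand-written lists are stand-ins for the real `y_test` and `predictions`, so the value printed here has nothing to do with the accuracy reported in the article.

```python
from sklearn.metrics import accuracy_score

# Hand-written stand-ins for the true labels and the model's predictions.
y_test = ["pos", "neg", "pos", "neg", "pos", "neg"]
predictions = ["pos", "neg", "neg", "neg", "pos", "pos"]

# Accuracy is the fraction of predictions that match the true labels.
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```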

The random forest model was able to classify reviews as positive or negative with an accuracy of about 88.6% when used with the Tfidf vectorizer. This accuracy is good, but it could be improved by trying different algorithms and vectorizers, and more text preprocessing could still be done to improve the model's accuracy.


Khulood Nasher

Data Scientist, Health Physicist, NLP Researcher, & Arabic Linguist. https://www.linkedin.com/in/khuloodnasher https:/khuloodnasher1.wixsite.com/resume