Twitter Sentiment Analysis on the June 22 Proclamation

Khulood Nasher
10 min read · Aug 3, 2020


Proclamation Suspending Entry of Aliens Who Present a Risk to the U.S. Labor Market Following the Coronavirus Outbreak

Capstone Project: GitHub link

By: Khulood Nasher

Introduction

Define The Problem

Background

As a response to the COVID-19 pandemic, and for the sake of supporting the American job market, President Donald Trump issued a new proclamation on June 22, 2020, suspending the issuing of new visas in categories that are non-immigrant but give applicants the right to enter America and the eligibility to work any kind of job (Source of June 22 Proclamation). The June 22 proclamation is an extension of the previous suspension that started on April 22, 2020, and was supposed to last 60 days (Source of April 22 Proclamation). The visa types affected by this proclamation are the H, L, and some of the J visas.

Main Questions:

1- What do people think about proclamation June 22?

2- What aspects get the most negative mentions?

3- What aspects get the most positive mentions?

To answer those questions, I performed a Twitter sentiment analysis. I collected tweets during the period from June 26 to July 25, 2020.

Sentiments of the public were collected on a weekly basis by tracing the following hashtags:

#H2bvisa

#h4visa

#Lvisa

#J1Visa

#h1bvisa

#workvisa

The tweetId and tweet columns were the only columns used for my sentiment analysis.

In my Twitter sentiment analysis project, I followed the data science approach known as OSEMN, an abbreviation for Obtain data, Scrub data, Explore data, Model data, and iNterpret data, as follows:

Figure(1): Data science approach in performing projects. Source of Photo

Methodology:

Figure(2): My methodology in performing twitter sentiment analysis.

Scrub Data

To clean the tweets, I did the following:

1- Cleaning punctuation through string.punctuation.

2- Removing English stopwords.

3- Adding an extended list of unwanted words to remove, such as http, amp, rt, RT, etc.

4- Keeping important symbols such as “#” and “@”.

5- Using regex functions in cleaning.

6- Tokenizing the tweets inside the cleaning function, so each tweet becomes single-word tokens separated by spaces.

7- Joining the tokens back into strings, which makes the tweets a list of strings.

8- Converting the cleaned list of tweet strings to a data frame.

First, I defined the cleaning list of words that I’m going to remove from my tweets as follows:
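A minimal sketch of such a cleaning list, with illustrative entries (the project’s actual, longer list is not reproduced here):

```python
import string

# Extra tokens to strip beyond standard English stopwords
# (entries are illustrative; the project's full list was longer)
extra_stopwords = ['http', 'https', 'amp', 'rt', 'RT', 'co']

# Punctuation to remove, keeping the important '#' and '@' symbols
keep = {'#', '@'}
punctuation_to_remove = ''.join(ch for ch in string.punctuation
                                if ch not in keep)
```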

Then I created a pipeline to remove stop words and punctuation and to perform tokenization:
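A sketch of what such a pipeline function can look like, assuming a small inline stopword set in place of NLTK’s full English list:

```python
import re
import string

# Small inline stopword set standing in for NLTK's English stopwords
stopwords = {'the', 'a', 'an', 'and', 'or', 'is', 'are', 'to', 'of', 'in', 'for'}
extra_stopwords = {'http', 'https', 'amp', 'rt'}

def clean_tweet(tweet):
    """Lowercase, strip URLs, drop punctuation (keeping # and @),
    remove stopwords, and return the tweet as a list of tokens."""
    tweet = tweet.lower()
    tweet = re.sub(r'http\S+|www\.\S+', '', tweet)   # remove links
    keep = {'#', '@'}
    tweet = ''.join(ch for ch in tweet
                    if ch not in string.punctuation or ch in keep)
    tokens = tweet.split()
    return [t for t in tokens
            if t not in stopwords and t not in extra_stopwords]
```

For example, `clean_tweet('RT @user: Save the #h1bvisa! https://t.co/xyz')` returns `['@user', 'save', '#h1bvisa']`.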

Then, I applied the cleaning function on my data frame:

After that, I joined the tokens into strings, and then converted the list of tweet strings into a data frame:
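A sketch of this step with pandas, using a simplified stand-in cleaning function so the snippet is self-contained:

```python
import pandas as pd

# Simplified stand-in for the project's cleaning function
def clean_tweet(tweet):
    return [t for t in tweet.lower().split() if t not in {'rt', 'the'}]

df = pd.DataFrame({'tweetId': [1, 2],
                   'tweet': ['RT save the #h1bvisa', 'end the visa ban']})

# Tokenize every tweet, then join the tokens back into one string per tweet
df['tokens'] = df['tweet'].apply(clean_tweet)
df['clean_tweet'] = df['tokens'].apply(' '.join)
```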

Labeling Tweets:

1- About 9018 tweets were collected and cleaned.

2- One third of the tweets were manually labeled. A tweet was labeled positive (1) if it supports the proclamation, negative (-1) if it disagrees with the proclamation, and neutral (0) if it doesn’t show any feeling or stance toward the proclamation, such as sharing neutral news as a source of information.

3- A labeling function was defined, in which a compiler from the regex library creates patterns of all positive and negative words.

4- Lists of positive and negative words were manually created and fed to the labeling function.

5- The sentiment analyzer function was then applied to all the collected, cleaned tweets, and all the tweets were labeled.

6- The accuracy of the function was improved by comparing its labels against the manually labeled tweets; more positive and negative words were added to the lists until the function labeled most tweets properly.

First I defined the sentiment analysis function to help with labeling tweets:
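A sketch of such a regex-based labeling function; the positive and negative word lists here are short illustrative stand-ins for the manually built lists:

```python
import re

# Illustrative word lists; the project's actual lists were much longer
positive_words = ['great', 'support', 'protect jobs', 'american jobs']
negative_words = ['ban', 'separation', 'save j1', 'help immigrants']

pos_pattern = re.compile('|'.join(map(re.escape, positive_words)))
neg_pattern = re.compile('|'.join(map(re.escape, negative_words)))

def sentiment_analyzer(tweet):
    """Label a tweet 1 (positive), -1 (negative), or 0 (neutral)
    by counting matches against the word-list patterns."""
    pos = len(pos_pattern.findall(tweet.lower()))
    neg = len(neg_pattern.findall(tweet.lower()))
    if pos > neg:
        return 1
    if neg > pos:
        return -1
    return 0
```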

Then I applied my function on the tweets and created new labeled values:
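Applying the labeler column-wise with pandas might look like this (with a simplified stand-in analyzer so the snippet is self-contained):

```python
import pandas as pd

# Simplified stand-in for the project's sentiment analyzer
def sentiment_analyzer(tweet):
    t = tweet.lower()
    if 'ban' in t:
        return -1          # negative
    if 'jobs' in t:
        return 1           # positive
    return 0               # neutral

df = pd.DataFrame({'clean_tweet': ['end visa ban',
                                   'protect american jobs',
                                   'visa news update']})
# Create the new labeled column
df['label'] = df['clean_tweet'].apply(sentiment_analyzer)
```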

Explore Data

In the second stage of the OSEMN approach, exploring the data (the tweets here), I performed the following analysis:

First, I explored the percentage distribution of the tweets in each class through a pie plot and a bar plot as follows:
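The counts behind both plots can be computed with pandas value_counts; the labels below are toy values for illustration, and the commented plot calls are what would draw the bar and pie charts:

```python
import pandas as pd

# Toy labels for illustration: -1 negative, 0 neutral, 1 positive
labels = pd.Series([-1, -1, -1, -1, 1, 1, 0, 0])

counts = labels.value_counts()                       # tweets per class
percent = labels.value_counts(normalize=True) * 100  # class percentages

# counts.plot(kind='bar') and percent.plot(kind='pie') draw the two figures
```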

Figure(3): Distribution of the number of tweets in each Class

And I also visualized the percentage distribution through a pie plot:

Figure(4): Pie Plot shows the distribution of tweets in each class

Figure(3) and Figure(4) show that negative sentiment is more dominant than positive and neutral sentiment. The negative sentiment according to the pie plot is 51%, which almost equals the positive and neutral sentiments combined. These two plots answer my first question: What do people think about proclamation June 22?

I also generated the word clouds for the tweet classes.

I generated a word cloud for the negative sentiment as follows:
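A word cloud is rendered from per-class word frequencies; a minimal sketch of computing those frequencies (toy tweets for illustration; the WordCloud package’s generate_from_frequencies would then draw the figure):

```python
from collections import Counter

# Toy negative tweets for illustration
negative_tweets = ['save j1 visa', 'help immigrants families',
                   'end ban help families']

# Frequency of each word across the negative class
word_freq = Counter(word for tweet in negative_tweets
                    for word in tweet.split())

# WordCloud().generate_from_frequencies(word_freq) would render the cloud
```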

Figure(5): Word Cloud of words expressing negative sentiment.

From the word cloud of the negative class, several words appear frequently, such as Senator Durbin, help, au pair, and S386.

The word ‘help’, for example, is positive in the English dictionary; however, in my project the word ‘help’ is negative because it showed up many times in opponents’ tweets that called to “help immigrants and families waiting for decades”.

I also visualized the positive word cloud of my tweets as follows:

Figure(6): Word Cloud of positive sentiments.

And I visualized the neutral word cloud:

Figure(7): Word Cloud of neutral sentiment

Also, I explored the length of the tweets based on the number of characters in each class:
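Character length per class can be sketched with pandas groupby (toy data for illustration):

```python
import pandas as pd

# Toy tweets and labels for illustration
df = pd.DataFrame({'tweet': ['end the visa ban today',
                             'good for jobs here',
                             'visa news'],
                   'label': [-1, 1, 0]})

df['length'] = df['tweet'].str.len()               # characters per tweet
avg_length = df.groupby('label')['length'].mean()  # mean length per class
```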

Figure(8): Bar plot of the length of tweets based on the number of characters in each class

The above plot shows that people who supported the proclamation wrote tweets of about the same length as people who opposed it; the neutral tweets, however, were shorter.

I also visualized the top 25 most common words across all the tweets:

Figure(9): The Twenty-Five Most Common Words

Then, I visualized the top 25 most common words in the negative sentiment class:

Figure(10): Bar plot of the top 25 negative words

And I visualized the top 25 words in the positive sentiment class:

Figure(11): Bar plot of the top 25 positive words

The bar plot of the top negative words answers question 2 above (What aspects get the most negative mentions?), and the bar plot of the top positive words answers question 3 (What aspects get the most positive mentions?).

Modeling Data:

The third stage of the OSEMN approach is modeling the data. But before I started modeling, I preprocessed my data as follows:

Preprocessing Data:

  • The target column of sentiments was encoded to 0 for negative sentiment, 1 for neutral sentiment, and 2 for positive sentiment.
  • Tweets were tokenized, then vectorized using a vectorizer such as TF-IDF, and the embedding techniques Doc2Vec and Word2Vec.
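The target encoding can be sketched as a simple mapping (scikit-learn’s LabelEncoder is an alternative):

```python
import pandas as pd

# Map the manual labels (-1, 0, 1) to model targets (0, 1, 2)
label_map = {-1: 0, 0: 1, 1: 2}

labels = pd.Series([-1, 1, 0, -1])
targets = labels.map(label_map)
```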

I divided my modeling step into three parts as follows:

Part 1: TfidfVectorizer:

Vectorization:

TF-IDF

Classifiers:

LinearSVC, SGDClassifier, LogisticRegression, Random Forest (balanced with SMOTE and tuned with grid search), Multinomial NB, AdaBoost, XGBoost, Neural Network

Train-Test Split: 60% Training, 40% Testing

Evaluation: Confusion Matrix, Classification Report, Accuracy Score.

Example: Random Forest Classifier: Accuracy 96%
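A minimal sketch of the TF-IDF + Random Forest setup with scikit-learn, on a toy corpus standing in for the labeled tweets (texts and labels are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the ~9,000 labeled tweets
tweets = ['end the visa ban', 'save j1 families', 'protect american jobs',
          'good for american workers', 'visa news update',
          'weekly visa report'] * 5
labels = [0, 0, 2, 2, 1, 1] * 5   # 0 = negative, 1 = neutral, 2 = positive

# 60/40 split as in the project
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.4, random_state=42, stratify=labels)

# TF-IDF vectorization feeding a Random Forest classifier
model = make_pipeline(TfidfVectorizer(),
                      RandomForestClassifier(random_state=42))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

A confusion matrix and classification report can then be produced with `sklearn.metrics.confusion_matrix` and `classification_report` on `model.predict(X_test)`.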

Part 2: Modeling based on Doc2Vec Vectorizer

Vectorization:

Doc2Vec

Classifiers:

LinearSVC, Random Forest, Multinomial NB, Adaboost, XGBoost, Neural Network

Train-Test Split: 60% Training, 40% Testing

Evaluation:

Confusion Matrix, Classification Report, Accuracy Score

Part 3: Pretrained model: GloVe word embeddings, 100d, pretrained on a corpus of 6 billion tokens.

Vectorization: GloVe embeddings (100d per word, pre-trained on a corpus of 6 billion tokens)

Mean embeddings are then calculated from the words present in each tweet.
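A sketch of the mean-embedding step, using a tiny stand-in lookup table in place of the real 100d GloVe vectors:

```python
import numpy as np

# Tiny stand-in lookup table (real GloVe vectors are 100-dimensional)
glove = {'visa': np.array([0.1, 0.2]),
         'ban':  np.array([0.3, 0.4]),
         'jobs': np.array([0.5, 0.6])}
dim = 2

def tweet_vector(tweet):
    """Average the embedding vectors of the in-vocabulary words;
    fall back to a zero vector when no word is known."""
    vecs = [glove[w] for w in tweet.split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```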

Classifiers:

LinearSVC, Random Forest, Multinomial NB, Adaboost, XGBoost, Neural Network

Train-Test Split:

60% Training, 40% Testing

Evaluation:

Confusion Matrix, Classification Report, Accuracy Score.

Summary of the Results:

Conclusion

After exploring multiple classifiers, with hyperparameter tuning and balancing, on different NLP vectorization techniques (TF-IDF, Doc2Vec, Word2Vec) and a pre-trained model (GloVe), I found:

  • All the models performed best with the TF-IDF vectorizer. Additional effort on more powerful vectorizers, pretrained models, balancing, or hyperparameter tuning always performed worse.
  • I attribute that to the great effort I spent manually labeling my tweets and investigating the meaning of every single tweet in depth, including sarcastic tweets and words that have a positive meaning but carry negative sentiment in these tweets, such as ‘help’ or ‘save’. Both words are positive in the English dictionary, but they are negative in this analysis because they were widely used by opponents of the proclamation calling, for example, to “save the J1 visa” or “help families from separation”.
  • The Random Forest model was the best model, with accuracy and recall of 0.96. SMOTE sampling and grid search didn’t improve accuracy.
  • The words identified as important negative features are: H1B, S386, SenatorDurbin, End, help, SaveJ1, US, h1bvisa, immigrants, GCBCoalition, Equality, visa, J1, families, American, years, immigration, PassS386, ban, future.
  • The words identified as important positive features are: tech, American, jobs, young, immigration, realDonaldTrump, OPT, graduates, imported, excluded, CEOs, Valley, Silicon, Trump, tech workers.
  • The model’s power to identify slight differences in the presence of words among the three classes is evident in how close those frequencies were in each class. For example, the word “launch” showed up in 0.03% of negative tweets, 0.06% of neutral tweets, and 0.04% of positive tweets. This sensitivity was evident across the most important words contributing to the model.

Recommendations:

Based on the public reaction, we see that we should:

  1. Consider the side effects of suspending exchange-visitor visas on the job market and income, especially the au pair programs, camps, and J-1 student tuition.
  2. Consider the income from tax collection from work-visa immigrants for the fiscal year 2021.
  3. Investigate the ratio of immigrants to Americans being hired by big tech companies and set more standards putting Americans first.
  4. Consider diversity in employment.
  5. Consider the tax income from all immigration programs as a kind of standard for widening or narrowing the immigration window, particularly during the stagnant time of COVID-19.
  6. Consider adverse factors such as the reduction in travel and exchange programs and the increase in remote overseas employment, particularly in tech jobs.
  7. Fill the demand with American physicians first, then cover the shortage with international immigrants and OPT F-1/J-1 students.
  8. Consider the tuition paid by F-1 and J-1 student-visa holders as a source of income while prioritizing employment opportunities for Americans.
  9. Consider the humanitarian factor of family separation with regard to the H-4 visa suspension, by removing the visa ban while allowing suspension of employment authorization during COVID-19.
  10. Use my model for sentiment analysis of any immigration ban in the future.

Written by Khulood Nasher

Data Scientist, Health Physicist, NLP Researcher, & Arabic Linguist. https://www.linkedin.com/in/khuloodnasher https://khuloodnasher1.wixsite.com/resume
