Because individuals often use Twitter as a place to share their feelings and thoughts, analyzing these short blocks of text can be helpful in determining general sentiment towards particular political candidates. I sought to perform sentiment analysis on a database of nearly 200,000 tweets for each major party candidate (Joe Biden for the Democrats and incumbent Donald Trump for the Republicans) in the 2020 Presidential Election.
The key questions I wanted to examine were:
- What are the major events that would spark debate around the two candidates?
- What are the most commonly tweeted words in reference to each candidate?
- What is sentiment like for each candidate across the United States and on a state-by-state basis?
- Can Twitter sentiment scores be used to predict election winners?
I first sought to examine interest over time in the two candidates. Google Trends was well suited to this analysis and also showed which major events during the election cycle sparked the most interest.
I specifically looked at the time frame from the Democratic and Republican National Conventions (as this is when the major party candidates were officially nominated) until Election Day. As we can see, there was a major jump in searches around the first presidential debate in late September. Both candidates received criticism for their aggressive performances during this event. The biggest spike in searches for Trump, however, occurs in early October, when it was reported that he and then first lady Melania Trump had tested positive for COVID-19. One would expect Election Day on November 3rd to have had the most hits, and although a spike does occur on this date, it pales in comparison to the “October Surprise” that was Trump testing positive.
I also generated word clouds that show the most frequently used terms in candidate-related search queries:
For both candidates, polls were a major concern. For Joe Biden, we see a lot of search terms related to the controversy around Hunter Biden. For Donald Trump, we see mentions of family members as well, such as Melania and Barron, but also a lot of positive searches about his Nobel Peace Prize nomination.
Now that I had an idea of the major events most likely to drive Twitter activity, I moved to the Twitter data. This dataset is hosted on Kaggle and is linked at the bottom of the webpage. After filtering out tweets from foreign countries, to ensure that I was only examining US sentiment, there were roughly 200,000 tweets for each candidate. Because there were hundreds of thousands of words in this dataset, I thought that rather than word clouds it would be more useful to look at the top 20 words tweeted in relation to each candidate:
As you can see, the top word tweeted about each candidate is the name of the other. This is worth noting as a confounding factor for the sentiment analysis later: people tend to mention both candidates in the same tweet, which could throw off the sentiment scores. It is notable that people frequently mentioned Biden’s running mate, Kamala Harris, whereas Trump’s was rarely referenced. As expected from the Google Trends results, “debate” is frequently mentioned for Biden and “covid” is frequently mentioned for Trump. “Covid” is frequently mentioned for Biden as well, most likely because people were speculating about or debating his policy proposals on the matter. Election results and voting, however, were the top concern for tweets about both candidates.
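To make the word-count step concrete, here is a minimal sketch of how a top-N word list like the one above could be produced. It is not the code used for this post: the stop-word list, the `top_words` helper, and the sample tweets are all made up for illustration, and a real run would use a much fuller stop-word list.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "in", "is", "for", "on", "rt"}

def top_words(tweets, n=20, stop_words=STOP_WORDS):
    """Return the n most common words across tweets, ignoring stop words."""
    counts = Counter()
    for text in tweets:
        # Lowercase and drop URLs so links do not pollute the counts.
        text = re.sub(r"https?://\S+", " ", text.lower())
        words = re.findall(r"[a-z']+", text)
        # Skip stop words and very short tokens.
        counts.update(w for w in words if w not in stop_words and len(w) > 2)
    return counts.most_common(n)

# Made-up sample tweets, not taken from the Kaggle dataset.
sample_tweets = [
    "Biden and Trump clash in the debate https://example.com",
    "Trump tests positive for covid",
    "Biden debate performance vs Trump",
]
print(top_words(sample_tweets, n=3))
```

Running the same function separately on the Biden tweets and the Trump tweets would give one ranked list per candidate, which is also an easy way to see how often each candidate's tweets mention the other.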
Next, I performed sentiment analysis on both candidates’ tweets. The Twitter data was split up by state (D.C. was included as well, since it has three electoral votes), and I looped through the data to calculate a sentiment score for each candidate in all 51 areas. The results were quite interesting:
As you can see, general sentiment around both candidates is overwhelmingly negative. This is reasonable, given that the 2020 election was one of the most contentious in recent history due to tensions surrounding the COVID-19 pandemic and the Black Lives Matter movement. However, we do see a handful of states where Biden has a positive sentiment score, including Arkansas, Delaware, Iowa, North Dakota, New Hampshire, New Mexico, Rhode Island, and South Dakota. Most of these aren’t the heavy-blue, liberal-leaning states you’d expect to see. I suspect that because conservatives are less likely to use Twitter in general, the data may be skewed. What is especially interesting, though, is that not a single state has a positive sentiment score for Donald Trump. Furthermore, the Trump trendline is consistently beneath the Biden one, even in states Trump won, like Florida and Texas. It seems that generally, perhaps due to his COVID-19 response, there was strong negative sentiment regarding Donald Trump.
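The per-state loop described above can be sketched roughly as follows. This is only an illustration under loud assumptions: the post does not say which sentiment tool was used, so the tiny `LEXICON`, the `tweet_score` and `state_sentiment` helpers, and the sample rows are all hypothetical stand-ins. The optional `skip_if_mentions` argument shows one possible way to handle the confounding issue of tweets that mention both candidates.

```python
from collections import defaultdict

# Toy lexicon, purely illustrative; a real analysis would use a proper
# sentiment library rather than six hand-picked words.
LEXICON = {"great": 1.0, "love": 1.0, "win": 0.5,
           "bad": -1.0, "hate": -1.0, "lies": -0.5}

def tweet_score(text, lexicon=LEXICON):
    """Average lexicon value of the scored words in a tweet (0.0 if none)."""
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

def state_sentiment(rows, skip_if_mentions=None):
    """Mean tweet score per state.

    rows: iterable of (state, text) pairs for one candidate.
    skip_if_mentions: optional word; tweets containing it are dropped,
    e.g. to exclude tweets that also name the other candidate.
    """
    totals = defaultdict(lambda: [0.0, 0])  # state -> [score sum, tweet count]
    for state, text in rows:
        if skip_if_mentions and skip_if_mentions in text.lower():
            continue
        totals[state][0] += tweet_score(text)
        totals[state][1] += 1
    return {state: total / count for state, (total, count) in totals.items()}

# Made-up rows standing in for one candidate's tweets.
biden_rows = [
    ("Iowa", "love the plan"),
    ("Iowa", "great debate"),
    ("Texas", "bad answer, trump was right"),
    ("Texas", "hate this"),
]
print(state_sentiment(biden_rows))
print(state_sentiment(biden_rows, skip_if_mentions="trump"))
```

Averaging per state keeps states with very different tweet volumes on the same scale, though states with only a few tweets will still produce noisy scores.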
I thought it would be interesting to take this data a step further and build a model. However, I didn’t think sentiment scores for each candidate alone would have enough predictive power, so I added data from the Bureau of Economic Analysis on unemployment rates, GDP, and per capita personal income for each state and D.C. The y-variable was the binary election outcome for each state and D.C. I then compared the model’s prediction to the actual outcome for each state and found that it was 84.3% accurate! Here are the results for which states were predicted correctly:
Only 8 states were predicted incorrectly. As expected, a lot of these are swing states that are difficult to estimate (even for experienced pollsters), such as Arizona, Georgia, Maine, Ohio, and Wisconsin. It was surprising, however, that the model wasn’t able to accurately estimate Alaska, Delaware, and Louisiana. Alaska and Louisiana are pretty clearly conservative states, and Delaware was supposed to be a shoo-in for Biden. The model was also, surprisingly, able to correctly predict Pennsylvania, which is historically a toss-up as well.
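As a quick sanity check, the 84.3% figure follows directly from 8 misses out of 51 areas (50 states plus D.C.). The snippet below is a small sketch of that arithmetic; the `accuracy` helper and the example predictions are hypothetical and are not the model's actual output.

```python
def accuracy(predicted, actual):
    """Share of regions where the predicted winner matches the actual winner."""
    if predicted.keys() != actual.keys():
        raise ValueError("predicted and actual must cover the same regions")
    correct = sum(predicted[region] == actual[region] for region in actual)
    return correct / len(actual)

# Counts taken from the text: 51 areas, 8 predicted incorrectly.
total_regions = 51
misses = 8
print(f"{(total_regions - misses) / total_regions:.1%}")  # 84.3%

# Hypothetical predictions for three states, only to show the helper in use.
example_predicted = {"Arizona": "Trump", "Ohio": "Trump", "Texas": "Trump"}
example_actual = {"Arizona": "Biden", "Ohio": "Trump", "Texas": "Trump"}
print(accuracy(example_predicted, example_actual))
```

Note that this accuracy is measured on the same single election the model was fit to, so it says little about how the model would do on an election it has not seen.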
It’s amazing that the model was as accurate as it was, given that it was based only on one election’s Twitter sentiment scores and economic indicators from 2020. For future study, I would love to train the model with more data from previous elections to try to further increase this accuracy. I’m looking forward to running this model in 2024 and seeing what the outcome is!
Thank you to the following data sources:
- Search data from Google Trends
- Twitter data through Kaggle: https://www.kaggle.com/manchunhui/us-election-2020-tweets/activity
- Bureau of Economic Analysis data for 2020: https://www.bea.gov/data/income-saving/personal-income-by-state