Analyzing Twitter with MATLAB
Whatever your opinion of social media these days, there is no denying it is now an integral part of our digital life. So much so, that social media metrics are now considered part of altmetrics, an alternative to the established metrics such as citations to measure the impact of scientific papers.
Today's guest blogger, Toshi, will show you how to access the Twitter API and analyze tweets with MATLAB.
Contents
- Why Twitter
- Sentiment Analysis
- Tweet Content Visualization
- Who Tweeted the News?
- Does Follower Count Really Matter? Going Viral on Twitter
- Visualizing the Retweet Social Graph
- Getting Started with Twitter using Twitty
- Processing Tweets and Scoring Sentiments
- Processing Tweets for Content Visualization
- Get the Profile of Top 5 Users
- Streaming API for High Volume Real Time Tweets
- Save an Edge List for Social Graph Visualization
- Closing
Why Twitter
Twitter is a good starting point for social media analysis because people openly share their opinions to the general public. This is very different from Facebook, where social interactions are often private. In this post, I would like to share simple examples of sentiment analysis and social graph visualization using Twitter's Search and Streaming APIs.
The first part of this post discusses analysis with Twitter, and the latter part shows the code that computes and creates plots, like those shown earlier.
Sentiment Analysis
One of the very common analyses you can perform on a large number of tweets is sentiment analysis. Sentiment is scored based on the words contained in a tweet. If you manage a brand or political campaign, for example, it may be important to keep track of your popularity, and sentiment analysis provides a convenient way to take the pulse of the tweeting public. Here is an example of sentiment analysis between Amazon and Hachette as of this writing, based on 100 tweets collected via the Twitter Search API.
The sentiment distributions are nearly identical between the two brands, but you can see that tweets mentioning both have clearly skewed to the negative, since the news is about a war between Amazon and a publisher over ebook profit margin. Is there a single metric we can use to make this comparison easier? That's where Net Sentiment Rate (NSR) comes in.
NSR = (Positive Tweets-Negative Tweets)/Total
Here is the result. You could keep taking this measurement periodically for ongoing sentiment monitoring, if interested. Perhaps you may discover that NSR is correlated to their stock prices!
Amazon NSR : 0.84 Hachette NSR: 0.58 Both NSR : -0.30
And lastly, but not in the least, did sentiment scoring actually work? Check out the top 5 positive and negative tweets for Hachette for your own assessment.
Top 5 positive tweets ___________________________________________________________________________
'@deckchairs @OccupyMyCat @aworkinglibrary but I think Hachette artists...' '@emzleb Hachette has Rowling so they hold a lot of cards (A LOT of car...' 'Amazon Confirms Hachette Spat Is To "Get a Better Deal" http://t.co/Ka...' '@shaunduke @DarkMatterzine Yeah, Gollancz is owned by Orion Publishing...' 'MUST READ Book publisher Hachette says working to resolve Amazon dispu...'
Top 5 negative tweets ___________________________________________________________________________
'Reading into the Amazon vs. Hachette battle - May 28 - The war between...' '#Vtech Reading into the Amazon vs. Hachette battle - May 28 - The war ...' '#Vbnss Reading into the Amazon vs. Hachette battle - May 28 - The war ...' 'RT @text_publishing: Amazon war with Hachette over ebook profit margin...' 'RT @text_publishing: Amazon war with Hachette over ebook profit margin...'
Tweet Content Visualization
What were the main themes they tweeted about when those users mentioned both Amazon and Hachette? The word count plot shows that mostly those tweets repeated the news headlines like “Amazon admits dispute (with) Hachette”, perhaps with some commentary - showing that Twitter was being used for news amplification.
Who Tweeted the News?
The 100 tweets collected came from 86 users. So on average each user only tweeted 1.16 times. Instead of frequency, let's find out who has a large number of followers (an indicator that they may be influential) and check their profile. It appears that 2 or 3 out of the 5 top users (based on follower count) are writers, and others are news syndication services.
Name Followers Description ________________ _________ ____________________________________________________
'Daton L Fluker' 73578 '#Horror #Novelist of Death Keeper's Biological Wast...' 'WellbeingVigor' 22224 'Writer - 10 years .here, Incurable music enthusiast #' 'E-Book Update' 10870 '' 'Michael Rosa' 10297 '' 'Net Tech News' 7487 'Latest internet and technology news headlines from ...'
Does Follower Count Really Matter? Going Viral on Twitter
In the previous section, we checked out the top 5 users based on their follower count. The assumption was that, if you have a large number of followers, you are considered more influential because more people may see your tweets.
Now let's test this assumption. For that I need more than 100 tweets. So I collected a new batch of data - 1000 tweets from 4 trending topics from the UK, and plotted the users based on their follower counts vs. how often their tweets got retweeted. The size (and the color) of the bubbles show how often those users tweeted.
It looks like you do need some base number of followers to make it to the national level, but the correlation between the follower counts to the frequency of getting retweeted looks weak. Those charts look like different stages of viral diffusion - the top two charts clearly show one user broke away from the rest of the crowd, and in that process they may have also gained more followers. The bottom two charts show a number of users competing for attention but no one has a clear breakout yet. If this was an animation, it may look like boiling water. Is anyone interested in analyzing whether this is indeed how a tweet goes viral?
Visualizing the Retweet Social Graph
Retweeting of one user's tweet by others creates a network of relationships that can be represented as a social graph. We can visualize such relationship with a popular social networking analysis tool Gephi.
"I Can't Sing" Social Graph Larger
"#InABlackHousehold" Social Graph Larger
You can see that, in the first case, two users formed large clusters of people retweeting their tweets, and everyone else was dwarfed. In the second case, we also see two dominant users, but they have not yet formed a large scale cluster.
Getting Started with Twitter using Twitty
Now that you have seen a simple analysis I did with Twitter, it is time to share how I did it in MATLAB. To get started with Twitter, you need to get your developer credentials. You also need Twitty by Vladimir Bondarenko. It is simple to use and comes with excellent documentation.
- Create a Twitter account if you do not already have one
- Create a Twitter app to obtain developer credentials
- Download and install Twitty from the FileExchange, along with the JSON Parser and optionally JSONLab
- Create a structure array to store your credentials for Twitty
Let's search for tweets that mention 'amazon' and 'hachette'.
% a sample structure array to store the credentials creds = struct('ConsumerKey','your-consumer-key-here',... 'ConsumerSecret','your-consumer-secret-here',... 'AccessToken','your-token-here',... 'AccessTokenSecret','your-token-secret-here'); % set up a Twitty object addpath twitty_1.1.1; % Twitty addpath parse_json; % Twitty's default json parser addpath jsonlab; % I prefer JSONlab, however. load('creds.mat') % load my real credentials tw = twitty(creds); % instantiate a Twitty object tw.jsonParser = @loadjson; % specify JSONlab as json parser % search for English tweets that mention 'amazon' and 'hachette' amazon = tw.search('amazon','count',100,'include_entities','true','lang','en'); hachette = tw.search('hachette','count',100,'include_entities','true','lang','en'); both = tw.search('amazon hachette','count',100,'include_entities','true','lang','en');
Processing Tweets and Scoring Sentiments
Twitty stores tweets in structure array created from the API response in JSON format. I prefer using a table when it comes to working with heterogeneous data containing a mix of numbers and text. I wrote some code, processTweets, to convert structure arrays into tables and compute sentiment scores. You can find the Amazon-Hachette data file here.
For sentiment analysis, I used AFINN, along with a list of English stop words so that we don't count frequent common words like "a" or "the".
% load supporting data for text processing scoreFile = 'AFINN/AFINN-111.txt'; stopwordsURL ='http://www.textfixer.com/resources/common-english-words.txt'; % load previously saved data load amazonHachette.mat % process the structure array with a utility method |extract| [amazonUsers,amazonTweets] = processTweets.extract(amazon); % compute the sentiment scores with |scoreSentiment| amazonTweets.Sentiment = processTweets.scoreSentiment(amazonTweets, ... scoreFile,stopwordsURL); % repeat the process for hachette [hachetteUsers,hachetteTweets] = processTweets.extract(hachette); hachetteTweets.Sentiment = processTweets.scoreSentiment(hachetteTweets, ... scoreFile,stopwordsURL); % repeat the process for tweets containing both [bothUsers,bothTweets] = processTweets.extract(both); bothTweets.Sentiment = processTweets.scoreSentiment(bothTweets, ... scoreFile,stopwordsURL); % calculate and print NSRs amazonNSR = (sum(amazonTweets.Sentiment>=0) ... -sum(amazonTweets.Sentiment<0)) ... /height(amazonTweets); hachetteNSR = (sum(hachetteTweets.Sentiment>=0) ... -sum(hachetteTweets.Sentiment<0)) ... /height(hachetteTweets); bothNSR = (sum(bothTweets.Sentiment>=0) ... -sum(bothTweets.Sentiment<0)) ... /height(bothTweets); fprintf('Amazon NSR : %.2f\n',amazonNSR) fprintf('Hachette NSR: %.2f\n',hachetteNSR) fprintf('Both NSR : %.2f\n\n',bothNSR) % plot the sentiment histogram of two brands binranges = min([amazonTweets.Sentiment; ... hachetteTweets.Sentiment; ... bothTweets.Sentiment]): ... max([amazonTweets.Sentiment; ... hachetteTweets.Sentiment; ... bothTweets.Sentiment]); bincounts = [histc(amazonTweets.Sentiment,binranges)... histc(hachetteTweets.Sentiment,binranges)... histc(bothTweets.Sentiment,binranges)]; figure bar(binranges,bincounts,'hist') legend('Amazon','Hachette','Both','Location','Best') title('Sentiment Distribution of 100 Tweets') xlabel('Sentiment Score') ylabel('# Tweets')
Amazon NSR : 0.84 Hachette NSR: 0.58 Both NSR : -0.30
Processing Tweets for Content Visualization
processTweets also has a function tokenize that parses tweets to calculate the word count.
% tokenize tweets with |tokenize| method of |processTweets| [words, dict] = processTweets.tokenize(bothTweets,stopwordsURL); % create a dictionary of unique words dict = unique(dict); % create a word count matrix [~,tdf] = processTweets.getTFIDF(words,dict); % plot the word count figure plot(1:length(dict),sum(tdf),'b.') xlabel('Word Indices') ylabel('Word Count') title('Words contained in the tweets') % annotate high frequency words annotated = find(sum(tdf)>= 10); jitter = 6*rand(1,length(annotated))-3; for i = 1:length(annotated) text(annotated(i)+3, ... sum(tdf(:,annotated(i)))+jitter(i),dict{annotated(i)}) end
Get the Profile of Top 5 Users
Twitty also supports the 'users/show' API to retrieve user profile information. Let's get the profile of the top 5 users based on the follower count.
% sort the user table by follower count in descending order [~,order] = sortrows(bothUsers,'Followers','descend'); % select top 5 users top5users = bothUsers(order(1:5),[3,1,5]); % add a column to store the profile top5users.Description = repmat({''},height(top5users),1); % retrieve user profile for each user for i = 1:5 userInfo = tw.usersShow('user_id', top5users.Id(i)); if ~isempty(userInfo{1}.description) top5users.Description{i} = userInfo{1}.description; end end % print the result disp(top5users(:,2:end))
Name Followers ________________ _________ 'Daton L Fluker' 73578 'WellbeingVigor' 22224 'E-Book Update' 10870 'Michael Rosa' 10297 'Net Tech News' 7487 Description ___________________________________________________________________________ '#Horror #Novelist of Death Keeper's Biological Wasteland, Finished Cri...' 'Writer - 10 years .here, Incurable music enthusiast #' '' '' 'Latest internet and technology news headlines from news sources around...'
Streaming API for High Volume Real Time Tweets
If you need more than 100 tweets to work with, then your only option is to use the Streaming API which provides access to the sampled Twitter fire hose in real time. That also means you need to access the tweets that are currently active. You typically start with a trending topic from a specific location.
You get local trends by specifying the geography with WOEID (Where On Earth ID), available at WOEID Lookup.
uk_woeid = '23424975'; % UK uk_trends = tw.trendsPlace(uk_woeid); uk_trends = cellfun(@(x) x.name, uk_trends{1}.trends, 'UniformOutput',false)';
Once you have the current trends (or download them from here), you can use the Streaming API to retrieve the tweets that mention the trending topic. When you specify an output function with Twitty, the data is store within Twitty. Twitty will process incoming tweets up to the sample size specified, and process data by the batch size specified.
tw.outFcn = @saveTweets; % output function tw.sampleSize = 1000; % default 1000 tw.batchSize = 1; % default 20 tic; tw.filterStatuses('track',uk_trends{1}); % Streaming API call toc uk_trend_data = tw.data; % save the data
% reload the previously saved search result for 4 trending topics in the UK load('uk_data.mat') % plot figure for i = 1:4 % process tweets [users,tweets] = processTweets.extract(uk_data(i).statuses); % get who are mentioned in retweets retweeted = tweets.Mentions(tweets.isRT); retweeted = retweeted(~cellfun('isempty',retweeted)); [screen_names,~,idx] = unique(retweeted); count = accumarray(idx,1); retweeted = table(screen_names,count,'VariableNames',{'Screen_Name','Count'}); % get the users who were mentioned in retweets match = ismember(users.Screen_Name,retweeted.Screen_Name); retweetedUsers = sortrows(users(match,:),'Screen_Name'); match = ismember(retweeted.Screen_Name,retweetedUsers.Screen_Name); retweetedUsers.Retweeted_Count = retweeted.Count(match); [~,order] = sortrows(retweetedUsers,'Retweeted_Count','descend'); % plot each topic subplot(2,2,i) scatter(retweetedUsers.Followers(order),... retweetedUsers.Retweeted_Count(order),retweetedUsers.Freq(order)*50,... retweetedUsers.Freq(order),'fill') if ismember(i, [1,2]) ylim([-20,90]); xpos = 2; ypos1 = 50; ypos2 = 40; elseif i == 3 ylim([-1,7]) xlabel('Follower Count (Log Scale)') xpos = 1010; ypos1 = 0; ypos2 = -1; else ylim([-5,23]) xlabel('Follower Count (Log Scale)') xpos = 110; ypos1 = 20; ypos2 = 17; end % set x axis to log scale set(gca, 'XScale', 'log') if ismember(i, [1,3]) ylabel('Retweeted Count') end title(sprintf('UK Tweets for: "%s"',uk_data(i).query.name)) end
Save an Edge List for Social Graph Visualization
Gephi imports an edge list in CSV format. I added a new method saveEdgeList to processTweet that saves the screen names of the users as source and the hashtags and screen names they mention in their tweets as target in a <https://gephi.org/users/supported-graph-formats/csv-format/ Gephi-ready CSV file.
processTweets.saveEdgeList(uk_data(1).statuses,'edgeList.csv');
File "edgeList.csv" was successfully saved.
Closing
It is quite easy to get started with Twitter Analytics with MATLAB and hopefully you got the taste of what kind of analyses are possible.
We only scratched the surface. Twitter offers many of the most interesting opportunities for data analytics. How would you use Twitter Analytics? Check out some examples from this search result from PLOS ONE that list various papers that used Twitter for their study. Tell us about your Twitty experiences here.
- カテゴリ:
- Fun,
- How To,
- Social Computing