Analyzing Twitter with MATLAB

Posted by Loren Shure, June 4, 2014


Whatever your opinion of social media these days, there is no denying it is now an integral part of our digital life. So much so that social media metrics are now considered part of altmetrics, an alternative to established metrics such as citations for measuring the impact of scientific papers.

Today's guest blogger, Toshi, will show you how to access the Twitter API and analyze tweets with MATLAB.

Why Twitter
Sentiment Analysis
Tweet Content Visualization
Who Tweeted the News?
Does Follower Count Really Matter? Going Viral on Twitter
Visualizing the Retweet Social Graph
Getting Started with Twitter using Twitty
Processing Tweets and Scoring Sentiments
Processing Tweets for Content Visualization
Get the Profile of Top 5 Users
Streaming API for High Volume Real Time Tweets
Save an Edge List for Social Graph Visualization
Closing

Why Twitter

Twitter is a good starting point for social media analysis because people openly share their opinions with the general public. This is very different from Facebook, where social interactions are often private. In this post, I would like to share simple examples of sentiment analysis and social graph visualization using Twitter's Search and Streaming APIs.

The first part of this post discusses the analysis of Twitter data, and the latter part shows the code that performs the computations and creates the plots.

Sentiment Analysis

One of the very common analyses you can perform on a large number of tweets is sentiment analysis. Sentiment is scored based on the words contained in a tweet. If you manage a brand or political campaign, for example, it may be important to keep track of your popularity, and sentiment analysis provides a convenient way to take the pulse of the tweeting public. Here is an example of sentiment analysis between Amazon and Hachette as of this writing, based on 100 tweets collected via the Twitter Search API.

The sentiment distributions are nearly identical between the two brands, but you can see that tweets mentioning both have clearly skewed to the negative, since the news is about a war between Amazon and a publisher over ebook profit margin. Is there a single metric we can use to make this comparison easier? That's where Net Sentiment Rate (NSR) comes in.

NSR = (Positive Tweets - Negative Tweets) / Total Tweets

Here is the result. You could keep taking this measurement periodically for ongoing sentiment monitoring. Perhaps you would discover that NSR is correlated with their stock prices!

Amazon NSR  :  0.84
Hachette NSR:  0.58
Both NSR    : -0.30
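To make the NSR formula concrete, here is a minimal sketch on a handful of made-up sentiment scores. The values are hypothetical and chosen only for illustration; they are not taken from the collected tweets.

```matlab
% Hypothetical sentiment scores for ten tweets (illustrative values only)
scores = [2 -1 0 3 -2 1 0 -4 2 1];

nPositive = sum(scores > 0);   % tweets with a score above zero
nNegative = sum(scores < 0);   % tweets with a score below zero

% Net Sentiment Rate: (positive - negative) / total
nsr = (nPositive - nNegative) / numel(scores);
fprintf('NSR: %.2f\n', nsr)
```

Note that this sketch leaves neutral tweets (score 0) out of both counts. The code later in this post counts them as positive, which yields a higher NSR on the same data.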

Last but not least, did the sentiment scoring actually work? Check out the top 5 positive and negative tweets for Hachette and judge for yourself.

                               Top 5 positive tweets
   ___________________________________________________________________________

   '@deckchairs @OccupyMyCat @aworkinglibrary but I think Hachette artists...'
   '@emzleb Hachette has Rowling so they hold a lot of cards (A LOT of car...'
   'Amazon Confirms Hachette Spat Is To "Get a Better Deal" http://t.co/Ka...'
   '@shaunduke @DarkMatterzine Yeah, Gollancz is owned by Orion Publishing...'
   'MUST READ Book publisher Hachette says working to resolve Amazon dispu...'

                               Top 5 negative tweets
   ___________________________________________________________________________

   'Reading into the Amazon vs. Hachette battle - May 28 - The war between...'
   '#Vtech Reading into the Amazon vs. Hachette battle - May 28 - The war ...'
   '#Vbnss Reading into the Amazon vs. Hachette battle - May 28 - The war ...'
   'RT @text_publishing: Amazon war with Hachette over ebook profit margin...'
   'RT @text_publishing: Amazon war with Hachette over ebook profit margin...'

Tweet Content Visualization

What were the main themes when users mentioned both Amazon and Hachette? The word count plot shows that those tweets mostly repeated news headlines like “Amazon admits dispute (with) Hachette”, perhaps with some commentary, which suggests that Twitter was being used for news amplification.

Who Tweeted the News?

The 100 tweets collected came from 86 users. So on average each user only tweeted 1.16 times. Instead of frequency, let's find out who has a large number of followers (an indicator that they may be influential) and check their profile. It appears that 2 or 3 out of the 5 top users (based on follower count) are writers, and others are news syndication services.

         Name          Followers                        Description
   ________________    _________    ____________________________________________________

   'Daton L Fluker'    73578        '#Horror #Novelist of Death Keeper's Biological Wast...'
   'WellbeingVigor'    22224        'Writer  - 10 years .here, Incurable music enthusiast #'
   'E-Book Update'     10870        ''
   'Michael Rosa'      10297        ''
   'Net Tech News'      7487        'Latest internet and technology news headlines from ...'

Does Follower Count Really Matter? Going Viral on Twitter

In the previous section, we checked out the top 5 users based on their follower count. The assumption was that, if you have a large number of followers, you are considered more influential because more people may see your tweets.

Now let's test this assumption. For that I need more than 100 tweets. So I collected a new batch of data - 1000 tweets from 4 trending topics from the UK, and plotted the users based on their follower counts vs. how often their tweets got retweeted. The size (and the color) of the bubbles show how often those users tweeted.

It looks like you do need some base number of followers to make it to the national level, but the correlation between follower count and how often a user gets retweeted looks weak. The charts look like different stages of viral diffusion: the top two charts clearly show one user breaking away from the rest of the crowd, and in the process they may also have gained more followers. The bottom two charts show a number of users competing for attention, but no one has a clear breakout yet. If this were an animation, it might look like boiling water. Is anyone interested in analyzing whether this is indeed how a tweet goes viral?

Visualizing the Retweet Social Graph

Retweeting of one user's tweet by others creates a network of relationships that can be represented as a social graph. We can visualize such relationships with Gephi, a popular social network analysis tool.

"I Can't Sing" Social Graph

"#InABlackHousehold" Social Graph

You can see that, in the first case, two users formed large clusters of people retweeting their tweets, and everyone else was dwarfed. In the second case, we also see two dominant users, but they have not yet formed a large scale cluster.

Getting Started with Twitter using Twitty

Now that you have seen a simple analysis I did with Twitter, it is time to share how I did it in MATLAB. To get started with Twitter, you need to get your developer credentials. You also need Twitty by Vladimir Bondarenko. It is simple to use and comes with excellent documentation.

Create a Twitter account if you do not already have one
Create a Twitter app to obtain developer credentials
Download and install Twitty from the File Exchange, along with the JSON Parser and optionally JSONlab
Create a structure array to store your credentials for Twitty

Let's search for tweets that mention 'amazon' and 'hachette'.

% a sample structure array to store the credentials
creds = struct('ConsumerKey','your-consumer-key-here',...
    'ConsumerSecret','your-consumer-secret-here',...
    'AccessToken','your-token-here',...
    'AccessTokenSecret','your-token-secret-here');

% set up a Twitty object
addpath twitty_1.1.1; % Twitty
addpath parse_json; % Twitty's default json parser
addpath jsonlab; % I prefer JSONlab, however.
load('creds.mat') % load my real credentials
tw = twitty(creds); % instantiate a Twitty object
tw.jsonParser = @loadjson; % specify JSONlab as json parser

% search for English tweets that mention 'amazon' and 'hachette'
amazon = tw.search('amazon','count',100,'include_entities','true','lang','en');
hachette = tw.search('hachette','count',100,'include_entities','true','lang','en');
both = tw.search('amazon hachette','count',100,'include_entities','true','lang','en');
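The code above loads creds.mat without showing how that file was created. One way to produce it is sketched below, assuming the file simply holds the creds structure written with MATLAB's save function. The file location in the temporary folder is a placeholder for illustration, not the author's actual setup.

```matlab
% Placeholder credentials; replace with the values from your Twitter app
creds = struct('ConsumerKey','your-consumer-key-here', ...
    'ConsumerSecret','your-consumer-secret-here', ...
    'AccessToken','your-token-here', ...
    'AccessTokenSecret','your-token-secret-here');

% Save the structure to a MAT-file (hypothetical location in the temp folder)
credsFile = fullfile(tempdir, 'creds_example.mat');
save(credsFile, 'creds');

% A later session can restore the structure without retyping the keys
loaded = load(credsFile);
disp(fieldnames(loaded.creds))
```

Keeping the real keys in a separate MAT-file like this makes it easier to share scripts without exposing your credentials.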

Processing Tweets and Scoring Sentiments

Twitty stores tweets in a structure array created from the API response in JSON format. I prefer using a table when it comes to working with heterogeneous data containing a mix of numbers and text. I wrote some code, processTweets, to convert structure arrays into tables and compute sentiment scores. You can find the Amazon-Hachette data file here.

For sentiment analysis, I used AFINN, along with a list of English stop words so that we don't count frequent common words like "a" or "the".

% load supporting data for text processing
scoreFile = 'AFINN/AFINN-111.txt';
stopwordsURL ='http://www.textfixer.com/resources/common-english-words.txt';
% load previously saved data
load amazonHachette.mat

% process the structure array with a utility method |extract|
[amazonUsers,amazonTweets] = processTweets.extract(amazon);
% compute the sentiment scores with |scoreSentiment|
amazonTweets.Sentiment = processTweets.scoreSentiment(amazonTweets, ...
    scoreFile,stopwordsURL);

% repeat the process for hachette
[hachetteUsers,hachetteTweets] = processTweets.extract(hachette);
hachetteTweets.Sentiment = processTweets.scoreSentiment(hachetteTweets, ...
    scoreFile,stopwordsURL);

% repeat the process for tweets containing both
[bothUsers,bothTweets] = processTweets.extract(both);
bothTweets.Sentiment = processTweets.scoreSentiment(bothTweets, ...
    scoreFile,stopwordsURL);

% calculate and print NSRs (note: tweets with a score of 0 are counted as positive)
amazonNSR = (sum(amazonTweets.Sentiment>=0) ...
    -sum(amazonTweets.Sentiment<0)) ...
    /height(amazonTweets);
hachetteNSR = (sum(hachetteTweets.Sentiment>=0) ...
    -sum(hachetteTweets.Sentiment<0)) ...
    /height(hachetteTweets);
bothNSR = (sum(bothTweets.Sentiment>=0) ...
    -sum(bothTweets.Sentiment<0)) ...
    /height(bothTweets);
fprintf('Amazon NSR  :  %.2f\n',amazonNSR)
fprintf('Hachette NSR:  %.2f\n',hachetteNSR)
fprintf('Both NSR    : %.2f\n\n',bothNSR)

% plot the sentiment histogram of two brands
binranges = min([amazonTweets.Sentiment; ...
    hachetteTweets.Sentiment; ...
    bothTweets.Sentiment]): ...
    max([amazonTweets.Sentiment; ...
    hachetteTweets.Sentiment; ...
    bothTweets.Sentiment]);
bincounts = [histc(amazonTweets.Sentiment,binranges)...
    histc(hachetteTweets.Sentiment,binranges)...
    histc(bothTweets.Sentiment,binranges)];
figure
bar(binranges,bincounts,'hist')
legend('Amazon','Hachette','Both','Location','Best')
title('Sentiment Distribution of 100 Tweets')
xlabel('Sentiment Score')
ylabel('# Tweets')

Amazon NSR  :  0.84
Hachette NSR:  0.58
Both NSR    : -0.30
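The internals of scoreSentiment are not listed in this post. As a rough illustration of how word-list scoring of the AFINN kind can work, here is a minimal sketch that sums per-word scores after dropping stop words. The tiny lexicon, the stop word list, and the sample tweet are all made up for this example, and the real method may differ.

```matlab
% Hypothetical mini word list in the spirit of AFINN (word -> integer score)
lexicon = containers.Map({'good','great','bad','war','dispute'}, ...
                         {3, 3, -3, -2, -2});
stopwords = {'a','the','is','over','with'};   % illustrative stop words

tweet = 'The war with Hachette is bad';       % made-up sample tweet

% Split into lowercase words and drop the stop words
tokens = regexp(lower(tweet), '[a-z]+', 'match');
tokens = tokens(~ismember(tokens, stopwords));

% Add up the scores of the words found in the lexicon
score = 0;
for k = 1:numel(tokens)
    if isKey(lexicon, tokens{k})
        score = score + lexicon(tokens{k});
    end
end
fprintf('Sentiment score: %d\n', score)
```

Words that are not in the lexicon, such as "hachette" here, contribute nothing to the score.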

Processing Tweets for Content Visualization

processTweets also has a function tokenize that parses tweets to calculate the word count.

% tokenize tweets with |tokenize| method of |processTweets|
[words, dict] = processTweets.tokenize(bothTweets,stopwordsURL);
% create a dictionary of unique words
dict = unique(dict);
% create a word count matrix
[~,tdf] = processTweets.getTFIDF(words,dict);

% plot the word count
figure
plot(1:length(dict),sum(tdf),'b.')
xlabel('Word Indices')
ylabel('Word Count')
title('Words contained in the tweets')
% annotate high frequency words
annotated = find(sum(tdf)>= 10);
jitter = 6*rand(1,length(annotated))-3;
for i = 1:length(annotated)
    text(annotated(i)+3, ...
        sum(tdf(:,annotated(i)))+jitter(i),dict{annotated(i)})
end
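Since processTweets is not listed here, the following is only a sketch of how a word count matrix like tdf could be built, assuming one row per tweet and one column per dictionary word. The three tweets are made-up examples, and splitting on spaces is a simplification of real tokenization.

```matlab
% Made-up, already cleaned tweets (illustrative only)
tweets = {'amazon admits dispute hachette', ...
          'amazon hachette dispute', ...
          'hachette says working resolve amazon dispute'};

% Split each tweet into words and build a sorted dictionary of unique words
tokens = cellfun(@(t) strsplit(t, ' '), tweets, 'UniformOutput', false);
dict = unique([tokens{:}]);

% Count how often each dictionary word occurs in each tweet
tdf = zeros(numel(tweets), numel(dict));
for i = 1:numel(tweets)
    [~, col] = ismember(tokens{i}, dict);
    tdf(i, :) = accumarray(col(:), 1, [numel(dict) 1])';
end

% Summing over tweets gives the overall word count, as in sum(tdf) above
wordCount = sum(tdf, 1);
for k = find(wordCount >= 3)
    fprintf('%s: %d\n', dict{k}, wordCount(k));
end
```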

Get the Profile of Top 5 Users

Twitty also supports the 'users/show' API to retrieve user profile information. Let's get the profile of the top 5 users based on the follower count.

% sort the user table by follower count in descending order
[~,order] = sortrows(bothUsers,'Followers','descend');
% select top 5 users
top5users = bothUsers(order(1:5),[3,1,5]);
% add a column to store the profile
top5users.Description = repmat({''},height(top5users),1);
% retrieve user profile for each user
for i = 1:5
    userInfo = tw.usersShow('user_id', top5users.Id(i));
    if ~isempty(userInfo{1}.description)
        top5users.Description{i} = userInfo{1}.description;
    end
end
% print the result
disp(top5users(:,2:end))

          Name          Followers
    ________________    _________
    'Daton L Fluker'    73578    
    'WellbeingVigor'    22224    
    'E-Book Update'     10870    
    'Michael Rosa'      10297    
    'Net Tech News'      7487    

                                    Description                                
    ___________________________________________________________________________
    '#Horror #Novelist of Death Keeper's Biological Wasteland, Finished Cri...'
    'Writer  - 10 years .here, Incurable music enthusiast #'                   
    ''                                                                         
    ''                                                                         
    'Latest internet and technology news headlines from news sources around...'
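The sort-and-select step can be tried without a Twitter connection. This sketch rebuilds a small user table from the names and follower counts printed above, deliberately out of order, and picks the top 3 by follower count. The variable names are chosen for this example only.

```matlab
% Names and follower counts from the output above, in shuffled order
Name = {'Net Tech News'; 'Daton L Fluker'; 'Michael Rosa'; ...
        'WellbeingVigor'; 'E-Book Update'};
Followers = [7487; 73578; 10297; 22224; 10870];
users = table(Name, Followers);

% Sort by follower count in descending order and keep the first three rows
sorted = sortrows(users, 'Followers', 'descend');
top3 = sorted(1:3, :);
disp(top3)
```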

Streaming API for High Volume Real Time Tweets

If you need more than 100 tweets to work with, then your only option is to use the Streaming API, which provides access to a sample of the Twitter firehose in real time. That also means you can only collect tweets on topics that are currently active, so you typically start with a trending topic from a specific location.

You get local trends by specifying the geography with WOEID (Where On Earth ID), available at WOEID Lookup.

uk_woeid = '23424975'; % UK
uk_trends = tw.trendsPlace(uk_woeid);
uk_trends = cellfun(@(x) x.name, uk_trends{1}.trends, 'UniformOutput',false)';
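The cellfun call above relies on a nested response: a cell array whose first element has a trends field holding a cell array of structures, each with a name field. That shape is inferred from the code itself rather than from Twitty's documentation. Here is a self-contained sketch with a mock response, using made-up topic names:

```matlab
% Mock response mimicking the shape the code above indexes into (assumed)
mock_trends = {struct('trends', {{struct('name','#TopicA'), ...
                                  struct('name','#TopicB'), ...
                                  struct('name','Topic C')}})};

% Pull the name out of each trend; 'UniformOutput',false returns a cell array
names = cellfun(@(x) x.name, mock_trends{1}.trends, ...
    'UniformOutput', false)';
disp(names)
```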

Once you have the current trends (or download them from here), you can use the Streaming API to retrieve the tweets that mention the trending topic. When you specify an output function with Twitty, the data is stored within Twitty. Twitty will process incoming tweets up to the sample size specified, in batches of the batch size specified.

tw.outFcn = @saveTweets; % output function
tw.sampleSize = 1000;  % default 1000
tw.batchSize = 1; % default 20
tic;
tw.filterStatuses('track',uk_trends{1}); % Streaming API call
toc
uk_trend_data = tw.data; % save the data

% reload the previously saved search result for 4 trending topics in the UK
load('uk_data.mat')

% plot
figure
for i = 1:4
    % process tweets
    [users,tweets] = processTweets.extract(uk_data(i).statuses);

    % get who are mentioned in retweets
    retweeted = tweets.Mentions(tweets.isRT);
    retweeted = retweeted(~cellfun('isempty',retweeted));
    [screen_names,~,idx] = unique(retweeted);
    count = accumarray(idx,1);
    retweeted = table(screen_names,count,'VariableNames',{'Screen_Name','Count'});

    % get the users who were mentioned in retweets
    match = ismember(users.Screen_Name,retweeted.Screen_Name);
    retweetedUsers = sortrows(users(match,:),'Screen_Name');
    match = ismember(retweeted.Screen_Name,retweetedUsers.Screen_Name);
    retweetedUsers.Retweeted_Count = retweeted.Count(match);
    [~,order] = sortrows(retweetedUsers,'Retweeted_Count','descend');

    % plot each topic
    subplot(2,2,i)
    scatter(retweetedUsers.Followers(order),...
        retweetedUsers.Retweeted_Count(order),retweetedUsers.Freq(order)*50,...
        retweetedUsers.Freq(order),'fill')

    if ismember(i, [1,2])
        ylim([-20,90]); xpos = 2; ypos1 = 50; ypos2 = 40;
    elseif i == 3
        ylim([-1,7])
        xlabel('Follower Count (Log Scale)')
        xpos = 1010; ypos1 = 0; ypos2 = -1;
    else
        ylim([-5,23])
        xlabel('Follower Count (Log Scale)')
        xpos = 110; ypos1 = 20; ypos2 = 17;
    end

    % set x axis to log scale
    set(gca, 'XScale', 'log')

    if ismember(i, [1,3])
        ylabel('Retweeted Count')
    end
    title(sprintf('UK Tweets for: "%s"',uk_data(i).query.name))
end

Save an Edge List for Social Graph Visualization

Gephi imports an edge list in CSV format. I added a new method, saveEdgeList, to processTweets that saves the screen names of the users as the source, and the hashtags and screen names they mention in their tweets as the target, in a Gephi-ready CSV file (https://gephi.org/users/supported-graph-formats/csv-format/).

processTweets.saveEdgeList(uk_data(1).statuses,'edgeList.csv');

File "edgeList.csv" was successfully saved.
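The saveEdgeList method is not listed in this post. As a rough idea of what writing an edge list involves, this sketch writes one Source,Target row per mention using plain file I/O. The screen names, the mention lists, and the output file name are hypothetical, and the real method may format its output differently.

```matlab
% Hypothetical tweets: who tweeted and which screen names they mentioned
source  = {'alice', 'bob', 'carol'};
targets = {{'dave'}, {'dave','erin'}, {}};

% Write to a file in the temp folder (hypothetical file name)
outFile = fullfile(tempdir, 'edgeList_example.csv');
fid = fopen(outFile, 'w');
assert(fid ~= -1, 'Could not open %s for writing', outFile);

% One header row, then one row per (source, target) pair
fprintf(fid, 'Source,Target\n');
nEdges = 0;
for i = 1:numel(source)
    for j = 1:numel(targets{i})
        fprintf(fid, '%s,%s\n', source{i}, targets{i}{j});
        nEdges = nEdges + 1;
    end
end
fclose(fid);
fprintf('Wrote %d edges to %s\n', nEdges, outFile);
```

Users who mention no one, like the third one here, produce no rows and therefore do not appear in the graph unless someone else mentions them.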

Closing

It is quite easy to get started with Twitter analytics in MATLAB, and hopefully you got a taste of the kinds of analyses that are possible.

We only scratched the surface. Twitter offers many interesting opportunities for data analytics. How would you use Twitter analytics? Check out some examples in this search result from PLOS ONE, which lists various papers that used Twitter in their studies. Tell us about your Twitty experiences here.

Published with MATLAB® R2014a