Loren on the Art of MATLAB

Turn ideas into MATLAB

Analyzing Fake News with Twitter

Social media has become an important part of modern life, and Twitter is once again at the center of recent events. Today's guest blogger, Toshi Takeuchi, gives us an update on how you can use MATLAB to analyze a Twitter feed.

Twitter Revisited

When I wrote about analyzing Twitter with MATLAB back in 2014, I didn't expect that, 3 years later, Twitter would come to play such a huge role in politics. There have been a lot of changes in MATLAB in those years as well, so perhaps it is time to revisit this topic. We have heard a lot about fake news since the US Presidential Election of 2016; let's use Twitter to analyze this phenomenon. While fake news spreads mainly on Facebook, Twitter is the favorite social media platform of the journalists who discuss it.

Load Tweets

I collected 1,000 tweets containing the term 'fake news' using the Streaming API and saved them in fake_news.mat. Let's start processing the tweets by looking at the top 10 users by follower count.

load fake_news                                              % load data
t = table;                                                  % initialize a table
t.names = arrayfun(@(x) x.status.user.name, ...             % get user names
    fake_news.statuses, 'UniformOutput', false);
t.names = regexprep(t.names,'[^a-zA-Z .,'']','');           % remove non-ascii
t.screen_names = arrayfun(@(x) ...                          % get screen names
    x.status.user.screen_name, fake_news.statuses, 'UniformOutput', false);
t.followers_count = arrayfun(@(x)  ...                      % get followers count
    x.status.user.followers_count, fake_news.statuses);
t = unique(t,'rows');                                       % remove duplicates
t = sortrows(t,'followers_count', 'descend');               % rank users
disp(t(1:10,:))                                             % show the table
           names              screen_names       followers_count
    ____________________    _________________    _______________
    'Glenn Greenwald'       'ggreenwald'         7.9605e+05     
    'Soledad O'Brien'       'soledadobrien'      5.6769e+05     
    'Baratunde'             'baratunde'          2.0797e+05     
    'Kenneth Roth'          'KenRoth'            1.9189e+05     
    'Stock Trade Alerts'    'AlertTrade'         1.1921e+05     
    'SokoAnalyst'           'SokoAnalyst'        1.1864e+05     
    'Tactical Investor'     'saul42'                  98656     
    'Vladimir Bajic'        'trend_auditor'           70502     
    'Marketing Gurus'       'MarketingGurus2'         68554     
    'Jillian C. York '      'jilliancyork'            53744     

Short Urls

Until recently, Twitter had a 140-character limit per tweet, including links, so when people embedded urls in their tweets, they typically used url shortening services. To identify the actual sources, we need the expanded urls that those short urls point to. To do this, I wrote a utility function, expandUrl, that takes advantage of the new HTTP interface introduced in R2016b. You can see that I create URI and RequestMessage objects and use the send method to get a ResponseMessage object.

dbtype expandUrl 25:32
25    import  matlab.net.* matlab.net.http.*                  % http interface libs
26    for ii = 1:length(urls)                                 % for each url
27        if contains(urls(ii),shorteners)                    % if shortened
28            uri = URI(urls(ii));                            % create URI obj
29            r = RequestMessage;                             % request object
30            options = HTTPOptions('MaxRedirects',0);        % prevent redirect
31            try                                             % try
32                response = r.send(uri,options);             % send http request

Let's give it a try.

expanded = char(expandUrl('http://trib.al/ZQuUDNx'));       % expand url
disp([expanded(1:70) '...'])
https://hbr.org/2017/01/the-u-s-medias-problems-are-much-bigger-than-f...
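
The listing stops just short of the interesting part: when redirects are disabled, a 301 or 302 response carries the destination url in its Location header. Here is a minimal sketch of that idea, not the actual expandUrl code, which also loops over urls, checks them against a list of known shorteners, optionally removes query parameters, and wraps the request in try/catch:

import matlab.net.* matlab.net.http.*                       % http interface libs
uri = URI('http://trib.al/ZQuUDNx');                        % a short url
r = RequestMessage;                                         % GET by default
options = HTTPOptions('MaxRedirects',0);                    % don't follow redirects
response = r.send(uri,options);                             % one round trip
if response.StatusCode == StatusCode.MovedPermanently || ...
        response.StatusCode == StatusCode.Found             % 301 or 302?
    loc = response.getFields('Location');                   % redirect target
    disp(loc.Value)                                         % the expanded url
end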

Tokenize Tweets

To get a sense of what was being discussed in those tweets and what sentiments were represented there, we need to process the text.

  • Our first step is to turn the tweets into tokens.
  • Once we have tokens, we can use them to compute sentiment scores based on lexicons like AFINN (a toy illustration of that lookup follows this list).
  • We can also use the tokens to visualize the tweets as a word cloud.

We also want to collect embedded links along the way.
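
To make the lexicon lookup concrete before the full loop below, here is a toy example with a hypothetical mini-lexicon (made-up terms and scores, not the real AFINN file): a tweet's score is simply the sum of the scores of its tokens that appear in the lexicon.

lex = table(string({'good';'bad';'fake'}),[3;-3;-3], ...    % made-up scores
    'VariableNames',{'Term','Score'});
s = string({'fake','news','is','bad'});                     % token list
score = sum(lex.Score(ismember(lex.Term,s)))                % -3 + -3 = -6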

delimiters = {' ','$','/','.','-',':','&','*', ...          % remove those
    '+','=','[',']','?','!','(',')','{','}',',', ...
    '"','>','_','<',';','%',char(10),char(13)};
AFINN = readtable('AFINN/AFINN-111.txt', ...                % load score file
    'Delimiter','\t','ReadVariableNames',0);
AFINN.Properties.VariableNames = {'Term','Score'};          % add var names
stopwordsURL ='http://www.textfixer.com/resources/common-english-words.txt';
stopWords = webread(stopwordsURL);                          % read stop words
stopWords = split(string(stopWords),',');                   % split stop words
tokens = cell(fake_news.tweetscnt,1);                       % cell array as accumulator
expUrls = strings(fake_news.tweetscnt,1);                   % string array as accumulator
dispUrls = strings(fake_news.tweetscnt,1);                  % string array as accumulator
scores = zeros(fake_news.tweetscnt,1);                      % initialize accumulator
for ii = 1:fake_news.tweetscnt                              % loop over tweets
    tweet = string(fake_news.statuses(ii).status.text);     % get tweet
    s = split(tweet, delimiters)';                          % split tweet by delimiters
    s = lower(s);                                           % use lowercase
    s = regexprep(s, '[0-9]+','');                          % remove numbers
    s = regexprep(s,'(http|https)://[^\s]*','');            % remove urls
    s = erase(s,'''s');                                     % remove possessive s
    s(s == '') = [];                                        % remove empty strings
    s(ismember(s, stopWords)) = [];                         % remove stop words
    tokens{ii} = s;                                         % add to the accumulator
    scores(ii) = sum(AFINN.Score(ismember(AFINN.Term,s)));  % add to the accumulator
    if ~isempty( ...                                        % if display_url exists
            fake_news.statuses(ii).status.entities.urls) && ...
            isfield(fake_news.statuses(ii).status.entities.urls,'display_url')
        durl = fake_news.statuses(ii).status.entities.urls.display_url;
        durl = regexp(durl,'^(.*?)\/','match','once');      % get its domain name
        dispUrls(ii) = durl(1:end-1);                       % add to dispUrls
        furl = fake_news.statuses(ii).status.entities.urls.expanded_url;
        furl = expandUrl(furl,'RemoveParams',1);            % expand links
        expUrls(ii) = expandUrl(furl,'RemoveParams',1);     % one more time
    end
end

Now we can create the document term matrix. We will also do the same thing for embedded links.

dict = unique([tokens{:}]);                                 % unique words
domains = unique(dispUrls);                                 % unique domains
domains(domains == '') = [];                                % remove empty string
links = unique(expUrls);                                    % unique links
links(links == '') = [];                                    % remove empty string
DTM = zeros(fake_news.tweetscnt,length(dict));              % Doc Term Matrix
DDM = zeros(fake_news.tweetscnt,length(domains));           % Doc Domain Matrix
DLM = zeros(fake_news.tweetscnt,length(links));             % Doc Link Matrix
for ii = 1:fake_news.tweetscnt                              % loop over tokens
    [words,~,idx] = unique(tokens{ii});                     % get unique words
    wcounts = accumarray(idx, 1);                           % get word counts
    cols = ismember(dict, words);                           % find cols for words
    DTM(ii,cols) = wcounts;                                 % update DTM with word counts
    cols = ismember(domains,dispUrls(ii));                  % find col for domain
    DDM(ii,cols) = 1;                                       % mark domain in DDM
    expanded = expandUrl(expUrls(ii));                      % expand links
    expanded = expandUrl(expanded);                         % one more time
    cols = ismember(links,expanded);                        % find col for link
    DLM(ii,cols) = 1;                                       % mark link in DLM
end
DTM(:,ismember(dict,{'#','@'})) = [];                       % remove # and @
dict(ismember(dict,{'#','@'})) = [];                        % remove # and @

Sentiment Analysis

One of the typical analyses performed on a Twitter feed is sentiment analysis. The histogram below shows, not surprisingly, that those tweets were mostly very negative. We can summarize this with the Net Sentiment Rate (NSR): the number of positive tweets minus the number of negative tweets, divided by the total number of tweets.

NSR = (sum(scores >= 0) - sum(scores < 0)) / length(scores);% net sentiment rate
figure                                                      % new figure
histogram(scores,'Normalization','probability')             % plot distribution
line([0 0], [0 .35],'Color','r');                           % reference line
title(['Sentiment Score Distribution of "Fake News" ' ...   % add title
    sprintf('(NSR: %.2f)',NSR)])
xlabel('Sentiment Score')                                   % x-axis label
ylabel('% Tweets')                                          % y-axis label
yticklabels(string(0:5:35))                                 % y-axis ticks
text(-10,.25,'Negative');text(3,.25,'Positive');            % annotate

What Words Appear Frequently in Tweets?

Now let's plot the word frequency to visualize what was discussed in those tweets. They seem to be about dominant news headlines at the time the tweets were collected.

count = sum(DTM);                                           % get word count
labels = erase(dict(count >= 40),'@');                      % high freq words
pos = [find(count >= 40);count(count >= 40)] + 0.1;         % x y positions
figure                                                      % new figure
scatter(1:length(dict),count)                               % scatter plot
text(pos(1,1),pos(2,1)+3,cellstr(labels(1)),...             % place labels
    'HorizontalAlignment','center');
text(pos(1,2),pos(2,2)-2,cellstr(labels(2)),...
    'HorizontalAlignment','right');
text(pos(1,3),pos(2,3)-4,cellstr(labels(3)));
text(pos(1,3:end),pos(2,3:end),cellstr(labels(3:end)));
title('Frequent Words in Tweets Mentioning Fake News')      % add title
xlabel('Indices')                                           % x-axis label
ylabel('Count')                                             % y-axis label
ylim([0 150])                                               % y-axis range

What Hashtags Appear Frequently in Tweets?

Hashtags, which start with "#", are often used to identify the main theme of a tweet, and, as you would expect, we again see hashtags related to the dominant news of the time.

is_hash = startsWith(dict,'#') & dict ~= '#';               % get indices
hashes = erase(dict(is_hash),'#');                          % get hashtags
hash_count = count(is_hash);                                % get count
labels = hashes(hash_count >= 4);                           % high freq tags
pos = [find(hash_count >= 4) + 1; ...                       % x y positions
    hash_count(hash_count >= 4) + 0.1];
figure                                                      % new figure
scatter(1:length(hashes),hash_count)                        % scatter plot
text(pos(1,1),pos(2,1)- .5,cellstr(labels(1)),...           % place labels
    'HorizontalAlignment','center');
text(pos(1,2:end-1),pos(2,2:end-1),cellstr(labels(2:end-1)));
text(pos(1,end),pos(2,end)-.5,cellstr(labels(end)),...
    'HorizontalAlignment','right');
title('Frequently Used Hashtags')                           % add title
xlabel('Indices')                                           % x-axis label
ylabel('Count')                                             % y-axis label
ylim([0 15])                                                % y-axis range

Who Got Frequent Mentions in Tweets?

Twitter is also a communication medium: people can direct their tweets to specific users by including those users' screen names, prefixed with "@", in the tweets. These are called "mentions". We can see that one particular user got a lot of mentions.

is_ment = startsWith(dict,'@') & dict ~= '@';               % get indices
mentions = erase(dict(is_ment),'@');                        % get mentions
ment_count = count(is_ment);                                % get count
labels = mentions(ment_count >= 10);                        % high freq mentions
pos = [find(ment_count >= 10) + 1; ...                      % x y positions
    ment_count(ment_count >= 10) + 0.1];
figure                                                      % new figure
scatter(1:length(mentions),ment_count)                      % scatter plot
text(pos(1,:),pos(2,:),cellstr(labels));                    % place labels
title('Frequent Mentions')                                  % add title
xlabel('Indices')                                           % x-axis label
ylabel('Count')                                             % y-axis label
ylim([0 100])                                               % y-axis range

Frequently Cited Web Sites

You can also embed a link in a tweet, usually for citing sources and directing people to get more details from those sources. This tends to show where the original information came from.

Twitter was the most frequently cited source, which was interesting to me. Usually, if you want to cite other tweets, you retweet them, and the original user gets credit. By embedding a link instead of retweeting, people circumvent this mechanism. Very curious.

count = sum(DDM);                                           % get domain count
labels = domains(count > 5);                                % high freq citations
pos = [find(count > 5) + 1;count(count > 5) + 0.1];         % x y positions
figure                                                      % new figure
scatter(1:length(domains),count)                            % scatter plot
text(pos(1,:),pos(2,:),cellstr(labels));                    % place labels
title('Frequently Cited Web Sites')                         % add title
xlabel('Indices')                                           % x-axis label
ylabel('Count')                                             % y-axis label

Frequently Cited Sources

You can also see that many of those web sites are url shortening services. Let's find the real urls behind those short urls.

count = sum(DLM);                                           % get link count
labels = links(count >= 15);                                % high freq citations
pos = [find(count >= 15) + 1;count(count >= 15)];           % x y positions
figure                                                      % new figure
scatter(1:length(links),count)                              % scatter plot
text(ones(size(pos(1,:))),pos(2,:)-2,cellstr(labels));      % place labels
title('Frequently Cited Sources')                           % add title
xlabel('Indices')                                           % x-axis label
ylabel('Count')                                             % y-axis label

Generating a Social Graph

Now let's think of a way to see the associations between users and the entities included in their tweets, to reveal their relationships. We have a matrix of tweets by words, and we can convert it into a matrix of users by entities, such as hashtags, mentions, and links.

users = arrayfun(@(x) x.status.user.screen_name, ...        % screen names
    fake_news.statuses, 'UniformOutput', false);
uniq = unique(users);                                       % remove duplicates
combo = [DTM DLM];                                          % combine matrices
UEM = zeros(length(uniq),size(combo,2));                    % User Entity Matrix
for ii = 1:length(uniq)                                     % for unique user
    UEM(ii,:) = sum(combo(ismember(users,uniq(ii)),:),1);   % sum cols
end
cols = is_hash | is_ment;                                   % hashtags, mentions
cols = [cols true(1,length(links))];                        % add links
UEM = UEM(:,cols);                                          % select those cols
ent = dict(is_hash | is_ment);                              % select entities
ent = [ent links'];                                         % add links

Handling Mentions

Some of the mentioned accounts are also authors of the tweets in our dataset, and others are not. When one user mentions another, that forms a user-user edge rather than a user-entity edge. To map such edges correctly, we want to treat mentioned users separately.

ment_users = uniq(ismember(uniq,mentions));                 % mentioned users
is_ment = ismember(ent,'@' + string(ment_users));           % their mentions
ent(is_ment) = erase(ent(is_ment),'@');                     % remove @
UUM = zeros(length(uniq));                                  % User User Matrix
for ii =  1:length(ment_users)                              % for each ment user
    row = string(uniq) == ment_users{ii};                   % get row
    col = ent == ment_users{ii};                            % get col
    UUM(row,ii) = UEM(row,col);                             % copy count
end

Creating the Edge List

Now we can prepend the user-to-user matrix to the existing user-to-entity matrix, but we also need to remove the mentioned users from the entities, since they are already accounted for in the user-to-user matrix.

All we need to do then is turn that into a sparse matrix and find the indices of the nonzero elements. We can then use those indices as the edge list.
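
As a tiny illustration of that idiom, with a made-up 3-node matrix rather than our real data, find on a sparse matrix returns exactly the row and column indices we need:

A = [0 2 0; 0 0 1; 0 0 0];                                  % 3 nodes, 2 edges
[i,j,w] = find(sparse(A));                                  % nonzeros as triplets
disp([i j w])                                               % edges 1->2 (w=2), 2->3 (w=1)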

UEM(:,is_ment) = [];                                        % remove mentioned users
UEM = [UUM, UEM];                                           % add UUM to adj
nodes = [uniq; cellstr(ent(~is_ment))'];                    % create node list
s = sparse(UEM);                                            % sparse matrix
[i,j,s] = find(s);                                          % find indices

Creating the Graph

Once you have the edge list, it is a piece of cake to make a social graph from it. Since our relationships have directions (user --> entity), we create a directed graph with digraph. The nodes are sized and colored based on the number of incoming edges, called the in-degree. As you can see, most tweets form small disjoint clusters, but there are also some large ones.

G = digraph(i,j);                                           % directed graph
G.Nodes.Name = nodes;                                       % add node names
figure                                                      % new figure
colormap cool                                               % set color map
deg = indegree(G);                                          % get indegrees
markersize = log(deg + 2) * 2;                              % indeg for marker size
plot(G,'MarkerSize',markersize,'NodeCData',deg)             % plot graph
c = colorbar; c.Label.String = 'In-degrees';                % add colorbar
title('Graph of Tweets containing "Fake News"')             % add title
xticklabels(''); yticklabels('');                           % hide tick labels

Zooming into the Largest Subgraph

Let's zoom into the largest subgraph to see the details. This gives a much clearer idea of what those tweets were about, because you can see who was mentioned and what sources were cited. You can see that a New York Times opinion column and an article from Sweden generated a lot of tweets, along with the users mentioned in them.

bins = conncomp(G,'OutputForm','cell','Type','weak');       % get connected comps
binsizes = cellfun(@length,bins);                           % get bin sizes
[~,idx] = max(binsizes);                                    % find biggest comp
subG = subgraph(G,bins{idx});                               % create sub graph
figure                                                      % new figure
colormap cool                                               % set color map
deg = indegree(subG);                                       % get indegrees
markersize = log(deg + 2) * 2;                              % indeg for marker size
h = plot(subG,'MarkerSize',markersize,'NodeCData',deg);     % plot graph
c = colorbar; c.Label.String = 'In-degrees';                % add colorbar
title('The Largest Subgraph (Close-up)')                    % add title
xticklabels(''); yticklabels('');                           % hide tick labels
[~,rank] = sort(deg,'descend');                             % get ranking
top15 = subG.Nodes.Name(rank(1:15));                        % get top 15
labelnode(h,top15,top15);                                   % label nodes
axis([-.5 2.5 -1.6 -0.7]);                                  % define axis limits

Using Twitty

If you want to analyze Twitter for different topics, you need to collect your own tweets. For this analysis I used Twitty by Vladimir Bondarenko. It hasn't been updated since July 2013, but it still works. Let's go over how to use Twitty. I am assuming that you already have your developer credentials and have downloaded Twitty into your current folder. The workspace variable creds should contain your credentials in a struct of the following format:

creds = struct;                                             % example
creds.ConsumerKey = 'your consumer key';
creds.ConsumerSecret = 'your consumer secret';
creds.AccessToken = 'your token';
creds.AccessTokenSecret = 'your token secret';

Twitty by default expects the JSON Parser by Joel Feenstra. However, I would like to use the built-in functions jsonencode and jsondecode, introduced in R2016b, instead. To suppress the warning Twitty generates, I use the warning function.
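
If you haven't used these functions before, a quick round trip shows how they work: jsondecode turns JSON text into MATLAB structs and arrays, and jsonencode goes the other way.

s = jsondecode('{"name":"mathworks","tags":["matlab","json"]}');
disp(s.name)                                                % mathworks
disp(jsonencode(s))                                         % back to JSON text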

warning('off')                                              % turn off warning
addpath twitty_1.1.1;                                       % add Twitty folder to the path
load creds                                                  % load my real credentials
tw = twitty(creds);                                         % instantiate a Twitty object
warning('on')                                               % turn on warning

Twitter Search API Example

Since Twitty returns JSON as plain text if you don't specify a parser, you can use jsondecode on its output. The number of tweets you can get from the Search API is limited to 100 per request; if you need more, you usually use the Streaming API.

keyword = 'nfl';                                            % keyword to search
tweets = tw.search(keyword,'count',100,'include_entities','true','lang','en');
tweets = jsondecode(tweets);                                % parse JSON
tweet = tweets.statuses{1}.text;                            % index into text
disp([tweet(1:70) '...'])                                   % show 70 chars
RT @JBaezaTopDawg: .@NFL will be announcing a @Patriots v @RAIDERS mat...

Twitter Trending Topic API Example

If you want a high-volume topic with thousands of tweets, one way to find one is to look at the trending topics. Those topics will give you plenty of tweets to work with.

us_woeid = 23424977;                                        % US as location
us_trends = tw.trendsPlace(us_woeid);                       % get trending topics
us_trends = jsondecode(us_trends);                          % parse JSON
trends = arrayfun(@(x) x.name, us_trends.trends, 'UniformOutput',false);
disp(trends(1:10))
    'Beyoncé'
    'Rex Tillerson'
    '#NSD17'
    '#PressOn'
    'DeVos'
    'Roger Goodell'
    '#nationalsigningday'
    'Skype'
    '#wednesdaywisdom'
    '#MyKindOfPartyIncludes'

Twitter Streaming API Example

Once you find a high-volume topic to work with, you can use the Streaming API to get tweets that contain it. Twitty stores the retrieved tweets in its 'data' property, and what gets saved is defined in an output function like saveTweets.m. 'S' in this case is a character array of JSON-formatted text, and since we didn't specify a JSON parser, we use jsondecode to convert it into a struct.

dbtype twitty_1.1.1/saveTweets.m 17:24
17    % Parse input:
18    S = jsondecode(S);
19    
20    if length(S)== 1 && isfield(S, 'statuses')
21        T = S{1}.statuses;
22    else
23        T = S;
24    end

Now let's give it a try. By default, Twitty gets 20 batches of 1,000 tweets, or 20,000 tweets in total, but that would take a long time. We will just get 10 tweets in this example.

keyword = 'nfl';                                            % specify keyword
tw.outFcn = @saveTweets;                                    % output function
tw.sampleSize = 10;                                         % default 1000
tw.batchSize = 1;                                           % default 20
tw.filterStatuses('track',keyword);                         % Streaming API call
result = tw.data;                                           % save the data
length(result.statuses)                                     % number of tweets
tweet = result.statuses(1).status.text;                     % get a tweet
disp([tweet(1:70) '...'])                                   % show 70 chars
Tweets processed: 1 (out of 10).
Tweets processed: 2 (out of 10).
Tweets processed: 3 (out of 10).
Tweets processed: 4 (out of 10).
Tweets processed: 5 (out of 10).
Tweets processed: 6 (out of 10).
Tweets processed: 7 (out of 10).
Tweets processed: 8 (out of 10).
Tweets processed: 9 (out of 10).
Tweets processed: 10 (out of 10).
ans =
    10
RT @Russ_Mac876: Michael Jackson is still the greatest https://t.co/BE...

Summary - Visit Andy's Developer Zone for More

In this post you saw how you can analyze tweets using more recent MATLAB features, such as the HTTP interface for expanding short urls. You also got a quick tutorial on how to use Twitty to collect tweets for your own purposes.

Twitty covers the basic needs, but you can go beyond it and roll your own tool by taking advantage of the new HTTP interface. I show you how in a second blog post I wrote for Andy's Developer Zone.

Now that you understand how you can use Twitter to analyze social issues like fake news, tell us how you would put it to good use here.




Published with MATLAB® R2016b

