{"id":1227,"date":"2015-09-30T10:08:41","date_gmt":"2015-09-30T15:08:41","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1227"},"modified":"2024-01-01T17:00:31","modified_gmt":"2024-01-01T22:00:31","slug":"can-we-predict-a-breakup-social-network-analysis-with-matlab","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2015\/09\/30\/can-we-predict-a-breakup-social-network-analysis-with-matlab\/","title":{"rendered":"Can We Predict a Breakup? Social Network Analysis with MATLAB"},"content":{"rendered":"<div class=\"content\">\r\n\r\n<!--introduction-->\r\n\r\nThe <a href=\"https:\/\/www.mathworks.com\/products\/new_products\/latest_features.html\">R2015b release<\/a> is here and one of the exciting new features lets us create, explore, and analyze graphs and networks. Today's guest blogger, Toshi Takeuchi, shows you how to get started with social network analysis using this new feature.\r\n\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup.jpg\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n\r\n<!--\/introduction-->\r\n<h3>Contents<\/h3>\r\n<div>\r\n<ul>\r\n \t<li><a href=\"#8bbe022f-94ba-4732-b770-cb2c1afdad14\">Social network analysis and machine learning<\/a><\/li>\r\n \t<li><a href=\"#d3378597-c3e3-4b10-9b5c-1b70916c2334\">Zachary's Karate Club Dataset<\/a><\/li>\r\n \t<li><a href=\"#a549d8fd-6653-403c-959a-112fe7f79c01\">Let's load the dataset<\/a><\/li>\r\n \t<li><a href=\"#a8536cf6-aebf-462e-84bd-f11786cecdf7\">Creating Graph Object<\/a><\/li>\r\n \t<li><a href=\"#9cd54df8-ef3a-4ea6-8abb-f52ac5e67013\">Who is the Most Connected?<\/a><\/li>\r\n \t<li><a href=\"#d68f2a1c-967e-4e57-bf48-e2dbae7a3d91\">The Breakup<\/a><\/li>\r\n \t<li><a href=\"#6ec94be3-7ec6-4842-9260-ba5950dd4b38\">Predicting the Breakup<\/a><\/li>\r\n \t<li><a href=\"#b08c29ae-5070-41ad-81c6-e57cec9b9b25\">Hierarchical Clustering Approach<\/a><\/li>\r\n \t<li><a href=\"#f5519ba8-f7fc-4baf-8437-ceded4d239ee\">Graph Partitioning Approach<\/a><\/li>\r\n \t<li><a href=\"#05eeeed4-bae9-4872-b18e-cafbdfbecc69\">Summary - how does this compare to the Facebook example?<\/a><\/li>\r\n<\/ul>\r\n<\/div>\r\n<h4>Social network analysis and machine learning<a name=\"8bbe022f-94ba-4732-b770-cb2c1afdad14\"><\/a><\/h4>\r\nWhen it comes to data science, you think of <a href=\"https:\/\/www.mathworks.com\/solutions\/machine-learning.html\">machine learning<\/a> and data mining, but you should consider adding <a href=\"https:\/\/en.wikipedia.org\/wiki\/Social_network_analysis\">social network analysis<\/a> as part of your data science toolkit. Social network analysis offers new ways to capture features from your dataset by mapping and measuring relationships using graphs and networks, and it is complementary to traditional machine learning techniques like <a href=\"https:\/\/www.mathworks.com\/discovery\/cluster-analysis.html\">cluster analsysis<\/a>.\r\n\r\nLet's use an example to show you how it works. This article explains <a href=\"http:\/\/www.mnn.com\/green-tech\/computers\/stories\/how-facebook-knows-when-youll-get-divorced-even-before-you-do\">how Facebook knows when you'll get divorced (even before you do)<\/a>. Unfortunately we don't have access to the dataset Facebook has, but we can use a publicly available dataset to do a similar analysis.\r\n<h4>Zachary's Karate Club Dataset<a name=\"d3378597-c3e3-4b10-9b5c-1b70916c2334\"><\/a><\/h4>\r\n<a href=\"https:\/\/networkdata.ics.uci.edu\/data.php?id=105\">Zachary's Karate Club<\/a> dataset contains a social network of friendships between 34 members of a karate club at a US university in the 1970s. We don't know the details, but <a href=\"https:\/\/books.google.com\/books?id=atfCl2agdi8C&amp;pg=PA64#v=onepage&amp;q&amp;f=false\">a dispute that erupted in this club eventually caused it to break up into two groups<\/a>. Could we have predicted exactly how the club would split by looking at the structure of its social relationships, similar to Facebook?\r\n<h4>Let's load the dataset<a name=\"a549d8fd-6653-403c-959a-112fe7f79c01\"><\/a><\/h4>\r\nYou can download the dataset <a href=\"http:\/\/networkdata.ics.uci.edu\/data\/karate\/karate.paj\">here<\/a> but it is in a unfamiliar non-tabular text format. I created a script <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/loaddata.m\">loaddata.m<\/a> to parse this text file but you can just load the <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/karate.mat\">MAT-file<\/a> instead.\r\n<pre class=\"codeinput\">websave karate2.mat <span class=\"string\">https:\/\/blogs.mathworks.com\/images\/loren\/2015\/karate.mat<\/span>;\r\nload karate2.mat\r\n<\/pre>\r\n<h4>Creating Graph Object<a name=\"a8536cf6-aebf-462e-84bd-f11786cecdf7\"><\/a><\/h4>\r\nIndividual members are represented as nodes, and the edges that connect two nodes represent their friendships. Edges are therefore just a pair of nodes. <tt>edges<\/tt> represents the pairs as rows in a 78 x 2 matrix. <tt>nodes<\/tt> are just an array of numbers that represents the nodes. Friendships are mutual and therefore they don't have direction. We can use an undirected graph to represent this social network. You can create a graph by supplying the lists of nodes that form edges.\r\n<pre class=\"codeinput\">G = graph(edges(:,1), edges(:,2));      <span class=\"comment\">% create a graph from edges<\/span>\r\nG.Nodes = table(name);                  <span class=\"comment\">% name the nodes<\/span>\r\nfigure                                  <span class=\"comment\">% visualize the graph<\/span>\r\nplot(G);\r\ntitle(<span class=\"string\">'Zachary''s Karate Club'<\/span>)\r\n<\/pre>\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup_01.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n<h4>Who is the Most Connected?<a name=\"9cd54df8-ef3a-4ea6-8abb-f52ac5e67013\"><\/a><\/h4>\r\nThe plot shows a typical \"hubs and spokes\" structure with a couple of hubs, which are nodes that have a lot of connections. You are popular if you have a lot of friends, so this represents popularity. One way to quantify popularity is a metric called degree, the number of edges connected to each node. Let's find the degree distribution of this social network.\r\n<pre class=\"codeinput\">D = degree(G);                          <span class=\"comment\">% get degrees per node<\/span>\r\nmu = mean(D);                           <span class=\"comment\">% average degrees<\/span>\r\nfigure\r\nhistogram(D);                           <span class=\"comment\">% plot histogram<\/span>\r\nhold <span class=\"string\">on<\/span>\r\nline([mu, mu],[0, 10], <span class=\"string\">'Color'<\/span>, <span class=\"string\">'r'<\/span>)    <span class=\"comment\">% average degrees line<\/span>\r\ntitle(<span class=\"string\">'Karate Club Degree Distribution'<\/span>)\r\nxlabel(<span class=\"string\">'degrees (# of connections)'<\/span>); ylabel(<span class=\"string\">'# of nodes'<\/span>);\r\ntext(mu + 0.1, 10, sprintf(<span class=\"string\">'Average %.2f Degrees'<\/span>, mu))\r\ntext(14.5, 1.5, <span class=\"string\">'Node 1'<\/span>)\r\ntext(11, 2, <span class=\"string\">'Node 33'<\/span>)\r\ntext(16, 2, <span class=\"string\">'Node 34'<\/span>)\r\n<\/pre>\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup_02.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n\r\nYou see that two people (Nodes 1 and 34) have well above the average number of degrees. Node 33 has many friends but not as many as those two. Let\u2019s color the edges (connections) of those two individuals.\r\n<pre class=\"codeinput\">N1 = neighbors(G, 1);                               <span class=\"comment\">% get 1's friends<\/span>\r\nN1 = [N1; 1];                                       <span class=\"comment\">% add 1<\/span>\r\nN34 = neighbors(G, 34);                             <span class=\"comment\">% get 34's friends<\/span>\r\nN34 = [N34; 34];                                    <span class=\"comment\">% add 34<\/span>\r\n\r\nc1 = [1,0,1];                                       <span class=\"comment\">% 1's group color<\/span>\r\nc2 = [0,1,0.5];                                     <span class=\"comment\">% 34's group color<\/span>\r\n\r\nfigure\r\nP = plot(G);                                        <span class=\"comment\">% plot the graph<\/span>\r\nhighlight(P, N1, <span class=\"string\">'NodeColor'<\/span>, c1, <span class=\"string\">'EdgeColor'<\/span>, c1); <span class=\"comment\">% highlight 1's friends<\/span>\r\nhighlight(P, N34, <span class=\"string\">'NodeColor'<\/span>, c2, <span class=\"string\">'EdgeColor'<\/span>, c2);<span class=\"comment\">% highlight 34's friends<\/span>\r\ntitle(<span class=\"string\">'Friends of Nodes 1 and 34'<\/span>)\r\n<\/pre>\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup_03.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n<h4>The Breakup<a name=\"d68f2a1c-967e-4e57-bf48-e2dbae7a3d91\"><\/a><\/h4>\r\nBy coloring the connections of those two, you can see that the club already had two subgroups based on social relationships. Let's compare the previous visualization with the visualization of the actual breakup based on the information from the linked textbook above.\r\n<pre class=\"codeinput\">G1nodes = [1, 2, 3, 4, 5, 6, 7, 8, 9,<span class=\"keyword\">...<\/span><span class=\"comment\">                % Node 1's group<\/span>\r\n    11, 12, 13, 14, 17, 18, 20, 22];\r\n\r\nfigure\r\nP = plot(G);                                            <span class=\"comment\">% plot the graph<\/span>\r\nhighlight(P, G1nodes, <span class=\"string\">'NodeColor'<\/span>, c1,<span class=\"string\">'EdgeColor'<\/span>, c1); <span class=\"comment\">% highlight group<\/span>\r\ntitle(<span class=\"string\">'The club broke up in two groups'<\/span>)\r\n<\/pre>\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup_04.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n<h4>Predicting the Breakup<a name=\"6ec94be3-7ec6-4842-9260-ba5950dd4b38\"><\/a><\/h4>\r\nNow that we are familiar with this dataset, let's see if there is a way to detect two groups in this dataset. Here is an adjacency matrix of this network. Each row and column are node ids and if two nodes have a relationship, then the intersection becomes 1, otherwise 0. The diagonal elements represents self-referential connections and therefore they are 0 in this social network.\r\n<pre class=\"codeinput\">A = adjacency(G);           <span class=\"comment\">% create adjacency matrix<\/span>\r\nA = full(A);                <span class=\"comment\">% convert the sparse matrix to full matrix<\/span>\r\nfigure\r\nimagesc(A)                  <span class=\"comment\">% plot the matrix<\/span>\r\naxis <span class=\"string\">square<\/span>\r\ntitle(<span class=\"string\">'Adjacency Matrix'<\/span>)\r\n<\/pre>\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup_05.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n<h4>Hierarchical Clustering Approach<a name=\"b08c29ae-5070-41ad-81c6-e57cec9b9b25\"><\/a><\/h4>\r\nIn machine learning, this can be handled as a clustering problem. Let's try Hierarchical Clustering with the number of shared connections as distance metric \u2013 the more connections you share, the closer. Since we have binary data, we will use the Jaccard distance.\r\n<pre class=\"codeinput\">dist = pdist(A, <span class=\"string\">'jaccard'<\/span>); <span class=\"comment\">% use Jaccard distance to score connections<\/span>\r\nZ = linkage(dist);          <span class=\"comment\">% create tree<\/span>\r\nfigure\r\ndendrogram(Z)               <span class=\"comment\">% plot dendrogram<\/span>\r\ntitle(<span class=\"string\">'Dendrogram Plot'<\/span>)\r\nxlabel(<span class=\"string\">'Node Id'<\/span>); ylabel(<span class=\"string\">'Distance'<\/span>)\r\n<\/pre>\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup_06.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n\r\nThis doesn\u2019t look right, because it completely ignores the structure of connections in the network. This is a good example where the traditional machine learning approach alone cannot solve the problem.\r\n<h4>Graph Partitioning Approach<a name=\"f5519ba8-f7fc-4baf-8437-ceded4d239ee\"><\/a><\/h4>\r\nWhat can we use from graph theory to partition the graph? One approach, suggested by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/536326-pat-quillen\">Pat Quillen<\/a>, uses <a href=\"https:\/\/en.wikipedia.org\/wiki\/Algebraic_connectivity\">Algebraic connectivity<\/a> to weight the nodes based on the minimum number of cuts you need to make to create two subsets, and the Fielder Vector is used for this purpose. In a typical use, we want to partition nodes into two groups of equal size. Let's partition the nodes with the median value. The nodes with lower-than-median values are less well connected than those with higher values.\r\n\r\nNow let's use the Fielder Vector to identify the graph partition and compare the result to the actual split.\r\n<pre class=\"codeinput\">L = laplacian(G);                   <span class=\"comment\">% get Laplacian Matrix of G<\/span>\r\n[V, ~] = eig(full(L),<span class=\"string\">'vector'<\/span>);     <span class=\"comment\">% get eigenvectors from L<\/span>\r\nw = V(:,2);                         <span class=\"comment\">% the Fiedler Vector<\/span>\r\nP1nodes = find(w &lt; median(w));      <span class=\"comment\">% select nodes below median value<\/span>\r\nerrors = setdiff(G1nodes, P1nodes)  <span class=\"comment\">% any diff from the actual split?<\/span>\r\n<\/pre>\r\n<pre class=\"codeoutput\">errors =\r\n   Empty matrix: 1-by-0\r\n<\/pre>\r\nYou can see that Fielder Vector did a perfect good job identifying the partition of this graph. If you sort the Adjacency Matrix by Fielder Vector, you can also see this split.\r\n<pre class=\"codeinput\">[~, order] = sort(w);                   <span class=\"comment\">% sort the weights by value<\/span>\r\nsortedA = A(order, order);              <span class=\"comment\">% apply the sort order to A<\/span>\r\n\r\nfigure\r\nimagesc(sortedA)\r\ntitle(<span class=\"string\">'Sorted Adjacency Matrix'<\/span>)\r\n<\/pre>\r\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup_07.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\r\n<h4>Summary - how does this compare to the Facebook example?<a name=\"05eeeed4-bae9-4872-b18e-cafbdfbecc69\"><\/a><\/h4>\r\nIn this case, the Fielder Vector was all we needed to partition the graph. But in more complicated cases you may be able to use it as a distance metric for cluster analysis, replacing the Jaccard distance in the above example. Social network analysis and machine learning should be complementary tools in your data science toolkit.\r\n\r\nHow does this apply to the situation of how Facebook predicts breakups between couples? Think of Nodes 1 and 34 as a couple, and the graph as their Facebook friends. It looks surprisingly similar to the graph in the article cited above, except that it has more than two clusters. If you think about it, you have old classmates, coworkers, family members, and other social activities and they usually form distinct clusters.\r\n\r\nInterestingly, the article says that total number of mutual friends a couple share is not a good indicator of romantic relationships. It is better if their mutual friends are less well connected. Apparently, romantic relationships and karate clubs don't share the same social dynamics.\r\n\r\nTry Release R2015b, and check out this new graph feature and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1227#respond\">here<\/a>.\r\n\r\n<p style=\"text-align: right; font-size: xx-small; font-weight: lighter; font-style: italic; color: gray;\">\r\nPublished with MATLAB\u00ae R2015b<\/p>\r\n\r\n<\/div>\r\n<div class=\"pull-right\"><div class=\"col-xs-12 containing-block\"><a href=\"#\" class=\"btn btn-sm btn_color_blue add_margin_20  hidden-xs try_live_editor_example\" data-liveeditorexample=\"{\n  &quot;repository&quot; : &quot;Blogs&quot;,\n  &quot;id&quot; : &quot;\/loren\/files\/predict_breakup_social_network_analysis.mlx&quot;\n}\"><span class=\"icon-edit icon_16\"><\/span>Run in your browser<span style=\"color:grey\" class=\"series\"><\/span><\/a><\/div><\/div>","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/breakup_07.png\" onError=\"this.style.display ='none';\" \/><\/div>... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/09\/30\/can-we-predict-a-breakup-social-network-analysis-with-matlab\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[62,66,43,6,61,1],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1227"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1227"}],"version-history":[{"count":15,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1227\/revisions"}],"predecessor-version":[{"id":5247,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1227\/revisions\/5247"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1227"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1227"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1227"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}