{"id":1692,"date":"2016-08-08T10:22:40","date_gmt":"2016-08-08T15:22:40","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1692"},"modified":"2016-07-19T12:32:06","modified_gmt":"2016-07-19T17:32:06","slug":"text-mining-machine-learning-research-papers-with-matlab","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2016\/08\/08\/text-mining-machine-learning-research-papers-with-matlab\/","title":{"rendered":"Text Mining Machine Learning Research Papers with MATLAB"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Publish_or_perish\">Publish or perish<\/a>, they say in academia, and you can learn trends in academic research through analysis of published papers. Today's guest blogger, Toshi, came across a dataset of machine learning papers presented at a conference. Let's see what he found!<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/wordle.png\" alt=\"\"> <\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#06d3b687-7278-4dc3-a2f1-f3a0fe81af54\">NIPS 2015 Papers<\/a><\/li><li><a href=\"#25978efc-f868-4a55-a5fb-cb2c95a6fa2c\">Paper Author Affiliation<\/a><\/li><li><a href=\"#279b462f-405d-4517-a1f7-b1d28b816c0e\">Paper Coauthorship<\/a><\/li><li><a href=\"#14f3efc5-4d27-4f21-ba67-dfcedad2a2cc\">Paper Topics<\/a><\/li><li><a href=\"#90065b41-edd7-4516-a798-55951045159d\">Topic Grouping by Principal Component Analysis<\/a><\/li><li><a href=\"#64e19e03-f126-461e-8c08-15c700fa9ea1\">Deep Learning<\/a><\/li><li><a href=\"#b1fe2719-0639-4c94-bee8-0aa9217fb4af\">Core Algorithms<\/a><\/li><li><a href=\"#2eb0b04d-5942-4faa-910e-146086086949\">Commercial Research<\/a><\/li><li><a href=\"#aa78ddfd-7743-4cdd-8886-49dcfbd0755a\">Top 10 Authors in NIPS 2015<\/a><\/li><li><a href=\"#add0e044-2165-43bc-a646-5efe8a00c2e8\">Summary<\/a><\/li><\/ul><\/div><h4>NIPS 2015 Papers<a 
name=\"06d3b687-7278-4dc3-a2f1-f3a0fe81af54\"><\/a><\/h4><p>NIPS (which stands for \"Neural Information Processing Systems\") is an annual conference on machine learning and computational neuroscience, and papers presented there reveal what experts in the field are working on. Conveniently, you can find the data from the 2015 conference on Kaggle's <a href=\"https:\/\/www.kaggle.com\/benhamner\/nips-2015-papers\">NIPS 2015 Papers<\/a> page.<\/p><p>Let's load the data downloaded from Kaggle to the current folder. Kaggle provides an SQLite database file in addition to the usual CSV files, and the SQLite file contains all the data in the CSV files. Since we now have configuration-free SQLite support in <a href=\"https:\/\/www.mathworks.com\/products\/database\/\">Database Toolbox<\/a> in <a href=\"https:\/\/www.mathworks.com\/products\/new_products\/latest_features.html\">R2016a<\/a>, let's give that a try. Once you establish a connection to the database file with <tt><a href=\"https:\/\/www.mathworks.com\/help\/database\/ug\/sqlite.html\">sqlite<\/a><\/tt>, you can use SQL commands like <tt>SELECT * FROM Authors<\/tt>. For more details about SQLite support, read <a href=\"https:\/\/www.mathworks.com\/help\/database\/ug\/working-with-the-matlab-interface-to-sqlite.html\">Working with the MATLAB Interface to SQLite<\/a>. If you don't have Database Toolbox, you can try <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\">readtable<\/a><\/tt> to read CSV files.<\/p><p>I also wrote a script <tt><a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015_parse_html.m\">nips2015_parse_html<\/a><\/tt> in order to parse the HTML file \"accepted_papers.html\" that contains the affiliation of the authors. 
Check it out if you are interested in data wrangling with MATLAB.<\/p><pre class=\"codeinput\">db = <span class=\"string\">'output\/database.sqlite'<\/span>;                              <span class=\"comment\">% database file<\/span>\r\nconn = sqlite(db,<span class=\"string\">'readonly'<\/span>);                               <span class=\"comment\">% create connection<\/span>\r\nAuthors = fetch(conn,<span class=\"string\">'SELECT * FROM Authors'<\/span>);              <span class=\"comment\">% get data with SQL command<\/span>\r\nPapers = fetch(conn,<span class=\"string\">'SELECT * FROM Papers'<\/span>);                <span class=\"comment\">% get data with SQL command<\/span>\r\nPaperAuthors = fetch(conn,<span class=\"string\">'SELECT * FROM PaperAuthors'<\/span>);    <span class=\"comment\">% get data with SQL command<\/span>\r\nclose(conn)                                                 <span class=\"comment\">% close connection<\/span>\r\nAuthors = cell2table(Authors,<span class=\"string\">'VariableNames'<\/span>,{<span class=\"string\">'ID'<\/span>,<span class=\"string\">'Name'<\/span>});<span class=\"comment\">% convert to table<\/span>\r\nPapers = cell2table(Papers,<span class=\"string\">'VariableNames'<\/span>, <span class=\"keyword\">...<\/span><span class=\"comment\">             % convert to table<\/span>\r\n    {<span class=\"string\">'ID'<\/span>,<span class=\"string\">'Title'<\/span>,<span class=\"string\">'EventType'<\/span>,<span class=\"string\">'PdfName'<\/span>,<span class=\"string\">'Abstract'<\/span>,<span class=\"string\">'PaperText'<\/span>});\r\nPaperAuthors = cell2table(PaperAuthors,<span class=\"string\">'VariableNames'<\/span>, <span class=\"keyword\">...<\/span><span class=\"comment\"> % convert to table<\/span>\r\n    {<span class=\"string\">'ID'<\/span>,<span class=\"string\">'PaperID'<\/span>,<span class=\"string\">'AuthorID'<\/span>});\r\nhtml = fileread(<span class=\"string\">'output\/accepted_papers.html'<\/span>);             
<span class=\"comment\">% load text from html<\/span>\r\nnips2015_parse_html                                         <span class=\"comment\">% parse html text<\/span>\r\n<\/pre><h4>Paper Author Affiliation<a name=\"25978efc-f868-4a55-a5fb-cb2c95a6fa2c\"><\/a><\/h4><p>We can visualize which organizations the authors of accepted papers belong to using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/graph-and-network-algorithms.html\">graphs<\/a>. Let's create a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/digraph.html\">directed graph<\/a> with authors and their affiliation as nodes. We limit the plot to organizations with 10 or more authors, but some smaller organizations are also included if authors have multiple affiliations that include smaller organizations. The top 20 organizations, in terms of number of affiliated authors, are colored in bright orange while others are colored in a yellowish orange. Small blue dots are individual authors. This is just based on papers from one conference in 2015, so the top ranking organizations may be different from year to year.<\/p><pre class=\"codeinput\">T = AcceptedPapers(:,{<span class=\"string\">'Name'<\/span>,<span class=\"string\">'Org'<\/span>});                       <span class=\"comment\">% subset table<\/span>\r\n[T, ~, idx] = unique(T,<span class=\"string\">'rows'<\/span>);                             <span class=\"comment\">% remove duplicates<\/span>\r\nauth = T.(1);                                               <span class=\"comment\">% authors<\/span>\r\norg = cellstr(T.(2));                                       <span class=\"comment\">% organizations<\/span>\r\nw = accumarray(idx, 1);                                     <span class=\"comment\">% count of papers<\/span>\r\nG = digraph(auth,org,w);                                    <span class=\"comment\">% create directed graph<\/span>\r\nG.Nodes.Degree = indegree(G);                               <span class=\"comment\">% add 
indegree<\/span>\r\nbins = conncomp(G,<span class=\"string\">'OutputForm'<\/span>,<span class=\"string\">'cell'<\/span>,<span class=\"string\">'Type'<\/span>,<span class=\"string\">'weak'<\/span>);       <span class=\"comment\">% get connected components<\/span>\r\nbinsizes = cellfun(@length,bins);                           <span class=\"comment\">% get bin sizes<\/span>\r\nsmall = bins(binsizes &lt; 10);                                <span class=\"comment\">% if bin has less than 10<\/span>\r\nsmall = unique([small{:}]);                                 <span class=\"comment\">% it is small<\/span>\r\nG = rmnode(G, small);                                       <span class=\"comment\">% remove nodes in the bin<\/span>\r\norg = G.Nodes.Name(ismember(G.Nodes.Name,org));             <span class=\"comment\">% get org nodes<\/span>\r\ndeg = G.Nodes.Degree(ismember(G.Nodes.Name,org));           <span class=\"comment\">% get org node indegrees<\/span>\r\n[~, ranking] = sort(deg,<span class=\"string\">'descend'<\/span>);                         <span class=\"comment\">% rank by indegrees<\/span>\r\ntopN = org(ranking(1:20));                                  <span class=\"comment\">% select top 20<\/span>\r\nothers = org(~ismember(org,topN));                          <span class=\"comment\">% select others<\/span>\r\nmarkersize = log(G.Nodes.Degree + 2)*3;                     <span class=\"comment\">% indeg for marker size<\/span>\r\nlinewidth = 5*G.Edges.Weight\/max(G.Edges.Weight);           <span class=\"comment\">% weight for line width<\/span>\r\nfigure                                                      <span class=\"comment\">% create new figure<\/span>\r\nh = plot(G,<span class=\"string\">'MarkerSize'<\/span>,markersize,<span class=\"string\">'LineWidth'<\/span>,linewidth,<span class=\"string\">'EdgeAlpha'<\/span>,0.3); <span class=\"comment\">% plot graph<\/span>\r\nhighlight(h, topN,<span class=\"string\">'NodeColor'<\/span>,[.85 .33 .1])                 <span 
class=\"comment\">% highlight top 20 nodes<\/span>\r\nhighlight(h, others,<span class=\"string\">'NodeColor'<\/span>,[.93 .69 .13])              <span class=\"comment\">% highlight others<\/span>\r\nlabelnode(h,org,org)                                        <span class=\"comment\">% label nodes<\/span>\r\ntitle({<span class=\"string\">'NIPS 2015 Paper Author Affiliation'<\/span>;<span class=\"string\">'with 10 or more authors'<\/span>}) <span class=\"comment\">% add title<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_01.png\" alt=\"\"> <h4>Paper Coauthorship<a name=\"279b462f-405d-4517-a1f7-b1d28b816c0e\"><\/a><\/h4><p>Coauthors of a paper may come from different organizations, and this gives us an opportunity to see the relationship among those organizations. Let's create a directed graph with papers and their authors' organizations as nodes. We limit the plot to clusters with 5 or more nodes by separating the graph into connected components with <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/graph.conncomp.html\">conncomp<\/a><\/tt>. The same top 20 organizations in terms of number of affiliated authors are again colored in bright orange while others are colored in a yellowish orange. Interestingly, the plot shows that all orange dots are located in the same cluster - so all top 20 organizations belong to a network of coauthors. Again, this is based on papers from a single conference, so the results may be very different for other years. 
If we track paper coauthorship over multiple years, we may find some hidden connections that we don't see here.<\/p><pre class=\"codeinput\">T = AcceptedPapers(:,{<span class=\"string\">'Org'<\/span>,<span class=\"string\">'Title'<\/span>});                      <span class=\"comment\">% subset table<\/span>\r\n[T, ~, idx] = unique(T,<span class=\"string\">'rows'<\/span>);                             <span class=\"comment\">% remove duplicates<\/span>\r\norg = cellstr(T.(1));                                       <span class=\"comment\">% organizations<\/span>\r\npaper = T.(2);                                              <span class=\"comment\">% papers<\/span>\r\nw = accumarray(idx, 1);                                     <span class=\"comment\">% count of papers<\/span>\r\nG = digraph(paper,org,w);                                   <span class=\"comment\">% create directed graph<\/span>\r\nG.Nodes.Degree = indegree(G);                               <span class=\"comment\">% add indegree<\/span>\r\nbins = conncomp(G,<span class=\"string\">'OutputForm'<\/span>,<span class=\"string\">'cell'<\/span>,<span class=\"string\">'Type'<\/span>,<span class=\"string\">'weak'<\/span>);       <span class=\"comment\">% get connected components<\/span>\r\nbinsizes = cellfun(@length,bins);                           <span class=\"comment\">% get bin sizes<\/span>\r\nsmall = bins(binsizes &lt; 5);                                 <span class=\"comment\">% if bin has less than 5<\/span>\r\nsmall = unique([small{:}]);                                 <span class=\"comment\">% it is small<\/span>\r\nG = rmnode(G, small);                                       <span class=\"comment\">% remove nodes in the bin<\/span>\r\norg = G.Nodes.Name(ismember(G.Nodes.Name,org));             <span class=\"comment\">% get org nodes<\/span>\r\n[~,maxBinIdx] = max(binsizes);                              <span class=\"comment\">% index of largest component<\/span>\r\ntopDocs = 
setdiff(bins{maxBinIdx},org);                     <span class=\"comment\">% get docs in largest component<\/span>\r\nisTopDoc = ismember(AcceptedPapers.Title,topDocs);          <span class=\"comment\">% get indices of those docs<\/span>\r\ntopDocIds = unique(AcceptedPapers.PaperID(isTopDoc));       <span class=\"comment\">% get the paper ids of those docs<\/span>\r\nisTopDoc = ismember(Papers.ID,topDocIds);                   <span class=\"comment\">% get indices of those docs<\/span>\r\nmarkersize = log(G.Nodes.Degree + 2)*3;                     <span class=\"comment\">% indeg for marker size<\/span>\r\nlinewidth = 10*G.Edges.Weight\/max(G.Edges.Weight);          <span class=\"comment\">% weight for line width<\/span>\r\nfigure                                                      <span class=\"comment\">% create new figure<\/span>\r\nh = plot(G,<span class=\"string\">'MarkerSize'<\/span>,markersize,<span class=\"string\">'LineWidth'<\/span>,linewidth,<span class=\"string\">'EdgeAlpha'<\/span>,0.3); <span class=\"comment\">% plot graph<\/span>\r\nhighlight(h, topN,<span class=\"string\">'NodeColor'<\/span>,[.85 .33 .1])                 <span class=\"comment\">% highlight top 20 nodes<\/span>\r\nothers = org(~ismember(org,topN));                          <span class=\"comment\">% select others<\/span>\r\nhighlight(h, others,<span class=\"string\">'NodeColor'<\/span>,[.93 .69 .13])              <span class=\"comment\">% highlight others<\/span>\r\nlabelnode(h,topN(1),<span class=\"string\">'Top 20'<\/span>)                               <span class=\"comment\">% label nodes<\/span>\r\nlabelnode(h,others([1,4,18,29,72,92,96,99]),<span class=\"string\">'Others'<\/span>)       <span class=\"comment\">% label nodes<\/span>\r\ntitle({<span class=\"string\">'NIPS 2015 Paper Coauthorship By Affiliation'<\/span>;<span class=\"string\">'with 5 or more nodes'<\/span>}) <span class=\"comment\">% add title<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" 
src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_02.png\" alt=\"\"> <h4>Paper Topics<a name=\"14f3efc5-4d27-4f21-ba67-dfcedad2a2cc\"><\/a><\/h4><p>To find the topics of the papers, I quickly went through the titles of the accepted papers and chose 35 words that jumped out at me - see <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015_topics.xlsx\">nips2015_topics.xlsx<\/a>. If you check the word cloud at the top, you see some of those words. Obviously, this is a quick-and-dirty approach but I just wanted to get a quick sense of the popular topics for now. If you are interested in a more rigorous way to do it, please check out my earlier post <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\">Can You Find Love through Text Analytics?<\/a><\/p><p>The <tt>Papers<\/tt> table contains <tt>Title<\/tt>, <tt>Abstract<\/tt> and <tt>PaperText<\/tt> columns. Which should we use to analyze the paper topics? Titles tend to be short. Abstracts are longer, and are more likely to contain key phrases that represent the paper topic because abstracts are, by definition, a high level overview of the content of the paper. The full text of the paper naturally covers more details, and that may obscure the main topics of the paper. Let's use <tt>Abstract<\/tt> to generate our word count.<\/p><p>Then let's compare the relative frequency of those terms between the docs in the largest connected component we saw earlier and other docs. You can see some differences between those two groups. 
For example, a lot of papers discuss 'image', but papers in the largest connected component, which includes the top 20 organizations, talk about it less frequently, and they talk more about 'graph'.<\/p><pre class=\"codeinput\">Topics = readtable(<span class=\"string\">'nips2015_topics.xlsx'<\/span>);                 <span class=\"comment\">% load preselected topics<\/span>\r\nDTM = zeros(height(Papers),height(Topics));                 <span class=\"comment\">% document term matrix<\/span>\r\n<span class=\"keyword\">for<\/span> i = 1:height(Topics)                                    <span class=\"comment\">% loop over topics<\/span>\r\n    DTM(:,i) = cellfun(@length, <span class=\"keyword\">...<\/span><span class=\"comment\">                         % get number of matches<\/span>\r\n        regexpi(Papers.Abstract,Topics.Regex{i}));          <span class=\"comment\">% find the word in abstract<\/span>\r\n<span class=\"keyword\">end<\/span>\r\ntopDocTopics = sum(DTM(isTopDoc,:));                        <span class=\"comment\">% word count in largest component<\/span>\r\ntopDocTopics = topDocTopics .\/ sum(topDocTopics) *100;      <span class=\"comment\">% convert it into relative percentage<\/span>\r\notherDocTopics = sum(DTM(~isTopDoc,:));                     <span class=\"comment\">% word count in others<\/span>\r\notherDocTopics = otherDocTopics .\/ sum(otherDocTopics) *100;<span class=\"comment\">% convert it into relative percentage<\/span>\r\nfigure                                                      <span class=\"comment\">% create new figure<\/span>\r\nbar([topDocTopics; otherDocTopics]')                        <span class=\"comment\">% bar chart<\/span>\r\nax = gca;                                                   <span class=\"comment\">% get current axes handle<\/span>\r\nax.XTick = 1:height(Topics);                                <span class=\"comment\">% set X-axis tick<\/span>\r\nax.XTickLabel = Topics.Keyword;                             <span class=\"comment\">% set X-axis tick 
label<\/span>\r\nax.XTickLabelRotation = 90;                                 <span class=\"comment\">% rotate X-axis tick label<\/span>\r\ntitle(<span class=\"string\">'Relative Term Frequency by Document Groups'<\/span>)         <span class=\"comment\">% add title<\/span>\r\nlegend(<span class=\"string\">'Docs in the Largest Cluster'<\/span>,<span class=\"string\">'Other Docs'<\/span>)          <span class=\"comment\">% add legend<\/span>\r\nxlim([0 height(Topics) + 1])                                <span class=\"comment\">% set x-axis limits<\/span>\r\nylabel(<span class=\"string\">'Percentage'<\/span>)                                        <span class=\"comment\">% add y-axis label<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_03.png\" alt=\"\"> <h4>Topic Grouping by Principal Component Analysis<a name=\"90065b41-edd7-4516-a798-55951045159d\"><\/a><\/h4><p>Let's visualize the relationship between topics using <a href=\"https:\/\/www.mathworks.com\/help\/stats\/principal-component-analysis-pca.html\">Principal Component Analysis<\/a>. The resulting <tt><a href=\"https:\/\/www.mathworks.com\/help\/stats\/biplot.html\">biplot<\/a><\/tt> of the first and second components shows roughly three clusters of related topics. 
Topics popular in the largest connected component are highlighted in orange and they seem to span all three clusters.<\/p><p>The purple cluster is dominated by topics favored by the largest connected component and focuses on topics like <a href=\"https:\/\/en.wikipedia.org\/wiki\/Markov_chain_Monte_Carlo\">Markov Chain Monte Carlo<\/a> (MCMC), <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bayesian_statistics\">Bayesian Statistics<\/a> and Stochastic Gradient MCMC.<\/p><p>The blue cluster seems to focus on the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Multi-armed_bandit\">multi-armed bandits<\/a> problem which is related to a field of machine learning called <a href=\"https:\/\/en.wikipedia.org\/wiki\/Reinforcement_learning\">Reinforcement Learning<\/a>. Topics like 'market' and 'risk' are highlighted in orange, indicating that papers on these topics from the top 20 organizations probably focused on financial applications.<\/p><pre class=\"codeinput\">w = 1 .\/ var(DTM);                                          <span class=\"comment\">% inverse variable variances<\/span>\r\n[wcoeff, score, latent, tsquared, explained] = <span class=\"keyword\">...<\/span><span class=\"comment\">          % weighted PCA with w<\/span>\r\n    pca(DTM, <span class=\"string\">'VariableWeights'<\/span>, w);\r\ncoefforth = diag(sqrt(w)) * wcoeff;                         <span class=\"comment\">% turn wcoeff to orthonormal<\/span>\r\nlabels = Topics.Keyword;                                    <span class=\"comment\">% Topics as labels<\/span>\r\ntopT = Topics.Keyword((topDocTopics - otherDocTopics) &gt; 1); <span class=\"comment\">% topics popular in top cluster<\/span>\r\nfigure                                                      <span class=\"comment\">% new figure<\/span>\r\nbiplot(coefforth(:,1:2), <span class=\"string\">'Scores'<\/span>, score(:,1:2), <span class=\"keyword\">...<\/span><span class=\"comment\">        % 2D biplot with the first two comps<\/span>\r\n    <span 
class=\"string\">'VarLabels'<\/span>, labels)\r\ntitle(<span class=\"string\">'Principal Components Analysis of Paper Topics'<\/span>)      <span class=\"comment\">% add title<\/span>\r\n<span class=\"keyword\">for<\/span> i =  1:length(topT)                                     <span class=\"comment\">% loop over popular topics<\/span>\r\n    htext = findobj(gca,<span class=\"string\">'String'<\/span>,topT{i});                  <span class=\"comment\">% find text object<\/span>\r\n    htext.Color = [.85 .33 .1];                             <span class=\"comment\">% highlight by color<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nrectangle(<span class=\"string\">'Position'<\/span>,[.05 -.1 .5 .3],<span class=\"string\">'Curvature'<\/span>,1, <span class=\"keyword\">...<\/span><span class=\"comment\">     % add rectangle<\/span>\r\n    <span class=\"string\">'EdgeColor'<\/span>,[0 .5 0])\r\nrectangle(<span class=\"string\">'Position'<\/span>,[-.23 .05 .25 .55],<span class=\"string\">'Curvature'<\/span>,1, <span class=\"keyword\">...<\/span><span class=\"comment\">  % add rectangle<\/span>\r\n    <span class=\"string\">'EdgeColor'<\/span>,[.6 .1 .5])\r\nrectangle(<span class=\"string\">'Position'<\/span>,[-.35 -.4 .34 .42],<span class=\"string\">'Curvature'<\/span>,1, <span class=\"keyword\">...<\/span><span class=\"comment\">  % add rectangle<\/span>\r\n    <span class=\"string\">'EdgeColor'<\/span>,[.1 .2 .6])\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_04.png\" alt=\"\"> <h4>Deep Learning<a name=\"64e19e03-f126-461e-8c08-15c700fa9ea1\"><\/a><\/h4><p>The green cluster is very busy and hard to read. Let's zoom into the green cluster to see more details using <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/axis.html\">axis<\/a><\/tt>. <a href=\"https:\/\/www.mathworks.com\/discovery\/deep-learning.html\">Deep learning<\/a> seems to be the main topic of this cluster. 
CNN (<a title=\"https:\/\/www.mathworks.com\/help\/nnet\/convolutional-neural-networks.html (link no longer works)\">Convolutional Neural Networks<\/a>) is a deep learning algorithm often used for image classification. It makes sense that it is close to the 'image' topic in the biplot. RNN (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Recurrent_neural_network\">Recurrent Neural Networks<\/a>) tends to be used in Natural Language Processing and it appears close to 'text'. <a href=\"https:\/\/www.mathworks.com\/help\/nnet\/autoencoders.html\">Autoencoders<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Long_short-term_memory\">LSTM<\/a> (Long Short-Term Memory) are also deep learning algorithms. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Maximum_a_posteriori_estimation\">MAP<\/a> (Maximum A Posteriori) and 'deep neural networks' are the only topics popular in the top 20 organizations that are found in this cluster. Because we are just comparing the relative frequency of word occurrence, if a lot of papers talk about deep learning related topics, then there are barely any significant frequency differences between the top 20 and others, and those words won't be highlighted.<\/p><pre class=\"codeinput\">axis([-0.1 0.5 -0.1 0.2]);                                  <span class=\"comment\">% define axis limits<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_05.png\" alt=\"\"> <h4>Core Algorithms<a name=\"b1fe2719-0639-4c94-bee8-0aa9217fb4af\"><\/a><\/h4><p>The topics found at the center of the biplot are related to more established core machine learning techniques such as <a href=\"https:\/\/www.mathworks.com\/discovery\/support-vector-machine.html\">Support Vector Machines (SVM<\/a>), <a href=\"https:\/\/www.mathworks.com\/help\/stats\/principal-component-analysis-pca.html\">Principal Component Analysis (PCA)<\/a>, <a 
href=\"https:\/\/www.mathworks.com\/help\/stats\/hidden-markov-models.html\">Hidden Markov Models (HMM)<\/a> or <a href=\"https:\/\/www.mathworks.com\/help\/stats\/lasso-regularization.html\">Least Absolute Shrinkage And Selection Operator (LASSO)<\/a>. Papers from the top 20 organizations seem to be interested in tensors and multi-class classification problems, along with graphs and <a href=\"https:\/\/www.mathworks.com\/help\/stats\/gaussian-process-regression.html\">Gaussian Process<\/a>.<\/p><pre class=\"codeinput\">axis([-0.06 0.06 -0.06 0.06]);                                  <span class=\"comment\">% define axis limits<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_06.png\" alt=\"\"> <h4>Commercial Research<a name=\"2eb0b04d-5942-4faa-910e-146086086949\"><\/a><\/h4><p>The top 20 organizations include some commercial entities such as Google, IBM and Microsoft. The topics of their research papers probably reflect commercial interests in the field of machine learning. We can use the same biplot and highlight the topics that frequently appear in the papers affiliated with them. The plot shows that the three companies tend to cover different topics while they all engage in some deep learning related research. 
You can also see that Google tends to cover multiple fields while IBM and Microsoft seem to have a narrower focus.<\/p><pre class=\"codeinput\">isGoogler = AcceptedPapers.Org == <span class=\"string\">'Google'<\/span>;                 <span class=\"comment\">% find indices of Google authors<\/span>\r\nGooglePaperIds = unique(AcceptedPapers.PaperID(isGoogler)); <span class=\"comment\">% find their paper ids<\/span>\r\nisGooglePaper = ismember(Papers.ID,GooglePaperIds);         <span class=\"comment\">% get the paper indices<\/span>\r\nGoogleTopics = sum(DTM(isGooglePaper,:));                   <span class=\"comment\">% sum Google rows<\/span>\r\nGoogleTopics = GoogleTopics .\/ sum(GoogleTopics) *100;      <span class=\"comment\">% convert it into relative percentage<\/span>\r\nisIBMer = AcceptedPapers.Org == <span class=\"string\">'IBM'<\/span>;                      <span class=\"comment\">% find indices of IBM authors<\/span>\r\nIBMPaperIds = unique(AcceptedPapers.PaperID(isIBMer));      <span class=\"comment\">% find their paper ids<\/span>\r\nisIBMPaper = ismember(Papers.ID,IBMPaperIds);               <span class=\"comment\">% get the paper indices<\/span>\r\nIBMTopics = sum(DTM(isIBMPaper,:));                         <span class=\"comment\">% sum IBM rows<\/span>\r\nIBMTopics = IBMTopics .\/ sum(IBMTopics) *100;               <span class=\"comment\">% convert it into relative percentage<\/span>\r\nisMSofter = AcceptedPapers.Org == <span class=\"string\">'Microsoft'<\/span>;              <span class=\"comment\">% find indices of Microsoft authors<\/span>\r\nMSPaperIds = unique(AcceptedPapers.PaperID(isMSofter));     <span class=\"comment\">% find their paper ids<\/span>\r\nisMSPaper = ismember(Papers.ID,MSPaperIds);                 <span class=\"comment\">% get the paper indices<\/span>\r\nMSTopics = sum(DTM(isMSPaper,:));                           <span class=\"comment\">% sum Microsoft rows<\/span>\r\nMSTopics = MSTopics .\/ sum(MSTopics) *100;                  
<span class=\"comment\">% convert it into relative percentage<\/span>\r\ncommercialTopics = [GoogleTopics; IBMTopics; MSTopics];     <span class=\"comment\">% combine all<\/span>\r\nfigure                                                      <span class=\"comment\">% new figure<\/span>\r\nbiplot(coefforth(:,1:2), <span class=\"string\">'Scores'<\/span>, score(:,1:2), <span class=\"keyword\">...<\/span><span class=\"comment\">        % 2D biplot with the first two comps<\/span>\r\n    <span class=\"string\">'VarLabels'<\/span>, labels)\r\nhline = findobj(gca,<span class=\"string\">'LineStyle'<\/span>,<span class=\"string\">'none'<\/span>);                    <span class=\"comment\">% get line handles of observations<\/span>\r\n<span class=\"keyword\">for<\/span> i = 1:length(hline)                                     <span class=\"comment\">% loop over observations<\/span>\r\n    hline(i).Visible = <span class=\"string\">'off'<\/span>;                               <span class=\"comment\">% make it invisible<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nhtext = findobj(gca,<span class=\"string\">'Type'<\/span>,<span class=\"string\">'text'<\/span>);                         <span class=\"comment\">% get text handles<\/span>\r\ntcolor = [0 .5 0;.85 .33 .1; .1 .2 .6];                     <span class=\"comment\">% define text color<\/span>\r\n<span class=\"keyword\">for<\/span> i = 1:length(htext)                                     <span class=\"comment\">% loop over text<\/span>\r\n   r = commercialTopics(:,strcmp(labels,htext(i).String));  <span class=\"comment\">% get ratios<\/span>\r\n   <span class=\"keyword\">if<\/span> sum(r) == 0                                           <span class=\"comment\">% if all rows are zero<\/span>\r\n       htext(i).Visible = <span class=\"string\">'off'<\/span>;                            <span class=\"comment\">% make it invisible<\/span>\r\n   <span class=\"keyword\">else<\/span>                                                 
    <span class=\"comment\">% otherwise<\/span>\r\n       [~,idx] = max(r);                                    <span class=\"comment\">% get max row<\/span>\r\n       htext(i).Color = tcolor(idx,:);                      <span class=\"comment\">% use matching color<\/span>\r\n   <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\ntext(-.4,.3,<span class=\"string\">'Google'<\/span>,<span class=\"string\">'Color'<\/span>,tcolor(1,:),<span class=\"string\">'FontSize'<\/span>,14)     <span class=\"comment\">% annotate<\/span>\r\ntext(.3,-.1,<span class=\"string\">'IBM'<\/span>,<span class=\"string\">'Color'<\/span>,tcolor(2,:),<span class=\"string\">'FontSize'<\/span>,14)        <span class=\"comment\">% annotate<\/span>\r\ntext(-.4,-.2,<span class=\"string\">'Microsoft'<\/span>,<span class=\"string\">'Color'<\/span>,tcolor(3,:),<span class=\"string\">'FontSize'<\/span>,14) <span class=\"comment\">% annotate<\/span>\r\ntitle({<span class=\"string\">'Principal Components Analysis of Paper Topics'<\/span>;     <span class=\"comment\">% add title<\/span>\r\n    <span class=\"string\">'highlighting Google, IBM and Microsoft topics'<\/span>})\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_07.png\" alt=\"\"> <h4>Top 10 Authors in NIPS 2015<a name=\"aa78ddfd-7743-4cdd-8886-49dcfbd0755a\"><\/a><\/h4><p>Some authors got multiple papers accepted by NIPS. Are there any specific topics that give them an edge? Let's take a look at the top 10 authors in terms of number of accepted papers and see what topics come up in those papers. 
It turns out the topics of the top 10 authors don't belong to specific clusters, and the vectors of those topics are shorter, meaning they are neither uncommon nor as frequently discussed as topics such as 'bandits', 'CNN' or 'MCMC'.<\/p><pre class=\"codeinput\">[auth_ids,~,idx] = unique(AcceptedPapers.ID);               <span class=\"comment\">% get author ids<\/span>\r\ncount = accumarray(idx,1);                                  <span class=\"comment\">% get count<\/span>\r\n[~,ranking] = sort(count,<span class=\"string\">'descend'<\/span>);                        <span class=\"comment\">% get ranking<\/span>\r\ntop10_ids = auth_ids(ranking(1:10));                        <span class=\"comment\">% get top 10 ids<\/span>\r\nisTop10 = ismember(AcceptedPapers.ID,top10_ids);            <span class=\"comment\">% get row indices<\/span>\r\ntop10_paper_ids = unique(AcceptedPapers.PaperID(isTop10));  <span class=\"comment\">% get top 10 paper ids<\/span>\r\nisTop10paper = ismember(Papers.ID,top10_paper_ids);         <span class=\"comment\">% get row indices<\/span>\r\ntop10Topics = sum(DTM(isTop10paper ,:));                    <span class=\"comment\">% sum top 10 rows<\/span>\r\ntop10Topics = top10Topics .\/ sum(top10Topics) *100;         <span class=\"comment\">% convert it into relative percentage<\/span>\r\nnotTop10Topics = sum(DTM(~isTop10paper,:));                 <span class=\"comment\">% word count in others<\/span>\r\nnotTop10Topics = notTop10Topics .\/ sum(notTop10Topics) *100;<span class=\"comment\">% convert it into relative percentage<\/span>\r\ncombined = [top10Topics;notTop10Topics];                    <span class=\"comment\">% combine all<\/span>\r\n[isTop10Author,order] = ismember(Authors.ID,top10_ids);     <span class=\"comment\">% get indices of top 10 authors<\/span>\r\n[~,order] = sort(order(isTop10Author));                     <span class=\"comment\">% get ranking<\/span>\r\nnames = Authors.Name(isTop10Author);                        
<span class=\"comment\">% get names<\/span>\r\nfigure                                                      <span class=\"comment\">% create new figure<\/span>\r\nbiplot(coefforth(:,1:2), <span class=\"string\">'Scores'<\/span>, score(:,1:2), <span class=\"keyword\">...<\/span><span class=\"comment\">        % 2D biplot with the first two comps<\/span>\r\n    <span class=\"string\">'VarLabels'<\/span>, labels)\r\nhline = findobj(gca,<span class=\"string\">'LineStyle'<\/span>,<span class=\"string\">'none'<\/span>);                    <span class=\"comment\">% get line handles of observations<\/span>\r\n<span class=\"keyword\">for<\/span> i = 1:length(hline)                                     <span class=\"comment\">% loop over observatoins<\/span>\r\n    hline(i).Visible = <span class=\"string\">'off'<\/span>;                               <span class=\"comment\">% make it invisible<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nhtext = findobj(gca,<span class=\"string\">'Type'<\/span>,<span class=\"string\">'text'<\/span>);                         <span class=\"comment\">% get text handles<\/span>\r\ntcolor = [.85 .33 .1; .1 .2 .6];                            <span class=\"comment\">% define text color<\/span>\r\n<span class=\"keyword\">for<\/span> i = 1:length(htext)                                     <span class=\"comment\">% loop over text<\/span>\r\n   r = combined(:,strcmp(labels,htext(i).String));          <span class=\"comment\">% get ratios<\/span>\r\n   [~,idx] = max(r);                                        <span class=\"comment\">% get max row<\/span>\r\n   <span class=\"keyword\">if<\/span> idx == 1 &amp;&amp; r(1) &gt; 3                                  <span class=\"comment\">% if row 1 &amp; r &gt; 3<\/span>\r\n       htext(i).Color = [.85 .33 .1];                       <span class=\"comment\">% highlight text<\/span>\r\n   <span class=\"keyword\">else<\/span>\r\n        htext(i).Color = [.6 .6 .6];                        <span 
class=\"comment\">% ghost text<\/span>\r\n   <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\ntitle({<span class=\"string\">'Principal Components Analysis of Paper Topics'<\/span>;     <span class=\"comment\">% add title<\/span>\r\n    <span class=\"string\">'highlighting topics by top 10 authors'<\/span>})\r\ntext(-.5,.3,<span class=\"string\">'Top 10 '<\/span>,<span class=\"string\">'FontWeight'<\/span>,<span class=\"string\">'bold'<\/span>)                  <span class=\"comment\">% annotate<\/span>\r\ntext(-.5,0,names(order))                                    <span class=\"comment\">% annotate<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_08.png\" alt=\"\"> <h4>Summary<a name=\"add0e044-2165-43bc-a646-5efe8a00c2e8\"><\/a><\/h4><p>This is a fairly simple, quick exploration of the dataset, but we got some interesting insights about the current state of machine learning research presented at NIPS 2015. Maybe you can find even more if you dig deeper. Perhaps you can use the technique in the post <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\">Can You Find Love through Text Analytics?<\/a> to cluster similar papers. You can use word-based tokens, but you may want to use an n-gram approach described in another post, <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/09\/09\/text-mining-shakespeare-with-matlab\/\">Text Mining Shakespeare with MATLAB<\/a>. 
Give it a try and share your findings <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1692#respond\">here<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_e9ed5c7f2df44be9b78ff6e976a2c963() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='e9ed5c7f2df44be9b78ff6e976a2c963 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' e9ed5c7f2df44be9b78ff6e976a2c963';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2016 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_e9ed5c7f2df44be9b78ff6e976a2c963()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2016a<br><\/p><\/div><!--\r\ne9ed5c7f2df44be9b78ff6e976a2c963 ##### SOURCE BEGIN #####\r\n%% Text Mining Machine Learning Research Papers with MATLAB\r\n% <https:\/\/en.wikipedia.org\/wiki\/Publish_or_perish Publish or perish>, they \r\n% say in academia, and you can learn trends in academic research through analysis \r\n% of published papers. Today's guest blogger, Toshi, came across a dataset of \r\n% machine learning papers presented in a conference. Let's see what he found! \r\n% \r\n% <<wordle.png>>\r\n%  \r\n%% NIPS 2015 Papers\r\n% NIPS (which stands for \"Neural Information Processing Systems\") is an annual \r\n% conference on machine learning and computational neuroscience, and papers presented \r\n% there reveal what experts in the field are working on. 
Conveniently, you can \r\n% find the data from the 2015 conference on Kaggle's <https:\/\/www.kaggle.com\/benhamner\/nips-2015-papers \r\n% NIPS 2015 Papers> page. \r\n% \r\n% Let's load the data downloaded from Kaggle to the current folder. Kaggle \r\n% provides an SQLite database file in addition to the usual CSV files and the SQLite \r\n% file contains all the data in CSV files. Since we now have configuration-free \r\n% SQLite support in <https:\/\/www.mathworks.com\/products\/database\/ Database Toolbox> \r\n% in <https:\/\/www.mathworks.com\/products\/new_products\/latest_features.html R2016a>, \r\n% let's give that a try. Once you establish a connection to the database file with \r\n% |<https:\/\/www.mathworks.com\/help\/database\/ug\/sqlite.html sqlite>|,  you can use \r\n% SQL commands like '|SELECT * FROM Authors|'. For more details about SQLite support, \r\n% read <https:\/\/www.mathworks.com\/help\/database\/ug\/working-with-the-matlab-interface-to-sqlite.html \r\n% Working with the MATLAB Interface to SQLite>. If you don't have Database Toolbox, \r\n% you can try |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html readtable>| \r\n% to read CSV files.\r\n% \r\n% I also wrote a script |<https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015_parse_html.m \r\n% nips2015_parse_html>| in order to parse the HTML file \"accepted_papers.html\" \r\n% that contains the affiliation of the authors. Check it out if you are interested \r\n% in data wrangling with MATLAB. 
\r\n\r\ndb = 'output\/database.sqlite';                              % database file\r\nconn = sqlite(db,'readonly');                               % create connection\r\nAuthors = fetch(conn,'SELECT * FROM Authors');              % get data with SQL command\r\nPapers = fetch(conn,'SELECT * FROM Papers');                % get data with SQL command\r\nPaperAuthors = fetch(conn,'SELECT * FROM PaperAuthors');    % get data with SQL command\r\nclose(conn)                                                 % close connection\r\nAuthors = cell2table(Authors,'VariableNames',{'ID','Name'});% convert to table\r\nPapers = cell2table(Papers,'VariableNames', ...             % convert to table\r\n    {'ID','Title','EventType','PdfName','Abstract','PaperText'});\r\nPaperAuthors = cell2table(PaperAuthors,'VariableNames', ... % convert to table\r\n    {'ID','PaperID','AuthorID'});\r\nhtml = fileread('output\/accepted_papers.html');             % load text from html\r\nnips2015_parse_html                                         % parse html text\r\n%% Paper Author Affiliation\r\n% We can visualize which organization the authors of accepted papers belong \r\n% to using <https:\/\/www.mathworks.com\/help\/matlab\/graph-and-network-algorithms.html \r\n% graphs>. Let's create a <https:\/\/www.mathworks.com\/help\/matlab\/ref\/digraph.html \r\n% directed graph> with authors and their affiliation as nodes. We limit the plot \r\n% to organizations with 10 or more authors, but some smaller organizations are \r\n% also included if authors have multiple affiliations that include smaller organizations. \r\n% The top 20 organizations, in terms of number of affiliated authors, are colored \r\n% in bright orange while others are colored in a yellowish orange. Small blue \r\n% dots are individual authors. This is just based on papers from one conference \r\n% in 2015, so the top ranking organizations may be different from year to year. 
\r\n\r\nT = AcceptedPapers(:,{'Name','Org'});                       % subset table\r\n[T, ~, idx] = unique(T,'rows');                             % remove duplicates\r\nauth = T.(1);                                               % authors\r\norg = cellstr(T.(2));                                       % organizations\r\nw = accumarray(idx, 1);                                     % count of papers\r\nG = digraph(auth,org,w);                                    % create directed graph\r\nG.Nodes.Degree = indegree(G);                               % add indegree\r\nbins = conncomp(G,'OutputForm','cell','Type','weak');       % get connected components\r\nbinsizes = cellfun(@length,bins);                           % get bin sizes\r\nsmall = bins(binsizes < 10);                                % if bin has less than 10\r\nsmall = unique([small{:}]);                                 % it is small\r\nG = rmnode(G, small);                                       % remove nodes in the bin\r\norg = G.Nodes.Name(ismember(G.Nodes.Name,org));             % get org nodes\r\ndeg = G.Nodes.Degree(ismember(G.Nodes.Name,org));           % get org node indegrees\r\n[~, ranking] = sort(deg,'descend');                         % rank by indegrees\r\ntopN = org(ranking(1:20));                                  % select top 20\r\nothers = org(~ismember(org,topN));                          % select others\r\nmarkersize = log(G.Nodes.Degree + 2)*3;                     % indeg for marker size\r\nlinewidth = 5*G.Edges.Weight\/max(G.Edges.Weight);           % weight for line width\r\nfigure                                                      % create new figure\r\nh = plot(G,'MarkerSize',markersize,'LineWidth',linewidth,'EdgeAlpha',0.3); % plot graph\r\nhighlight(h, topN,'NodeColor',[.85 .33 .1])                 % highlight top 20 nodes\r\nhighlight(h, others,'NodeColor',[.93 .69 .13])              % highlight others\r\nlabelnode(h,org,org)                                        % label 
nodes\r\ntitle({'NIPS 2015 Paper Author Affiliation';'with 10 or more authors'}) % add title\r\n%% Paper Coauthorship\r\n% Coauthors of a paper may come from different organizations, and this gives \r\n% us an opportunity to see the relationship among those organizations. Let's create \r\n% a directed graph with authors and their papers as nodes. We limit the plot to \r\n% a cluster of organizations with 5 or more nodes by separating the graph into \r\n% connected components with |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/graph.conncomp.html \r\n% conncomp>|.  The same top 20 organizations in terms of number of affiliated \r\n% authors are again colored in bright orange while others are colored in a yellowish \r\n% orange. Interestingly, the plot shows that all orange dots are located in the \r\n% same cluster - so all top 20 organizations belong to a network of coauthors. \r\n% Again, this is based on papers from a single conference, so the results may be \r\n% very different for other years. 
If we track paper coauthorship over multiple \r\n% years, we may find some hidden connections that we don't see here.\r\n\r\nT = AcceptedPapers(:,{'Org','Title'});                      % subset table\r\n[T, ~, idx] = unique(T,'rows');                             % remove duplicates\r\norg = cellstr(T.(1));                                       % organizations\r\npaper = T.(2);                                              % papers\r\nw = accumarray(idx, 1);                                     % count of papers\r\nG = digraph(paper,org,w);                                   % create directed graph\r\nG.Nodes.Degree = indegree(G);                               % add indegree\r\nbins = conncomp(G,'OutputForm','cell','Type','weak');       % get connected components\r\nbinsizes = cellfun(@length,bins);                           % get bin sizes\r\nsmall = bins(binsizes < 5);                                 % if bin has less than 5\r\nsmall = unique([small{:}]);                                 % it is small\r\nG = rmnode(G, small);                                       % remove nodes in the bin\r\norg = G.Nodes.Name(ismember(G.Nodes.Name,org));             % get org nodes\r\n[~,maxBinIdx] = max(binsizes);                              % index of largest component\r\ntopDocs = setdiff(bins{maxBinIdx},org);                     % get docs in largest component\r\nisTopDoc = ismember(AcceptedPapers.Title,topDocs);          % get indices of those docs\r\ntopDocIds = unique(AcceptedPapers.PaperID(isTopDoc));       % get the paper ids of those docs\r\nisTopDoc = ismember(Papers.ID,topDocIds);                   % get indices of those docs\r\nmarkersize = log(G.Nodes.Degree + 2)*3;                     % indeg for marker size\r\nlinewidth = 10*G.Edges.Weight\/max(G.Edges.Weight);          % weight for line width\r\nfigure                                                      % create new figure\r\nh = plot(G,'MarkerSize',markersize,'LineWidth',linewidth,'EdgeAlpha',0.3); % plot 
graph\r\nhighlight(h, topN,'NodeColor',[.85 .33 .1])                 % highlight top 20 nodes\r\nothers = org(~ismember(org,topN));                          % select others\r\nhighlight(h, others,'NodeColor',[.93 .69 .13])              % highlight others\r\nlabelnode(h,topN(1),'Top 20')                               % label nodes\r\nlabelnode(h,others([1,4,18,29,72,92,96,99]),'Others')       % label nodes\r\ntitle({'NIPS 2015 Paper Coauthorship By Affiliation';'with 5 or more nodes'}) % add title\r\n%% Paper Topics\r\n% To find the topics of the papers, I quickly went through the titles of the \r\n% accepted papers and chose 35 words that jumped out at me - see <https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015_topics.xlsx \r\n% nips2015_topics.xlsx>. If you check the word cloud at the top, you \r\n% see some of those words. Obviously, this is a quick-and-dirty approach but I \r\n% just wanted to get a quick sense of the popular topics for now. If you are interested \r\n% in a more proper way to do it, please check out my earlier post <https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/ \r\n% Can You Find Love through Text Analytics?>\r\n% \r\n% The |paper| table contains |Title|, |Abstract| and |PaperText| columns. Which \r\n% should we use to analyze the paper topics? Titles tend to be short. Abstracts \r\n% are longer, and are more likely to contain key phrases that represent the \r\n% paper topic because abstracts are, by definition, a high level overview of the \r\n% content of the paper. Actual content of the paper naturally covers more details \r\n% and that may obscure the main topics of the paper. Let's use |Abstract| \r\n% to generate our word count. \r\n% \r\n% Then let's compare the relative frequency of those terms between the docs \r\n% in the largest connected component we saw earlier and other docs. You can see \r\n% some differences between those two groups. 
For example, a lot of papers discuss \r\n% 'image', but papers from the top 20 organizations talk about it less frequently, \r\n% and they talk more about 'graph'.\r\n\r\nTopics = readtable('nips2015_topics.xlsx');                 % load preselected topics\r\nDTM = zeros(height(Papers),height(Topics));                 % document term matrix\r\nfor i = 1:height(Topics)                                    % loop over topics\r\n    DTM(:,i) = cellfun(@length, ...                         % get number of matches\r\n        regexpi(Papers.Abstract,Topics.Regex{i}));          % find the word in abstract\r\nend\r\ntopDocTopics = sum(DTM(isTopDoc,:));                        % word count in largest component\r\ntopDocTopics = topDocTopics .\/ sum(topDocTopics) *100;      % convert it into relative percentage\r\notherDocTopics = sum(DTM(~isTopDoc,:));                     % word count in others\r\notherDocTopics = otherDocTopics .\/ sum(otherDocTopics) *100;% convert it into relative percentage\r\nfigure                                                      % create new figure\r\nbar([topDocTopics; otherDocTopics]')                        % bar chart\r\nax = gca;                                                   % get current axes handle\r\nax.XTick = 1:height(Topics);                                % set X-axis tick\r\nax.XTickLabel = Topics.Keyword;                             % set X-axis tick label \r\nax.XTickLabelRotation = 90;                                 % rotate X-axis tick label\r\ntitle('Relative Term Frequency by Document Groups')         % add title\r\nlegend('Docs in the Largest Cluster','Other Docs')          % add legend\r\nxlim([0 height(Topics) + 1])                                % set x-axis limits   \r\nylabel('Percentage')                                        % add y-axis label\r\n%% Topic Grouping by Principal Component Analysis\r\n% Let's visualize the relationship between topics using 
<https:\/\/www.mathworks.com\/help\/stats\/principal-component-analysis-pca.html \r\n% Principal Component Analysis>. The resulting |<https:\/\/www.mathworks.com\/help\/stats\/biplot.html \r\n% biplot>| of the first and second components shows roughly three clusters of \r\n% related topics. Topics popular in the largest connected component are highlighted \r\n% in orange and they seem to span across all three clusters. \r\n% \r\n% The purple cluster is dominated by topics favored by the largest connected \r\n% component and focuses on topics like <https:\/\/en.wikipedia.org\/wiki\/Markov_chain_Monte_Carlo \r\n% Markov Chain Monte Carlo> (MCMC), <https:\/\/en.wikipedia.org\/wiki\/Bayesian_statistics \r\n% Bayesian Statistics> and Stochastic Gradient MCMC.\r\n% \r\n% The blue cluster seems to focus on the <https:\/\/en.wikipedia.org\/wiki\/Multi-armed_bandit \r\n% multi-armed bandits> problem which is related to a field of machine learning \r\n% called <https:\/\/en.wikipedia.org\/wiki\/Reinforcement_learning Reinforcement Learning>. \r\n% Topics like 'market' and 'risk' are highlighted in orange, indicating that papers \r\n% on these topics from the top 20 organizations probably focused on financial \r\n% applications.  \r\n\r\nw = 1 .\/ var(DTM);                                          % inverse variable variances\r\n[wcoeff, score, latent, tsquared, explained] = ...          % weighted PCA with w\r\n    pca(DTM, 'VariableWeights', w);\r\ncoefforth = diag(sqrt(w)) * wcoeff;                         % turn wcoeff to orthonormal\r\nlabels = Topics.Keyword;                                    % Topics as labels\r\ntopT = Topics.Keyword((topDocTopics - otherDocTopics) > 1); % topics popular in top cluster\r\nfigure                                                      % new figure\r\nbiplot(coefforth(:,1:2), 'Scores', score(:,1:2), ...        
% 2D biplot with the first two comps\r\n    'VarLabels', labels)\r\ntitle('Principal Components Analysis of Paper Topics')      % add title\r\nfor i =  1:length(topT)                                     % loop over popular topics\r\n    htext = findobj(gca,'String',topT{i});                  % find text object\r\n    htext.Color = [.85 .33 .1];                             % highlight by color\r\nend\r\nrectangle('Position',[.05 -.1 .5 .3],'Curvature',1, ...     % add rectangle\r\n    'EdgeColor',[0 .5 0])\r\nrectangle('Position',[-.23 .05 .25 .55],'Curvature',1, ...  % add rectangle\r\n    'EdgeColor',[.6 .1 .5])\r\nrectangle('Position',[-.35 -.4 .34 .42],'Curvature',1, ...  % add rectangle\r\n    'EdgeColor',[.1 .2 .6])\r\n%% Deep Learning\r\n% The green cluster is very busy and hard to see. Let's zoom into the green \r\n% cluster to see more details using |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/axis.html \r\n% axis>|. <https:\/\/www.mathworks.com\/discovery\/deep-learning.html Deep learning> \r\n% seems to be the main topic of this cluster. CNN (<https:\/\/www.mathworks.com\/help\/nnet\/convolutional-neural-networks.html \r\n% Convolutional Neural Networks>) is a deep learning algorithm often used for \r\n% image classification. It makes sense that it is close to the 'image' topic in \r\n% the biplot. RNN (<https:\/\/en.wikipedia.org\/wiki\/Recurrent_neural_network Recurrent \r\n% Neural Networks>) tends to be used in Natural Language Processing and it appears \r\n% close to 'text'. <https:\/\/www.mathworks.com\/help\/nnet\/autoencoders.html Autoencoders> \r\n% and <https:\/\/en.wikipedia.org\/wiki\/Long_short-term_memory LSTM> (Long Short-Term \r\n% Memory) are also deep learning algorithms. <https:\/\/en.wikipedia.org\/wiki\/Maximum_a_posteriori_estimation \r\n% MAP> (Maximum A Posteriori) and 'deep neural networks' are the only topics popular \r\n% in the top 20 organizations that are found in this cluster. 
Because we are just \r\n% comparing the relative frequency of word occurrence, if a lot of papers talk about \r\n% deep learning related topics, then there are barely any significant frequency \r\n% differences between the top 20 and others, and those words won't be highlighted. \r\n\r\naxis([-0.1 0.5 -0.1 0.2]);                                  % define axis limits\r\n%% Core Algorithms\r\n% The topics found at the center of the biplot are related to more established \r\n% core machine learning techniques such as <https:\/\/www.mathworks.com\/discovery\/support-vector-machine.html \r\n% Support Vector Machines (SVM)>, <https:\/\/www.mathworks.com\/help\/stats\/principal-component-analysis-pca.html \r\n% Principal Component Analysis (PCA)>, <https:\/\/www.mathworks.com\/help\/stats\/hidden-markov-models.html \r\n% Hidden Markov Models (HMM)> or <https:\/\/www.mathworks.com\/help\/stats\/lasso-regularization.html \r\n% Least Absolute Shrinkage And Selection Operator (LASSO)>. Papers from the top \r\n% 20 organizations seem to be interested in tensors and multi-class classification \r\n% problems, along with graphs and <https:\/\/www.mathworks.com\/help\/stats\/gaussian-process-regression.html \r\n% Gaussian Process>.\r\n\r\naxis([-0.06 0.06 -0.06 0.06]);                                  % define axis limits\r\n%% Commercial Research\r\n% The top 20 organizations include some commercial entities such as Google, \r\n% IBM and Microsoft. The topics of their research papers probably reflect commercial \r\n% interests in the field of machine learning. We can use the same biplot and highlight \r\n% the topics that frequently appear in the papers affiliated with them. The plot \r\n% shows that the three companies tend to cover different topics while they all \r\n% engage in some deep learning related research. You can also see that Google \r\n% tends to cover multiple fields while IBM and Microsoft seem to have a narrower \r\n% focus. 
\r\n\r\nisGoogler = AcceptedPapers.Org == 'Google';                 % find indices of Google authors\r\nGooglePaperIds = unique(AcceptedPapers.PaperID(isGoogler)); % find their paper ids\r\nisGooglePaper = ismember(Papers.ID,GooglePaperIds);         % get the paper indices\r\nGoogleTopics = sum(DTM(isGooglePaper,:));                   % sum Google rows\r\nGoogleTopics = GoogleTopics .\/ sum(GoogleTopics) *100;      % convert it into relative percentage\r\nisIBMer = AcceptedPapers.Org == 'IBM';                      % find indices of IBM authors\r\nIBMPaperIds = unique(AcceptedPapers.PaperID(isIBMer));      % find their paper ids\r\nisIBMPaper = ismember(Papers.ID,IBMPaperIds);               % get the paper indices\r\nIBMTopics = sum(DTM(isIBMPaper,:));                         % sum IBM rows\r\nIBMTopics = IBMTopics .\/ sum(IBMTopics) *100;               % convert it into relative percentage\r\nisMSofter = AcceptedPapers.Org == 'Microsoft';              % find indices of Microsoft authors\r\nMSPaperIds = unique(AcceptedPapers.PaperID(isMSofter));     % find their paper ids\r\nisMSPaper = ismember(Papers.ID,MSPaperIds);                 % get the paper indices\r\nMSTopics = sum(DTM(isMSPaper,:));                           % sum Microsoft rows\r\nMSTopics = MSTopics .\/ sum(MSTopics) *100;                  % convert it into relative percentage\r\ncommercialTopics = [GoogleTopics; IBMTopics; MSTopics];     % combine all\r\nfigure                                                      % new figure\r\nbiplot(coefforth(:,1:2), 'Scores', score(:,1:2), ...        
% 2D biplot with the first two comps\r\n    'VarLabels', labels)\r\nhline = findobj(gca,'LineStyle','none');                    % get line handles of observations\r\nfor i = 1:length(hline)                                     % loop over observations\r\n    hline(i).Visible = 'off';                               % make it invisible\r\nend\r\nhtext = findobj(gca,'Type','text');                         % get text handles\r\ntcolor = [0 .5 0;.85 .33 .1; .1 .2 .6];                     % define text color\r\nfor i = 1:length(htext)                                     % loop over text\r\n   r = commercialTopics(:,strcmp(labels,htext(i).String));  % get ratios\r\n   if sum(r) == 0                                           % if all rows are zero\r\n       htext(i).Visible = 'off';                            % make it invisible\r\n   else                                                     % otherwise\r\n       [~,idx] = max(r);                                    % get max row\r\n       htext(i).Color = tcolor(idx,:);                      % use matching color\r\n   end\r\nend\r\ntext(-.4,.3,'Google','Color',tcolor(1,:),'FontSize',14)     % annotate\r\ntext(.3,-.1,'IBM','Color',tcolor(2,:),'FontSize',14)        % annotate\r\ntext(-.4,-.2,'Microsoft','Color',tcolor(3,:),'FontSize',14) % annotate\r\ntitle({'Principal Components Analysis of Paper Topics';     % add title\r\n    'highlighting Google, IBM and Microsoft topics'})      \r\n%% Top 10 Authors in NIPS 2015\r\n% Some authors got multiple papers accepted by NIPS.  Are there any specific \r\n% topics that give them an edge? Let's take a look at the top 10 authors in terms \r\n% of number of accepted papers and see what topics come up in those papers. It \r\n% turns out the topics of the top 10 authors don't belong to specific clusters, \r\n% and the vectors of those topics are shorter, meaning they are neither uncommon \r\n% nor as frequently discussed as topics such as 'bandits', 'CNN' or \r\n% 'MCMC'. 
\r\n\r\n[auth_ids,~,idx] = unique(AcceptedPapers.ID);               % get author ids\r\ncount = accumarray(idx,1);                                  % get count\r\n[~,ranking] = sort(count,'descend');                        % get ranking\r\ntop10_ids = auth_ids(ranking(1:10));                        % get top 10 ids\r\nisTop10 = ismember(AcceptedPapers.ID,top10_ids);            % get row indices\r\ntop10_paper_ids = unique(AcceptedPapers.PaperID(isTop10));  % get top 10 paper ids\r\nisTop10paper = ismember(Papers.ID,top10_paper_ids);         % get row indices\r\ntop10Topics = sum(DTM(isTop10paper ,:));                    % sum top 10 rows\r\ntop10Topics = top10Topics .\/ sum(top10Topics) *100;         % convert it into relative percentage\r\nnotTop10Topics = sum(DTM(~isTop10paper,:));                 % word count in others\r\nnotTop10Topics = notTop10Topics .\/ sum(notTop10Topics) *100;% convert it into relative percentage\r\ncombined = [top10Topics;notTop10Topics];                    % combine all\r\n[isTop10Author,order] = ismember(Authors.ID,top10_ids);     % get indices of top 10 authors\r\n[~,order] = sort(order(isTop10Author));                     % get ranking\r\nnames = Authors.Name(isTop10Author);                        % get names\r\nfigure                                                      % create new figure\r\nbiplot(coefforth(:,1:2), 'Scores', score(:,1:2), ...        
% 2D biplot with the first two comps\r\n    'VarLabels', labels)\r\nhline = findobj(gca,'LineStyle','none');                    % get line handles of observations\r\nfor i = 1:length(hline)                                     % loop over observations\r\n    hline(i).Visible = 'off';                               % make it invisible\r\nend\r\nhtext = findobj(gca,'Type','text');                         % get text handles\r\ntcolor = [.85 .33 .1; .1 .2 .6];                            % define text color\r\nfor i = 1:length(htext)                                     % loop over text\r\n   r = combined(:,strcmp(labels,htext(i).String));          % get ratios\r\n   [~,idx] = max(r);                                        % get max row\r\n   if idx == 1 && r(1) > 3                                  % if row 1 & r > 3\r\n       htext(i).Color = [.85 .33 .1];                       % highlight text\r\n   else\r\n        htext(i).Color = [.6 .6 .6];                        % ghost text\r\n   end\r\nend\r\ntitle({'Principal Components Analysis of Paper Topics';     % add title\r\n    'highlighting topics by top 10 authors'})\r\ntext(-.5,.3,'Top 10 ','FontWeight','bold')                  % annotate\r\ntext(-.5,0,names(order))                                    % annotate\r\n%% Summary\r\n% This is a fairly simple, quick exploration of the dataset, but we got some \r\n% interesting insights about the current state of machine learning research presented \r\n% at NIPS 2015. Maybe you can find even more if you dig deeper. Perhaps you can \r\n% use the technique in the post <https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/ \r\n% Can You Find Love through Text Analytics?> to cluster similar papers. 
You can \r\n% use word-based tokens, but you may want to use an n-gram approach described \r\n% in another post, <https:\/\/blogs.mathworks.com\/loren\/2015\/09\/09\/text-mining-shakespeare-with-matlab\/ \r\n% Text Mining Shakespeare with MATLAB>. Give it a try and share your\r\n% findings <https:\/\/blogs.mathworks.com\/loren\/?p=1692#respond here>!\r\n% \r\n%\r\n##### SOURCE END ##### e9ed5c7f2df44be9b78ff6e976a2c963\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/nips2015Final_08.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Publish_or_perish\">Publish or perish<\/a>, they say in academia, and you can learn trends in academic research through analysis of published papers. Today's guest blogger, Toshi, came across a dataset of machine learning papers presented in a conference. Let's see what he found!... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2016\/08\/08\/text-mining-machine-learning-research-papers-with-matlab\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,43,61,48],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1692"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1692"}],"version-history":[{"count":2,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1692\/revisions"}],"predecessor-version":[{"id":1694,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1692\/revisions\/1694"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1692"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1692"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1692"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}