{"id":2346,"date":"2017-07-10T10:39:32","date_gmt":"2017-07-10T15:39:32","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=2346"},"modified":"2017-07-12T12:56:22","modified_gmt":"2017-07-12T17:56:22","slug":"web-scraping-and-mining-unstructured-data-with-matlab","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2017\/07\/10\/web-scraping-and-mining-unstructured-data-with-matlab\/","title":{"rendered":"Web Scraping and Mining Unstructured Data with MATLAB"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>A lot of information is shared on the web and a lot of people are interested in taking advantage of it. It can be used to enrich the existing data, for example. However, information is buries in HTML tags and it is not easy to extract useful information. Today's guest blogger, <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521\">Toshi Takeuchi<\/a> shows us how he uses MATLAB for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Web_scraping\">web scraping<\/a> to harvest useful data from the web and then uses fuzzy string match to enrich existing data.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/nips2015.gif\" alt=\"\"> <\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#05c75ff3-1b78-4f76-a826-4c6f9f5db7ea\">NIPS 2015 Papers<\/a><\/li><li><a href=\"#0b4dd3c5-8737-47ca-b0b0-0cf5c43ed2da\">Scraping Data from a Web Page<\/a><\/li><li><a href=\"#1e229b43-93a1-4e6d-b392-76e625160436\">Matching Scraped Data to Database Table<\/a><\/li><li><a href=\"#ec389ec7-d6f6-4fcd-88f2-b294d31805be\">Fuzzy String Matching<\/a><\/li><li><a href=\"#dbaaf030-f2fb-4c5b-8045-e2a04922fa5f\">Validating Fuzzy Match Results<\/a><\/li><li><a href=\"#db4febce-1ce0-4831-8009-ab3b140f57bc\">Updating Missing Values with Fuzzy Match Result<\/a><\/li><li><a href=\"#fd6f356a-6412-4ea7-b3c9-fad0c4412864\">Reviewing 
Unmatched<\/a><\/li><li><a href=\"#52bfa40f-7b93-4c49-ba57-605d65d73933\">Substring Match<\/a><\/li><li><a href=\"#aae78a85-d666-4721-b297-32c70fa2b96c\">Updating Missing Values with Substring Match Result<\/a><\/li><li><a href=\"#dae88173-8579-47c2-bfa3-52d988c2d4ab\">Visualizing Paper Author Affiliation<\/a><\/li><li><a href=\"#8b68f342-5771-48a0-82ea-faed60b581e7\">Summary<\/a><\/li><\/ul><\/div><h4>NIPS 2015 Papers<a name=\"05c75ff3-1b78-4f76-a826-4c6f9f5db7ea\"><\/a><\/h4><p>Web scraping is actually pretty easy with MATLAB thanks to the new string functions introduced in R2016b.<\/p><p>I am going to use as an example the same data used in <a title=\"https:\/\/blogs.mathworks.com\/loren\/2016\/08\/08\/text-mining-machine-learning-research-papers-with-matlab\/ (link no longer works)\">Text Mining Machine Learning Research Papers with MATLAB<\/a>.<\/p><p>If you would like to follow along, please download:<\/p><div><ul><li>the source of this post by clicking on \"Get the MATLAB code\" at the bottom of this page<\/li><li>the data from Kaggle's <a href=\"https:\/\/www.kaggle.com\/benhamner\/nips-2015-papers\">NIPS 2015 Papers<\/a> page<\/li><li>my custom function <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/levenshtein.m\">levenshtein.m<\/a><\/li><li>my custom script to generate GIF animation <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/animateNIPS2015.m\">animateNIPS2015.m<\/a><\/li><\/ul><\/div><p>Here I am using <tt><a href=\"https:\/\/www.mathworks.com\/help\/database\/ug\/sqlite.html\">sqlite<\/a><\/tt> in <a href=\"https:\/\/www.mathworks.com\/products\/database\/\">Database Toolbox<\/a> to load data from an SQLite file. If you don't have Database Toolbox, you can try <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\">readtable<\/a><\/tt> to read CSV files. 
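For readers following along without MATLAB, the same kind of whole-table fetch can be done with Python's built-in sqlite3 module. This is only a sketch: against the Kaggle file you would connect to output/database.sqlite, but here a hypothetical in-memory stand-in with the Authors schema keeps it self-contained.

```python
import sqlite3

def fetch_table(conn, name):
    """Return (column_names, rows) for an entire table, like fetch() does."""
    cur = conn.execute("SELECT * FROM %s" % name)
    cols = [d[0] for d in cur.description]  # column names from the cursor
    return cols, cur.fetchall()

# With the Kaggle file: conn = sqlite3.connect("output/database.sqlite")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Authors (ID INTEGER, Name TEXT)")
conn.executemany("INSERT INTO Authors VALUES (?, ?)",
                 [(178, "Yoshua Bengio"), (200, "Yann LeCun")])
cols, rows = fetch_table(conn, "Authors")
conn.close()
```

The same helper would work for the Papers and PaperAuthors tables by changing the table name.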
The <tt>Authors<\/tt> table only lists names, but I want to enrich it with authors' affiliations to see which organizations are active in this academic conference.<\/p><pre class=\"codeinput\">db = <span class=\"string\">'output\/database.sqlite'<\/span>;                                  <span class=\"comment\">% database file<\/span>\r\nconn = sqlite(db,<span class=\"string\">'readonly'<\/span>);                                   <span class=\"comment\">% create connection<\/span>\r\nAuthors = fetch(conn,<span class=\"string\">'SELECT * FROM Authors'<\/span>);                  <span class=\"comment\">% get data with SQL command<\/span>\r\nPapers = fetch(conn,<span class=\"string\">'SELECT * FROM Papers'<\/span>);                    <span class=\"comment\">% get data with SQL command<\/span>\r\nPaperAuthors = fetch(conn,<span class=\"string\">'SELECT * FROM PaperAuthors'<\/span>);        <span class=\"comment\">% get data with SQL command<\/span>\r\nclose(conn)                                                     <span class=\"comment\">% close connection<\/span>\r\nAuthors = cell2table(Authors,<span class=\"string\">'VariableNames'<\/span>,{<span class=\"string\">'ID'<\/span>,<span class=\"string\">'Name'<\/span>});    <span class=\"comment\">% convert to table<\/span>\r\nPapers = cell2table(Papers,<span class=\"string\">'VariableNames'<\/span>, <span class=\"keyword\">...<\/span><span class=\"comment\">                 % convert to table<\/span>\r\n    {<span class=\"string\">'ID'<\/span>,<span class=\"string\">'Title'<\/span>,<span class=\"string\">'EventType'<\/span>,<span class=\"string\">'PdfName'<\/span>,<span class=\"string\">'Abstract'<\/span>,<span class=\"string\">'PaperText'<\/span>});\r\nPaperAuthors = cell2table(PaperAuthors,<span class=\"string\">'VariableNames'<\/span>, <span class=\"keyword\">...<\/span><span class=\"comment\">     % convert to table<\/span>\r\n    {<span class=\"string\">'ID'<\/span>,<span class=\"string\">'PaperID'<\/span>,<span 
class=\"string\">'AuthorID'<\/span>});\r\nhead(Authors)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n  8&times;2 table\r\n    ID              Name         \r\n    ___    ______________________\r\n    178    'Yoshua Bengio'       \r\n    200    'Yann LeCun'          \r\n    205    'Avrim Blum'          \r\n    347    'Jonathan D. Cohen'   \r\n    350    'Samy Bengio'         \r\n    521    'Alon Orlitsky'       \r\n    549    'Wulfram Gerstner'    \r\n    575    'Robert C. Williamson'\r\n<\/pre><p>Luckily, there is an HTML file that lists each paper with its authors and their affiliation. Each list item start with <tt>&lt;i&gt;&lt;span class=\"larger-font\"&gt;<\/tt> and ends with <tt>&lt;\/b&gt;&lt;br&gt;&lt;br&gt;<\/tt>, titles and authors are separated by <tt>&lt;\/span&gt;&lt;\/i&gt;&lt;br&gt;&lt;b&gt;<\/tt>, mutiple co-authors by semicolon,  and finally the names and affiliation by comma.<\/p><pre class=\"codeinput\">dbtype <span class=\"string\">output\/accepted_papers.html<\/span> <span class=\"string\">397:400<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n397           &lt;div&gt;&lt;div&gt;&lt;h3&gt;NIPS 2015 Accepted Papers&lt;\/h3&gt;&lt;p&gt;&lt;br&gt;&lt;\/p&gt;\r\n398   &lt;i&gt;&lt;span class=\"larger-font\"&gt;Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing&lt;\/span&gt;&lt;\/i&gt;&lt;br&gt;&lt;b&gt;\r\n399   Nihar Shah*, UC Berkeley; Dengyong Zhou, MSR&lt;\/b&gt;&lt;br&gt;&lt;br&gt;&lt;i&gt;&lt;span class=\"larger-font\"&gt;Learning with Symmetric Label Noise: The Importance of Being Unhinged&lt;\/span&gt;&lt;\/i&gt;&lt;br&gt;&lt;b&gt;\r\n400   Brendan van Rooyen, NICTA; Aditya Menon*, NICTA; Robert Williamson, NICTA&lt;\/b&gt;&lt;br&gt;&lt;br&gt;&lt;i&gt;&lt;span class=\"larger-font\"&gt;Algorithmic Stability and Uniform Generalization&lt;\/span&gt;&lt;\/i&gt;&lt;br&gt;&lt;b&gt;\r\n<\/pre><p>Since I have the html file locally, I use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/fileread.html\">fileread<\/a> to 
load text from it. If you want to scrape a web page directly, you would use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/webread.html\">webread<\/a> instead. The imported text is converted into a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/string.html\">string<\/a> to take advantage of built-in string functions.<\/p><pre class=\"codeinput\">html = string(fileread(<span class=\"string\">'output\/accepted_papers.html'<\/span>));         <span class=\"comment\">% load text from file<\/span>\r\n<span class=\"comment\">% html = string(webread('https:\/\/nips.cc\/Conferences\/2015\/AcceptedPapers'));<\/span>\r\n<\/pre><h4>Scraping Data from a Web Page<a name=\"0b4dd3c5-8737-47ca-b0b0-0cf5c43ed2da\"><\/a><\/h4><p>Usually, scraping data from a web page or other unstructured text data sources requires <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html\">regular expressions<\/a>, which many people find powerful but very difficult to use. String functions in MATLAB like <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractbetween.html\">extractBetween<\/a><\/tt>, <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractbefore.html\">extractBefore<\/a><\/tt>, <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractafter.html\">extractAfter<\/a><\/tt>, <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/erase.html\">erase<\/a><\/tt>, and <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/replace.html\">replace<\/a><\/tt> make it ridiculously simple!<\/p><pre class=\"codeinput\">pattern1 = <span class=\"string\">'&lt;div&gt;&lt;h3&gt;NIPS 2015 Accepted Papers&lt;\/h3&gt;&lt;p&gt;&lt;br&gt;&lt;\/p&gt;'<\/span>;<span class=\"comment\">% start of list<\/span>\r\npattern2 = <span class=\"string\">'&lt;\/div&gt;  &lt;!--div class=\"col-xs-12 col-sm-9\"--&gt;'<\/span>;     <span class=\"comment\">% end of list<\/span>\r\nlist = extractBetween(html, pattern1, 
pattern2);                <span class=\"comment\">% extract list<\/span>\r\npattern1 = <span class=\"string\">'&lt;i&gt;&lt;span class=\"larger-font\"&gt;'<\/span>;                     <span class=\"comment\">% start of list item<\/span>\r\npattern2 = <span class=\"string\">'&lt;\/b&gt;&lt;br&gt;&lt;br&gt;'<\/span>;                                      <span class=\"comment\">% end of list item<\/span>\r\nlistitems = extractBetween(list, pattern1, pattern2);           <span class=\"comment\">% extract list items<\/span>\r\npattern1 = [<span class=\"string\">'&lt;\/span&gt;&lt;\/i&gt;&lt;br&gt;&lt;b&gt;'<\/span> newline];                      <span class=\"comment\">% end of title<\/span>\r\ntitles = extractBefore(listitems,pattern1);                     <span class=\"comment\">% extract titles<\/span>\r\nnamesorgs = extractAfter(listitems,pattern1);                   <span class=\"comment\">% extract names orgs<\/span>\r\nnamesorgs = erase(namesorgs,<span class=\"string\">'*'<\/span>);                               <span class=\"comment\">% erase *<\/span>\r\nnamesorgs = erase(namesorgs,<span class=\"string\">'\"'<\/span>);                               <span class=\"comment\">% erase \"<\/span>\r\nnamesorgs = replace(namesorgs,<span class=\"string\">'  '<\/span>, <span class=\"string\">' '<\/span>);                       <span class=\"comment\">% remove double space<\/span>\r\ndisp([titles(1:2) namesorgs(1:2)])\r\n<\/pre><pre class=\"codeoutput\">    \"Double or Nothing: Multiplicati&#8230;\"    \"Nihar Shah, UC Berkeley; Dengyo&#8230;\"\r\n    \"Learning with Symmetric Label N&#8230;\"    \"Brendan van Rooyen, NICTA; Adit&#8230;\"\r\n<\/pre><p>Since multiple co-authors are still contained in a single string, let's separate them into a list of co-authors and their affiliations. When you split a string, you get a varying number of substrings in each row. 
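As an aside for Python readers, this split-and-flatten step can be sketched as follows (a hypothetical helper; the sample rows come from the earlier output):

```python
def split_coauthors(rows):
    """Split each 'Name, Org; Name, Org' row on semicolons, trim
    whitespace, and flatten the per-row lists into one list."""
    flat = []
    for row in rows:
        flat.extend(part.strip() for part in row.split(";") if part.strip())
    return flat

namesorgs = ["Nihar Shah, UC Berkeley; Dengyong Zhou, MSR",
             "Brendan van Rooyen, NICTA; Aditya Menon, NICTA"]
coauth = split_coauthors(namesorgs)
```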
So we need to use <tt>arrayfun<\/tt> with the <tt>UniformOutput<\/tt> option set to <tt>false<\/tt> to split it row by row, trim the result with <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strtrim.html\">strtrim<\/a><\/tt>, and unnest the cell array with <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/vertcat.html\">vertcat<\/a><\/tt> to get the list of co-authors with their affiliations.<\/p><pre class=\"codeinput\">namesorgs = replace(namesorgs,<span class=\"string\">'&amp;amp;'<\/span>,<span class=\"string\">'&amp;'<\/span>);                     <span class=\"comment\">% revert escaped &amp;<\/span>\r\nnamesorgs = erase(namesorgs,[char(194) <span class=\"string\">'&nbsp;&lt;\/b&gt;&lt;span&gt;&lt;strong&gt;'<\/span>]); <span class=\"comment\">% remove extra tags<\/span>\r\nnamesorgs = replace(namesorgs,[<span class=\"string\">'&lt;\/strong&gt;&lt;\/span&gt;&lt;b&gt;'<\/span> <span class=\"keyword\">...<\/span><span class=\"comment\">        % replace missing semicolon<\/span>\r\n    char(194)],<span class=\"string\">';'<\/span>);\r\ncoauth = arrayfun(@(x) strtrim(split(x,<span class=\"string\">';'<\/span>)), namesorgs, <span class=\"keyword\">...<\/span><span class=\"comment\">    % split by semicolon<\/span>\r\n    <span class=\"string\">'UniformOutput'<\/span>, false);                                    <span class=\"comment\">% and trim white space<\/span>\r\ncoauth = vertcat(coauth{:});                                    <span class=\"comment\">% unnest cell array<\/span>\r\ncoauth(1:5)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  5&times;1 string array\r\n    \"Nihar Shah, UC Berkeley\"\r\n    \"Dengyong Zhou, MSR\"\r\n    \"Brendan van Rooyen, NICTA\"\r\n    \"Aditya Menon, NICTA\"\r\n    \"Robert Williamson, NICTA\"\r\n<\/pre><h4>Matching Scraped Data to Database Table<a name=\"1e229b43-93a1-4e6d-b392-76e625160436\"><\/a><\/h4><p>You now see how easy web scraping is with MATLAB.<\/p><p>Now that we have the list of names with affiliations, we just have to 
match it to the <tt>Authors<\/tt> table by name, right? Unfortunately, we see a lot of missing values because the names didn't match even with the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/contains.html\">contains<\/a><\/tt> partial match function.<\/p><p>The really hard part is what to do after you have scraped the data from the web.<\/p><pre class=\"codeinput\">authors = Authors.Name;                                         <span class=\"comment\">% author names<\/span>\r\nnames = strtrim(extractBefore(coauth,<span class=\"string\">','<\/span>));                     <span class=\"comment\">% extract and trim names<\/span>\r\norg = strings(length(authors),1);                               <span class=\"comment\">% initialize accumulator<\/span>\r\n<span class=\"keyword\">for<\/span> ii = 1:length(authors)                                      <span class=\"comment\">% for each name in |authors|<\/span>\r\n    res = coauth(contains(names,authors(ii)));                  <span class=\"comment\">% find match in |names|<\/span>\r\n    <span class=\"keyword\">if<\/span> isempty(res)                                             <span class=\"comment\">% if no match<\/span>\r\n        org(ii) = missing;                                      <span class=\"comment\">% mark it missing<\/span>\r\n    <span class=\"keyword\">end<\/span>\r\n    res = extractAfter(res,<span class=\"string\">','<\/span>);                                <span class=\"comment\">% extract after comma<\/span>\r\n    res = strtrim(res);                                         <span class=\"comment\">% remove white space<\/span>\r\n    res = unique(res);                                          <span class=\"comment\">% remove duplicates<\/span>\r\n    res(strlength(res) == 0) = [];                              <span class=\"comment\">% remove empty string<\/span>\r\n    <span class=\"keyword\">if<\/span> length(res) == 1                                         <span class=\"comment\">% if 
single string<\/span>\r\n        org(ii) = res;                                          <span class=\"comment\">% use it as is<\/span>\r\n    <span class=\"keyword\">elseif<\/span> length(res) &gt; 1                                      <span class=\"comment\">% if multiple strings<\/span>\r\n        org(ii) = join(res,<span class=\"string\">';'<\/span>);                                <span class=\"comment\">% join them with semicolon<\/span>\r\n    <span class=\"keyword\">else<\/span>                                                        <span class=\"comment\">% otherwise<\/span>\r\n        org(ii) = missing;                                      <span class=\"comment\">% mark it missing<\/span>\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nhead(table(authors, org, <span class=\"string\">'VariableNames'<\/span>,{<span class=\"string\">'Name'<\/span>,<span class=\"string\">'Org'<\/span>}))\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n  8&times;2 table\r\n             Name                              Org                 \r\n    ______________________    _____________________________________\r\n    'Yoshua Bengio'           \"U. Montreal\"                        \r\n    'Yann LeCun'              \"New York University\"                \r\n    'Avrim Blum'              &lt;missing&gt;                            \r\n    'Jonathan D. Cohen'       &lt;missing&gt;                            \r\n    'Samy Bengio'             \"Google Research\"                    \r\n    'Alon Orlitsky'           \"University of California, San Diego\"\r\n    'Wulfram Gerstner'        \"EPFL\"                               \r\n    'Robert C. Williamson'    &lt;missing&gt;                            \r\n<\/pre><p>For example, the partial match doesn't work if the middle initial is missing or nicknames are used instead of full names. There can be other irregularities. 
Yikes, we are dealing with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Unstructured_data\">unstructured data<\/a>!<\/p><pre class=\"codeinput\">[authors(4) coauth(contains(names,<span class=\"string\">'Jonathan Cohen'<\/span>));\r\n    authors(8) coauth(contains(names,<span class=\"string\">'Robert Williamson'<\/span>));\r\n    authors(406) coauth(contains(names,<span class=\"string\">'Sanmi Koyejo'<\/span>));\r\n    authors(440) coauth(contains(names,<span class=\"string\">'Danilo Rezende'<\/span>));\r\n    authors(743) coauth(contains(names,<span class=\"string\">'Bill Dally'<\/span>));\r\n    authors(769) coauth(contains(names,<span class=\"string\">'Julian Yarkony'<\/span>))]\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  6&times;2 string array\r\n    \"Jonathan D. Cohen\"         \"Jonathan Cohen, Princeton Unive&#8230;\"\r\n    \"Robert C. Williamson\"      \"Robert Williamson, NICTA\"         \r\n    \"Oluwasanmi O. Koyejo\"      \"Sanmi Koyejo, Stanford University\"\r\n    \"Danilo Jimenez Rezende\"    \"Danilo Rezende, Google DeepMind\"  \r\n    \"William Dally\"             \"Bill Dally , Stanford Universit&#8230;\"\r\n    \"Julian E. Yarkony\"         \"Julian Yarkony, Dr.\"              \r\n<\/pre><h4>Fuzzy String Matching<a name=\"ec389ec7-d6f6-4fcd-88f2-b294d31805be\"><\/a><\/h4><p>What can we do when the exact match approach doesn't work? Maybe we can come up with various rules to match strings using regular expressions, but that is very time-consuming. Let's revisit the <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/10\/14\/40-year-old-algorithm-that-cannot-be-improved\/\">40-year-old Algorithm That Cannot Be Improved<\/a> to solve this problem. I created a new custom function <tt>levenshtein<\/tt> for this example. It calculates the edit distance, which measures the minimum number of edit operations required to transform one string into another, as a way to quantify how similar or dissimilar they are. 
For more details of this algorithm, please check out the blog post linked above.<\/p><p>Converting \"sunday\" to \"saturday\" requires 3 edit operations.<\/p><pre class=\"codeinput\">levenshtein(<span class=\"string\">'sunday'<\/span>, <span class=\"string\">'saturday'<\/span>)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n     3\r\n<\/pre><p>Perhaps it is easier to understand if I show how similar they are as a match rate rather than the number of edit operations?<\/p><pre class=\"codeinput\">levenshtein(<span class=\"string\">'sunday'<\/span>, <span class=\"string\">'saturday'<\/span>, <span class=\"string\">'ratio'<\/span>)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n      0.78571\r\n<\/pre><p>Now we can find \"Jonathan Cohen\" in the top 3 matches for \"Jonathan D. Cohen\".<\/p><pre class=\"codeinput\">fhandle = @(s) levenshtein(authors(4), s, <span class=\"string\">'ratio'<\/span>);             <span class=\"comment\">% function handle<\/span>\r\nratios = arrayfun(fhandle, extractBefore(coauth,<span class=\"string\">','<\/span>));          <span class=\"comment\">% get match rates<\/span>\r\n[~,idx] = sort(ratios,<span class=\"string\">'descend'<\/span>);                               <span class=\"comment\">% rank by match rate<\/span>\r\n[repmat(authors(4),[3,1]) coauth(idx(1:3)) ratios(idx(1:3))]\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  3&times;3 string array\r\n    \"Jonathan D. Cohen\"    \"Jonathan Cohen, Pr&#8230;\"    \"0.90323\"\r\n    \"Jonathan D. Cohen\"    \"Jonathan Bassen, s&#8230;\"    \"0.8125\" \r\n    \"Jonathan D. Cohen\"    \"Jonathan Vacher, U&#8230;\"    \"0.8125\" \r\n<\/pre><h4>Validating Fuzzy Match Results<a name=\"dbaaf030-f2fb-4c5b-8045-e2a04922fa5f\"><\/a><\/h4><p>Let's try this approach for the first 10 missing names with the <tt>ignoreCase<\/tt> option enabled. 
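Since levenshtein.m is a custom function, here is a rough Python sketch of the idea behind it. The exact 'ratio' formula is an assumption on my part: (m + n - d) / (m + n), where d is the edit distance and m, n are the string lengths, happens to reproduce the 0.78571 shown for sunday/saturday.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ratio(a, b):
    """Similarity in [0, 1]; assumed formula (m + n - d) / (m + n)."""
    m, n = len(a), len(b)
    return (m + n - levenshtein(a, b)) / (m + n)
```

Under that assumption, `levenshtein("sunday", "saturday")` is 3 and `ratio("sunday", "saturday")` rounds to 0.78571, matching the MATLAB output above.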
It looks like we can be fairly confident about the result as long as the maximum match rate is 0.8 or higher.<\/p><pre class=\"codeinput\">org(org == <span class=\"string\">'Dr.'<\/span>) = missing;                                    <span class=\"comment\">% remove salutation<\/span>\r\nmissing_ids = find(ismissing(org));                             <span class=\"comment\">% get missing ids<\/span>\r\nmatches = strings(1,3);                                         <span class=\"comment\">% initialize accumulator<\/span>\r\n<span class=\"keyword\">for<\/span> ii = 1:10                                                   <span class=\"comment\">% for each missing id<\/span>\r\n    cid = missing_ids(ii);                                      <span class=\"comment\">% current id<\/span>\r\n    fhandle = @(s) levenshtein(authors(cid), s, <span class=\"string\">'ratio'<\/span>, <span class=\"string\">'ignoreCase'<\/span>);\r\n    ratios = arrayfun(fhandle, names);                          <span class=\"comment\">% get match rates<\/span>\r\n    [~,idx] = max(ratios);                                      <span class=\"comment\">% get index of max value<\/span>\r\n    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)];     <span class=\"comment\">% update accumulator<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nmatches\r\n<\/pre><pre class=\"codeoutput\">matches = \r\n  10&times;3 string array\r\n    \"Avrim Blum\"              \"Manuel Blum, Unive&#8230;\"    \"0.71429\"\r\n    \"Jonathan D. Cohen\"       \"Jonathan Cohen, Pr&#8230;\"    \"0.90323\"\r\n    \"Robert C. Williamson\"    \"Robert Williamson,&#8230;\"    \"0.91892\"\r\n    \"Juergen Schmidhuber\"     \"J?rgen Schmidhuber,\"     \"0.94595\"\r\n    \"Wolfgang Maass\"          \"Wolfgang Maass,\"         \"1\"      \r\n    \"Robert E. 
Schapire\"      \"Robert Schapire, M&#8230;\"    \"0.90909\"\r\n    \"Tom&Atilde;&iexcl;s Lozano-P&Atilde;&copy;rez\"    \"Tom&Atilde;&iexcl;s Lozano-P&Atilde;&copy;r&#8230;\"    \"1\"      \r\n    \"Dit-Yan Yeung\"           \"Dit Yan Yeung, HKUST\"    \"0.96154\"\r\n    \"Geoffrey J. Gordon\"      \"Geoff Gordon, CMU\"       \"0.8\"    \r\n    \"Brendan J. Frey\"         \"Brendan Frey, U. T&#8230;\"    \"0.88889\"\r\n<\/pre><h4>Updating Missing Values with Fuzzy Match Result<a name=\"db4febce-1ce0-4831-8009-ab3b140f57bc\"><\/a><\/h4><p>Now we can apply this approach to update the missing values. Now 89.3% of missing values are identified!<\/p><pre class=\"codeinput\"><span class=\"keyword\">for<\/span> ii = 1:length(missing_ids)                                  <span class=\"comment\">% for each missing id<\/span>\r\n    cid = missing_ids(ii);                                      <span class=\"comment\">% current id<\/span>\r\n    fhandle = @(s) levenshtein(authors(cid), s, <span class=\"string\">'ratio'<\/span>, <span class=\"string\">'ignoreCase'<\/span>);\r\n    ratios = arrayfun(fhandle, names);                          <span class=\"comment\">% get match rates<\/span>\r\n    [~,idx] = max(ratios);                                      <span class=\"comment\">% get index of max value<\/span>\r\n    <span class=\"keyword\">if<\/span> ratios(idx) &gt;= 0.8                                       <span class=\"comment\">% if max is 0.8<\/span>\r\n        res = extractAfter(coauth(idx),<span class=\"string\">','<\/span>);                    <span class=\"comment\">% get org name<\/span>\r\n        res = strtrim(res);                                     <span class=\"comment\">% trim white spaces<\/span>\r\n        <span class=\"keyword\">if<\/span> strlength(res) == 0                                  <span class=\"comment\">% if null string<\/span>\r\n            org(cid) = <span class=\"string\">'UNK'<\/span>;                                   <span 
class=\"comment\">% unknown<\/span>\r\n        <span class=\"keyword\">else<\/span>                                                    <span class=\"comment\">% otherwise<\/span>\r\n            org(cid) = res;                                     <span class=\"comment\">% update org<\/span>\r\n        <span class=\"keyword\">end<\/span>\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nfprintf(<span class=\"string\">'Missing reduced by %.1f%%\\n'<\/span>, (1 - sum(ismissing(org))\/length(missing_ids))*100)\r\n<\/pre><pre class=\"codeoutput\">Missing reduced by 89.3%\r\n<\/pre><h4>Reviewing Unmatched<a name=\"fd6f356a-6412-4ea7-b3c9-fad0c4412864\"><\/a><\/h4><p>When you review what was not matched, you can see that they were no good matches except for \"John W. Fisher III\", \"Oluwasanmi O. Koyejo\", or \"Danielo Jimenz Rezende\". They have much lower match rate and didn't make the cut, but we don't necessarily ease the cut-off. Is there an alternative?<\/p><pre class=\"codeinput\">missing_ids = find(ismissing(org));                             <span class=\"comment\">% get missing ids<\/span>\r\nmatches = strings(1,3);                                         <span class=\"comment\">% initialize accumulator<\/span>\r\n<span class=\"keyword\">for<\/span> ii = 1:10                                                   <span class=\"comment\">% for each missing id thru 10th<\/span>\r\n    cid = missing_ids(ii);                                      <span class=\"comment\">% current id<\/span>\r\n    fhandle = @(s) levenshtein(authors(cid), s, <span class=\"string\">'ratio'<\/span>, <span class=\"string\">'ignoreCase'<\/span>);\r\n    ratios = arrayfun(fhandle, names);                          <span class=\"comment\">% get match rates<\/span>\r\n    [~,idx] = max(ratios);                                      <span class=\"comment\">% get index of max value<\/span>\r\n    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)];     <span 
class=\"comment\">% update accumulator<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nmatches\r\n<\/pre><pre class=\"codeoutput\">matches = \r\n  10&times;3 string array\r\n    \"Avrim Blum\"              \"Manuel Blum, Unive&#8230;\"    \"0.7619\" \r\n    \"John W. Fisher III\"      \"John Fisher, MIT\"        \"0.75862\"\r\n    \"&Atilde;&#8211;zg&Atilde;&frac14;r &Aring;&#382;im&Aring;&#376;ek\"        \"Ozgur Simsek, Max &#8230;\"    \"0.71429\"\r\n    \"Joseph J. Lim\"           \"Joseph Salmon, Tel&#8230;\"    \"0.76923\"\r\n    \"Pascal Fua\"              \"Pascal Vincent, U.&#8230;\"    \"0.70833\"\r\n    \"Fran&Atilde;&sect;ois Fleuret\"       \"Matthias Feurer, U&#8230;\"    \"0.71875\"\r\n    \"Oluwasanmi O. Koyejo\"    \"Sanmi Koyejo, Stan&#8230;\"    \"0.75\"   \r\n    \"Samet Oymak\"             \"James Kwok, Hong K&#8230;\"    \"0.71429\"\r\n    \"Danilo Jimenez Rez&#8230;\"    \"Danilo Rezende, Go&#8230;\"    \"0.77778\"\r\n    \"Kafui Dzirasa\"           \"Rahul Krishnan, Ne&#8230;\"    \"0.66667\"\r\n<\/pre><h4>Substring Match<a name=\"52bfa40f-7b93-4c49-ba57-605d65d73933\"><\/a><\/h4><p>Instead of trying to match the whole string,  we can reorder and find the most similar substring with the longer string based on the shorter string using <tt>ignore order<\/tt> and <tt>partial<\/tt> options:<\/p><div><ul><li><tt>ignoreOrder<\/tt> breaks strings into tokens, reorder tokens by finding union and intersection of tokens, and compute edit distance.<\/li><li><tt>partial<\/tt> finds substring within longer string that is closest to the shorter string.<\/li><\/ul><\/div><p>This clearly gives higher match rates for in some specific cases. \"John W. 
Fisher III\" can be reordered to \"John Fisher III W.\" and the first two tokens (a substring) matches \"John Fisher\", which makes it a 100% match.<\/p><pre class=\"codeinput\">matches = strings(1,3);                                         <span class=\"comment\">% initialize accumulator<\/span>\r\n<span class=\"keyword\">for<\/span> ii = 1:10                                                   <span class=\"comment\">% for each missing id thru 10th<\/span>\r\n    cid = missing_ids(ii);                                      <span class=\"comment\">% current id<\/span>\r\n    fhandle = @(s) levenshtein(authors(cid), s, <span class=\"string\">'ratio'<\/span>, <span class=\"string\">'ingoreCase'<\/span>, <span class=\"string\">'ignoreOrder'<\/span>, <span class=\"string\">'partial'<\/span>);\r\n    ratios = arrayfun(fhandle, names);                          <span class=\"comment\">% get match rates<\/span>\r\n    [~,idx] = max(ratios);                                      <span class=\"comment\">% get index of max value<\/span>\r\n    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)];     <span class=\"comment\">% update accumulator<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nmatches\r\n<\/pre><pre class=\"codeoutput\">matches = \r\n  10&times;3 string array\r\n    \"Avrim Blum\"              \"Manuel Blum, Unive&#8230;\"    \"0.75\"   \r\n    \"John W. Fisher III\"      \"John Fisher, MIT\"        \"1\"      \r\n    \"&Atilde;&#8211;zg&Atilde;&frac14;r &Aring;&#382;im&Aring;&#376;ek\"        \"Ozgur Simsek, Max &#8230;\"    \"0.66667\"\r\n    \"Joseph J. Lim\"           \"Joseph Wang,\"            \"0.81818\"\r\n    \"Pascal Fua\"              \"Pascal Vincent, U.&#8230;\"    \"0.85\"   \r\n    \"Fran&Atilde;&sect;ois Fleuret\"       \"Tian Tian, Tsinghu&#8230;\"    \"0.75\"   \r\n    \"Oluwasanmi O. 
Koyejo\"    \"Sanmi Koyejo, Stan&#8230;\"    \"0.79167\"\r\n    \"Samet Oymak\"             \"James Hensman, The&#8230;\"    \"0.77273\"\r\n    \"Danilo Jimenez Rez&#8230;\"    \"Danilo Rezende, Go&#8230;\"    \"1\"      \r\n    \"Kafui Dzirasa\"           \"Kai Fan, Duke Univ&#8230;\"    \"0.71429\"\r\n<\/pre><h4>Updating Missing Values with Substring Match Result<a name=\"aae78a85-d666-4721-b297-32c70fa2b96c\"><\/a><\/h4><p>Substring match is a tricky choice - it can produce more false positives.So let's use match rate of 0.9 or above for cut-off. With this approach, the missing values were further reduced by 35.5%.<\/p><pre class=\"codeinput\"><span class=\"keyword\">for<\/span> ii = 1:length(missing_ids)                                  <span class=\"comment\">% for each missing id<\/span>\r\n    cid = missing_ids(ii);                                      <span class=\"comment\">% current id<\/span>\r\n    fhandle = @(s) levenshtein(authors(cid), s, <span class=\"string\">'ratio'<\/span>, <span class=\"string\">'ingoreCase'<\/span>, <span class=\"string\">'ignoreOrder'<\/span>, <span class=\"string\">'partial'<\/span>);\r\n    ratios = arrayfun(fhandle, names);                          <span class=\"comment\">% get match rates<\/span>\r\n    [~,idx] = max(ratios);                                      <span class=\"comment\">% get index of max value<\/span>\r\n    <span class=\"keyword\">if<\/span> ratios(idx) &gt;= 0.9                                       <span class=\"comment\">% if max is 0.9<\/span>\r\n        res = extractAfter(coauth(idx),<span class=\"string\">','<\/span>);                    <span class=\"comment\">% get org name<\/span>\r\n        res = strtrim(res);                                     <span class=\"comment\">% trim white spaces<\/span>\r\n        <span class=\"keyword\">if<\/span> strlength(res) == 0                                  <span class=\"comment\">% if null string<\/span>\r\n            org(cid) = <span 
class=\"string\">'UNK'<\/span>;                                   <span class=\"comment\">% unknown<\/span>\r\n        <span class=\"keyword\">else<\/span>                                                    <span class=\"comment\">% otherwise<\/span>\r\n            org(cid) = res;                                     <span class=\"comment\">% update org<\/span>\r\n        <span class=\"keyword\">end<\/span>\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nfprintf(<span class=\"string\">'Missing reduced by %.1f%%\\n'<\/span>, (1 - sum(ismissing(org))\/length(missing_ids))*100)\r\n<\/pre><pre class=\"codeoutput\">Missing reduced by 35.5%\r\n<\/pre><h4>Visualizing Paper Author Affiliation<a name=\"dae88173-8579-47c2-bfa3-52d988c2d4ab\"><\/a><\/h4><p>Now we can use the matched data to visualize the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/graph-and-network-algorithms.html\">graph<\/a> formed by Paper Author Affiliation. The <tt>PaperAuthor<\/tt> table can be seen as an edge list where papers and authors represent nodes of a graph and each row represents an edge. 
We just need to replace authors with their affiliation so that each node represents a paper or the affiliation of its authors, and edges represent the authorship of the paper.<\/p><div><ul><li>A paper has multiple incoming edges if co-authored by scholars from multiple organizations,<\/li><li>and those papers are also connected to other nodes if their co-authors also contribute to other papers.<\/li><li>Papers authored within the same organization will be isolated, and they are removed from the graph to simplify the visualization.<\/li><li>The resulting graph represents the largest connected component of the overall graph.<\/li><li>Node size and color reflect outdegree - how many papers are contributed by the members of a given organization.<\/li><\/ul><\/div><p>This visualization shows that we have more work to do - for example, \"University of Texas Austin\" and \"UT Austin\" are the same organization but we have two separate nodes because organization names are not standardized, and that means some organizations are probably under-represented. 
Can you improve this visualization?<\/p><pre class=\"codeinput\">Af = table(Authors.ID,org, <span class=\"string\">'VariableNames'<\/span>,{<span class=\"string\">'AuthorID'<\/span>,<span class=\"string\">'Org'<\/span>}); <span class=\"comment\">% affiliation<\/span>\r\nAf(Af.Org == <span class=\"string\">'UNK'<\/span> | ismissing(Af.Org),:) = [];                 <span class=\"comment\">% remove UNK &amp; missing<\/span>\r\nT = innerjoin(PaperAuthors(:,2:3), Af);                         <span class=\"comment\">% join with PaperAuthors<\/span>\r\nT = innerjoin(T,Papers(:,[1,2]), <span class=\"string\">'Keys'<\/span>, 1);                    <span class=\"comment\">% join with Papers<\/span>\r\n[T,~,idx] = unique(T(:,[3,4]),<span class=\"string\">'rows'<\/span>);                          <span class=\"comment\">% remove duplicate rows<\/span>\r\nw = accumarray(idx,1);                                          <span class=\"comment\">% count duplicates<\/span>\r\ns = cellstr(T.Org);                                             <span class=\"comment\">% convert to cellstr<\/span>\r\nt = cellstr(T.Title);                                           <span class=\"comment\">% convert to cellstr<\/span>\r\nG = digraph(s,t, w);                                            <span class=\"comment\">% create directed graph<\/span>\r\nbins = conncomp(G,<span class=\"string\">'type'<\/span>, <span class=\"string\">'weak'<\/span>, <span class=\"string\">'OutputForm'<\/span>, <span class=\"string\">'cell'<\/span>);        <span class=\"comment\">% get connected comps<\/span>\r\n[~, idx] = max(cellfun(@length, bins));                         <span class=\"comment\">% find largest comp<\/span>\r\nG = subgraph(G, bins{idx});                                     <span class=\"comment\">% subgraph largest comp<\/span>\r\nfigure                                                          <span class=\"comment\">% new figure<\/span>\r\ncolormap <span class=\"string\">cool<\/span>                      
                             <span class=\"comment\">% use cool colormap<\/span>\r\nmsize = 10*(outdegree(G) + 3).\/max(outdegree(G));               <span class=\"comment\">% marker size<\/span>\r\nncol = outdegree(G) + 3;                                        <span class=\"comment\">% node colors<\/span>\r\nnamed = outdegree(G) &gt; 7;                                       <span class=\"comment\">% nodes to label<\/span>\r\nh = plot(G, <span class=\"string\">'MarkerSize'<\/span>, msize, <span class=\"string\">'NodeCData'<\/span>, ncol);            <span class=\"comment\">% plot graph<\/span>\r\nlayout(h,<span class=\"string\">'force3'<\/span>,<span class=\"string\">'Iterations'<\/span>,30)                              <span class=\"comment\">% change layout<\/span>\r\nlabelnode(h, find(named), G.Nodes.Name(named));                 <span class=\"comment\">% add node labels<\/span>\r\ntitle(<span class=\"string\">'NIPS 2015 Papers - Author Affiliation Graph'<\/span>)             <span class=\"comment\">% add title<\/span>\r\naxis <span class=\"string\">tight<\/span> <span class=\"string\">off<\/span>                                                  <span class=\"comment\">% set axis<\/span>\r\nset(gca,<span class=\"string\">'clipping'<\/span>,<span class=\"string\">'off'<\/span>)                                       <span class=\"comment\">% turn off clipping<\/span>\r\nzoom(1.3)                                                       <span class=\"comment\">% zoom in<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/webscraping_01.png\" alt=\"\"> <h4>Summary<a name=\"8b68f342-5771-48a0-82ea-faed60b581e7\"><\/a><\/h4><p>Web scraping can be a very useful skill for collecting information from the web, and MATLAB makes it very easy to extract information from a web page. 
The resulting data is often unstructured, but you can deal with it using techniques like fuzzy string matching.<\/p><p>I hope this example gives you a lot of new ideas. Give it a try and let us know how your experience goes <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=2346#respond\">here<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_54315779938c4f2cb1e62e6d02ecb9d9() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='54315779938c4f2cb1e62e6d02ecb9d9 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 54315779938c4f2cb1e62e6d02ecb9d9';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_54315779938c4f2cb1e62e6d02ecb9d9()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\n54315779938c4f2cb1e62e6d02ecb9d9 ##### SOURCE BEGIN #####\r\n%% Web Scraping and Mining Unstructured Data with MATLAB\r\n% A lot of information is shared on the web and a lot of people are interested \r\n% in taking advantage of it. It can be used to enrich the existing data, for example. \r\n% However, information is buried in HTML tags and it is not easy to extract useful \r\n% information. Today's guest blogger, <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521 \r\n% Toshi Takeuchi> shows us how he uses MATLAB for <https:\/\/en.wikipedia.org\/wiki\/Web_scraping \r\n% web scraping> to harvest useful data from the web and then uses fuzzy string \r\n% match to enrich existing data. \r\n% \r\n% <<nips2015.gif>>\r\n% \r\n%% NIPS 2015 Papers\r\n% Web scraping is actually pretty easy with MATLAB thanks to new string functions \r\n% introduced in R2016b. 
\r\n% \r\n% I am going to use as an example the same data used in <https:\/\/blogs.mathworks.com\/loren\/2016\/08\/08\/text-mining-machine-learning-research-papers-with-matlab\/ \r\n% Text Mining Machine Learning Research Papers with MATLAB>. \r\n% \r\n% If you would like to follow along, please download\r\n%\r\n% * the source of this post by clicking on \"Get the MATLAB code\" at the\r\n% bottom of this page\r\n% * the data from Kaggle's\r\n% <https:\/\/www.kaggle.com\/benhamner\/nips-2015-papers NIPS 2015 Papers> page\r\n% * my custom function\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2017\/levenshtein.m\r\n% levenshtein.m>\r\n% * my custom script to generate GIF animation\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2017\/animateNIPS2015.m\r\n% animateNIPS2015.m>\r\n% \r\n% Here I am using |<https:\/\/www.mathworks.com\/help\/database\/ug\/sqlite.html \r\n% sqlite>| in <https:\/\/www.mathworks.com\/products\/database\/ Database Toolbox> to \r\n% load data from an SQLite file. If you don't have Database Toolbox, you can try \r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html readtable>| to read \r\n% CSV files. The |Authors| table only lists names, but I want to enrich it with authors' \r\n% affiliation to see which organizations are active in this academic conference. 
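\r\n% \r\n% With |readtable|, the fallback would look something like this (the CSV file \r\n% names here are my assumption about the Kaggle download, and |readtable| returns \r\n% tables directly, so no |cell2table| conversion is needed):\r\n%\r\n%   Authors = readtable('output\/Authors.csv');\r\n%   Papers = readtable('output\/Papers.csv');\r\n%   PaperAuthors = readtable('output\/PaperAuthors.csv');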
\r\n%%\r\ndb = 'output\/database.sqlite';                                  % database file\r\nconn = sqlite(db,'readonly');                                   % create connection\r\nAuthors = fetch(conn,'SELECT * FROM Authors');                  % get data with SQL command\r\nPapers = fetch(conn,'SELECT * FROM Papers');                    % get data with SQL command\r\nPaperAuthors = fetch(conn,'SELECT * FROM PaperAuthors');        % get data with SQL command\r\nclose(conn)                                                     % close connection\r\nAuthors = cell2table(Authors,'VariableNames',{'ID','Name'});    % convert to table\r\nPapers = cell2table(Papers,'VariableNames', ...                 % convert to table\r\n    {'ID','Title','EventType','PdfName','Abstract','PaperText'});\r\nPaperAuthors = cell2table(PaperAuthors,'VariableNames', ...     % convert to table\r\n    {'ID','PaperID','AuthorID'});\r\nhead(Authors)\r\n%% \r\n% Luckily, there is an HTML file that lists each paper with its authors \r\n% and their affiliation. Each list item starts with |&lt;i&gt;&lt;span class=\"larger-font\"&gt;| \r\n% and ends with |&lt;\/b&gt;&lt;br&gt;&lt;br&gt;|, titles and authors are separated by |&lt;\/span&gt;&lt;\/i&gt;&lt;br&gt;&lt;b&gt;|, \r\n% multiple co-authors by semicolons, and finally the names and affiliation by a comma.\r\n%%\r\ndbtype output\/accepted_papers.html 397:400\r\n%% \r\n% Since I have the HTML file locally, I use <https:\/\/www.mathworks.com\/help\/matlab\/ref\/fileread.html \r\n% fileread> to load text from it. If you want to scrape a web page directly, you \r\n% would use <https:\/\/www.mathworks.com\/help\/matlab\/ref\/webread.html webread> instead. 
\r\n% Imported text is converted into <https:\/\/www.mathworks.com\/help\/matlab\/ref\/string.html \r\n% string> to take advantage of built-in string functions.\r\n\r\nhtml = string(fileread('output\/accepted_papers.html'));         % load text from file\r\n% html = string(webread('https:\/\/nips.cc\/Conferences\/2015\/AcceptedPapers'));\r\n%% Scraping Data from a Web Page\r\n% Usually scraping data from a web page or other unstructured text data sources \r\n% requires <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html \r\n% regular expressions> and many people find them powerful but very difficult to \r\n% use. String functions in MATLAB like |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractbetween.html \r\n% extractBetween>|, |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractbefore.html \r\n% extractBefore>|, |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractafter.html \r\n% extractAfter>|, |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/erase.html erase>|, \r\n% and |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/replace.html replace>| make \r\n% it ridiculously simple!\r\n%%\r\npattern1 = '<div><h3>NIPS 2015 Accepted Papers<\/h3><p><br><\/p>';% start of list\r\npattern2 = '<\/div>  <!REPLACE_WITH_DASH_DASHdiv class=\"col-xs-12 col-sm-9\"REPLACE_WITH_DASH_DASH>';     % end of list   \r\nlist = extractBetween(html, pattern1, pattern2);                % extract list\r\npattern1 = '<i><span class=\"larger-font\">';                     % start of list item\r\npattern2 = '<\/b><br><br>';                                      % end of list item\r\nlistitems = extractBetween(list, pattern1, pattern2);           % extract list items\r\npattern1 = ['<\/span><\/i><br><b>' newline];                      % end of title\r\ntitles = extractBefore(listitems,pattern1);                     % extract titles\r\nnamesorgs = extractAfter(listitems,pattern1);                   % extract names orgs\r\nnamesorgs = 
erase(namesorgs,'*');                               % erase *\r\nnamesorgs = erase(namesorgs,'\"');                               % erase \"\r\nnamesorgs = replace(namesorgs,'  ', ' ');                       % remove double space\r\ndisp([titles(1:2) namesorgs(1:2)])\r\n%% \r\n% Since multiple co-authors are still contained in a single string, let's \r\n% separate them into a list of co-authors and their affiliation. When you split \r\n% a string, you get a varying number of substrings depending on the row. So we \r\n% need to use |arrayfun| with the |UniformOutput| option set to |false| to split \r\n% it row by row, trim the result with |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/strtrim.html \r\n% strtrim>|, and unnest the cell array with |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/vertcat.html \r\n% vertcat>| to get the list of coauthors with their affiliation. \r\n%%\r\nnamesorgs = replace(namesorgs,'&amp;','&');                     % revert escaped &\r\nnamesorgs = erase(namesorgs,[char(194) 'Â <\/b><span><strong>']); % remove extra tags\r\nnamesorgs = replace(namesorgs,['<\/strong><\/span><b>' ...        % replace missing semicolon\r\n    char(194)],';');    \r\ncoauth = arrayfun(@(x) strtrim(split(x,';')), namesorgs, ...    % split by semicolon\r\n    'UniformOutput', false);                                    % and trim white space\r\ncoauth = vertcat(coauth{:});                                    % unnest cell array\r\ncoauth(1:5)\r\n%% Matching Scraped Data to Database Table\r\n% You now see how easy web scraping is with MATLAB. \r\n% \r\n% Now that we have the list of names with affiliation, we just have to match \r\n% it to the |Authors| table by name, right? Unfortunately, we see a lot of missing \r\n% values because the names didn't match even with the |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/contains.html \r\n% contains>| partial match function.\r\n% \r\n% The real hard part now is what to do after you scraped the data from the \r\n% web. 
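\r\n% \r\n% Keep in mind that |contains| is a one-way substring check: it only reports \r\n% whether the second string appears inside the first. For example, a scraped \r\n% name that drops the middle initial cannot contain the database name:\r\n%\r\n%   contains('Jonathan Cohen', 'Jonathan D. Cohen')   % returns logical 0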
\r\n%%\r\nauthors = Authors.Name;                                         % author names\r\nnames = strtrim(extractBefore(coauth,','));                     % extract and trim names\r\norg = strings(length(authors),1);                               % initialize accumulator\r\nfor ii = 1:length(authors)                                      % for each name in |authors|\r\n    res = coauth(contains(names,authors(ii)));                  % find match in |names|\r\n    if isempty(res)                                             % if no match\r\n        org(ii) = missing;                                      % mark it missing\r\n    end\r\n    res = extractAfter(res,',');                                % extract after comma\r\n    res = strtrim(res);                                         % remove white space\r\n    res = unique(res);                                          % remove duplicates\r\n    res(strlength(res) == 0) = [];                              % remove empty strings\r\n    if length(res) == 1                                         % if single string\r\n        org(ii) = res;                                          % use it as is\r\n    elseif length(res) > 1                                      % if multiple strings\r\n        org(ii) = join(res,';');                                % join them with semicolon\r\n    else                                                        % otherwise\r\n        org(ii) = missing;                                      % mark it missing\r\n    end\r\nend\r\nhead(table(authors, org, 'VariableNames',{'Name','Org'}))\r\n%% \r\n% For example, the partial match doesn't work if the middle initial is missing \r\n% or nicknames are used instead of full names. There can be other irregularities. 
\r\n% Yikes, we are dealing with <https:\/\/en.wikipedia.org\/wiki\/Unstructured_data \r\n% unstructured data>!\r\n%%\r\n[authors(4) coauth(contains(names,'Jonathan Cohen'));\r\n    authors(8) coauth(contains(names,'Robert Williamson'));\r\n    authors(406) coauth(contains(names,'Sanmi Koyejo'));\r\n    authors(440) coauth(contains(names,'Danilo Rezende'));\r\n    authors(743) coauth(contains(names,'Bill Dally'));\r\n    authors(769) coauth(contains(names,'Julian Yarkony'))]\r\n%% Fuzzy String Matching\r\n% What can we do when the exact match approach doesn't work? Maybe we can come\r\n% up with various rules to match strings using regular expressions, but\r\n% that is very time-consuming. Let's revisit the\r\n% <https:\/\/blogs.mathworks.com\/loren\/2015\/10\/14\/40-year-old-algorithm-that-cannot-be-improved\/\r\n% 40-year-old Algorithm That Cannot Be Improved> to solve this problem. I\r\n% created a new custom function |levenshtein| for this example. It\r\n% calculates the edit distance that measures the minimum number of edit\r\n% operations required to transform one string into another, as a way to\r\n% quantify how similar or dissimilar they are. For more details of this\r\n% algorithm, please check out the blog post linked above.\r\n% \r\n% Converting Sunday to Saturday requires 3 edit operations.\r\n%%\r\nlevenshtein('sunday', 'saturday')\r\n%% \r\n% Perhaps it is easier to understand if I show how similar they are as match \r\n% rate rather than number of edit operations?\r\n\r\nlevenshtein('sunday', 'saturday', 'ratio')\r\n%% \r\n% Now we can find \"Jonathan Cohen\" in the top 3 matches for \"Jonathan D. 
\r\n% Cohen\".\r\n\r\nfhandle = @(s) levenshtein(authors(4), s, 'ratio');             % function handle\r\nratios = arrayfun(fhandle, extractBefore(coauth,','));          % get match rates\r\n[~,idx] = sort(ratios,'descend');                               % rank by match rate\r\n[repmat(authors(4),[3,1]) coauth(idx(1:3)) ratios(idx(1:3))]\r\n%% Validating Fuzzy Match Results\r\n% Let's try this approach for the first 10 missing names with the |ignoreCase| option \r\n% enabled. It looks like we can be fairly confident about the result as long as \r\n% the maximum match rate is 0.8 or higher. \r\n%%\r\norg(org == 'Dr.') = missing;                                    % remove salutation\r\nmissing_ids = find(ismissing(org));                             % get missing ids\r\nmatches = strings(1,3);                                         % initialize accumulator\r\nfor ii = 1:10                                                   % for each missing id\r\n    cid = missing_ids(ii);                                      % current id\r\n    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase');\r\n    ratios = arrayfun(fhandle, names);                          % get match rates \r\n    [~,idx] = max(ratios);                                      % get index of max value\r\n    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)];     % update accumulator\r\nend\r\nmatches\r\n%% Updating Missing Values with Fuzzy Match Result\r\n% Now we can apply this approach to update the missing values. 89.3% of the \r\n% missing values are now identified! 
\r\n%%\r\nfor ii = 1:length(missing_ids)                                  % for each missing id\r\n    cid = missing_ids(ii);                                      % current id \r\n    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase');\r\n    ratios = arrayfun(fhandle, names);                          % get match rates\r\n    [~,idx] = max(ratios);                                      % get index of max value\r\n    if ratios(idx) >= 0.8                                       % if max is >= 0.8\r\n        res = extractAfter(coauth(idx),',');                    % get org name\r\n        res = strtrim(res);                                     % trim white spaces\r\n        if strlength(res) == 0                                  % if null string\r\n            org(cid) = 'UNK';                                   % unknown\r\n        else                                                    % otherwise\r\n            org(cid) = res;                                     % update org\r\n        end\r\n    end\r\nend\r\nfprintf('Missing reduced by %.1f%%\\n', (1 - sum(ismissing(org))\/length(missing_ids))*100)\r\n%% Reviewing Unmatched\r\n% When you review what was not matched, you can see that there were no good matches \r\n% except for \"John W. Fisher III\", \"Oluwasanmi O. Koyejo\", or \"Danilo Jimenez \r\n% Rezende\". They have much lower match rates and didn't make the cut, but we don't \r\n% necessarily want to lower the cut-off. 
Is there an alternative?\r\n%%\r\nmissing_ids = find(ismissing(org));                             % get missing ids\r\nmatches = strings(1,3);                                         % initialize accumulator\r\nfor ii = 1:10                                                   % for each missing id thru 10th\r\n    cid = missing_ids(ii);                                      % current id\r\n    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase');\r\n    ratios = arrayfun(fhandle, names);                          % get match rates \r\n    [~,idx] = max(ratios);                                      % get index of max value\r\n    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)];     % update accumulator\r\nend\r\nmatches\r\n%% Substring Match\r\n% Instead of trying to match the whole string, we can reorder tokens and find the \r\n% most similar substring of the longer string relative to the shorter string using \r\n% the |ignoreOrder| and |partial| options:\r\n% \r\n% * |ignoreOrder| breaks strings into tokens, reorders tokens by finding the union \r\n% and intersection of tokens, and computes the edit distance. \r\n% * |partial| finds the substring within the longer string that is closest to the shorter \r\n% string.\r\n% \r\n% This clearly gives higher match rates in some specific cases. \"John \r\n% W. Fisher III\" can be reordered to \"John Fisher III W.\" and the first two tokens \r\n% (a substring) match \"John Fisher\", which makes it a 100% match. 
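\r\n% \r\n% We can check this directly with the same custom function (the scraped name \r\n% before the comma is \"John Fisher\"):\r\n%\r\n%   levenshtein('John W. Fisher III', 'John Fisher', 'ratio', 'ignoreCase', 'ignoreOrder', 'partial')   % returns 1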
\r\n%%\r\nmatches = strings(1,3);                                         % initialize accumulator\r\nfor ii = 1:10                                                   % for each missing id thru 10th\r\n    cid = missing_ids(ii);                                      % current id\r\n    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase', 'ignoreOrder', 'partial');\r\n    ratios = arrayfun(fhandle, names);                          % get match rates \r\n    [~,idx] = max(ratios);                                      % get index of max value\r\n    matches(ii,:) = [authors(cid) coauth(idx) ratios(idx)];     % update accumulator\r\nend\r\nmatches\r\n%% Updating Missing Values with Substring Match Result\r\n% Substring match is a tricky choice - it can produce more false positives, so \r\n% let's use a match rate of 0.9 or above as the cut-off. With this approach, the missing \r\n% values were further reduced by 35.5%. \r\n%%\r\nfor ii = 1:length(missing_ids)                                  % for each missing id\r\n    cid = missing_ids(ii);                                      % current id \r\n    fhandle = @(s) levenshtein(authors(cid), s, 'ratio', 'ignoreCase', 'ignoreOrder', 'partial');\r\n    ratios = arrayfun(fhandle, names);                          % get match rates\r\n    [~,idx] = max(ratios);                                      % get index of max value\r\n    if ratios(idx) >= 0.9                                       % if max is >= 0.9\r\n        res = extractAfter(coauth(idx),',');                    % get org name\r\n        res = strtrim(res);                                     % trim white spaces\r\n        if strlength(res) == 0                                  % if null string\r\n            org(cid) = 'UNK';                                   % unknown\r\n        else                                                    % otherwise\r\n            org(cid) = res;                                     % update org\r\n        end\r\n    
end\r\nend\r\nfprintf('Missing reduced by %.1f%%\\n', (1 - sum(ismissing(org))\/length(missing_ids))*100)\r\n%% Visualizing Paper Author Affiliation\r\n% Now we can use the matched data to visualize the <https:\/\/www.mathworks.com\/help\/matlab\/graph-and-network-algorithms.html \r\n% graph> formed by Paper Author Affiliation. The |PaperAuthor| table can be seen as \r\n% an edge list where papers and authors represent nodes of a graph and each row \r\n% represents an edge. We just need to replace authors with their affiliation so \r\n% that each node represents a paper or the affiliation of its authors, and edges \r\n% represent the authorship of the paper. \r\n% \r\n% * A paper has multiple incoming edges if co-authored by scholars from multiple \r\n% organizations, \r\n% * and those papers are also connected to other nodes if their co-authors also \r\n% contribute to other papers. \r\n% * Papers authored within the same organization will be isolated, and they are \r\n% removed from the graph to simplify the visualization.\r\n% * The resulting graph represents the largest connected component of the overall \r\n% graph. \r\n% * Node size and color reflect outdegree - how many papers are contributed \r\n% by the members of a given organization.\r\n% \r\n% This visualization shows that we have more work to do - for example, \"University \r\n% of Texas Austin\" and \"UT Austin\" are the same organization but we have two separate \r\n% nodes because organization names are not standardized, and that means some organizations \r\n% are probably under-represented. Can you improve this visualization? 
\r\n%%\r\nAf = table(Authors.ID,org, 'VariableNames',{'AuthorID','Org'}); % affiliation\r\nAf(Af.Org == 'UNK' | ismissing(Af.Org),:) = [];                 % remove UNK & missing\r\nT = innerjoin(PaperAuthors(:,2:3), Af);                         % join with PaperAuthors\r\nT = innerjoin(T,Papers(:,[1,2]), 'Keys', 1);                    % join with Papers\r\n[T,~,idx] = unique(T(:,[3,4]),'rows');                          % remove duplicate rows\r\nw = accumarray(idx,1);                                          % count duplicates\r\ns = cellstr(T.Org);                                             % convert to cellstr\r\nt = cellstr(T.Title);                                           % convert to cellstr\r\nG = digraph(s,t, w);                                            % create directed graph\r\nbins = conncomp(G,'type', 'weak', 'OutputForm', 'cell');        % get connected comps\r\n[~, idx] = max(cellfun(@length, bins));                         % find largest comp\r\nG = subgraph(G, bins{idx});                                     % subgraph largest comp\r\nfigure                                                          % new figure\r\ncolormap cool                                                   % use cool colormap\r\nmsize = 10*(outdegree(G) + 3).\/max(outdegree(G));               % marker size\r\nncol = outdegree(G) + 3;                                        % node colors\r\nnamed = outdegree(G) > 7;                                       % nodes to label\r\nh = plot(G, 'MarkerSize', msize, 'NodeCData', ncol);            % plot graph\r\nlayout(h,'force3','Iterations',30)                              % change layout\r\nlabelnode(h, find(named), G.Nodes.Name(named));                 % add node labels\r\ntitle('NIPS 2015 Papers - Author Affiliation Graph')             % add title\r\naxis tight off                                                  % set axis\r\nset(gca,'clipping','off')                                       % turn off clipping\r\nzoom(1.3)               
                                         % zoom in\r\n%% Summary\r\n% Web scraping can be a very useful skill for collecting information from \r\n% the web, and MATLAB makes it very easy to extract information from a web page. \r\n% The resulting data is often unstructured, but you can deal with it using techniques \r\n% like fuzzy string matching. \r\n% \r\n% I hope this example gives you a lot of new ideas. Give it a try and let \r\n% us know how your experience goes <https:\/\/blogs.mathworks.com\/loren\/?p=2346#respond \r\n% here>!\r\n##### SOURCE END ##### 54315779938c4f2cb1e62e6d02ecb9d9\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/webscraping_01.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>A lot of information is shared on the web and a lot of people are interested in taking advantage of it. It can be used to enrich the existing data, for example. However, information is buried in HTML tags and it is not easy to extract useful information. Today's guest blogger, <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521\">Toshi Takeuchi<\/a> shows us how he uses MATLAB for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Web_scraping\">web scraping<\/a> to harvest useful data from the web and then uses fuzzy string match to enrich existing data.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2017\/07\/10\/web-scraping-and-mining-unstructured-data-with-matlab\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[61,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2346"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=2346"}],"version-history":[{"count":3,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2346\/revisions"}],"predecessor-version":[{"id":2383,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2346\/revisions\/2383"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=2346"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=2346"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=2346"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}