{"id":4721,"date":"2020-10-23T16:47:44","date_gmt":"2020-10-23T20:47:44","guid":{"rendered":"https:\/\/blogs.mathworks.com\/videos\/?p=4721"},"modified":"2020-10-23T21:28:20","modified_gmt":"2020-10-24T01:28:20","slug":"scraping-links-from-a-set-of-matlab-documentation-pages","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/videos\/2020\/10\/23\/scraping-links-from-a-set-of-matlab-documentation-pages\/","title":{"rendered":"Scraping Links from a Set of MATLAB Documentation Pages"},"content":{"rendered":"<p>My colleague Sam asked if I could help him try and understand how a set of <a href=\"https:\/\/www.mathworks.com\/help\/matlab-parallel-server\/\">documentation pages<\/a> were linked together and perhaps visualize them as a graph.<\/p>\n<p>Now, I do have a good idea of all the pages on our website and all the links between them but he is only interested in the links in the body of the page, i.e. not the links in the menu or footer of the page or in the navigation on the left. So I need to find a way to extract just those links. 
After using <tt>webread<\/tt> to read the content, I think I can use a combination of functions from the <a href=\"https:\/\/www.mathworks.com\/products\/text-analytics.html\">Text Analytics Toolbox<\/a> to process the HTML tags.<\/p>\n<p>Features covered in this <a href=\"https:\/\/blogs.mathworks.com\/videos\/2015\/10\/29\/matlab-code-along-videos\/\">code-along<\/a> style video include:<\/p>\n<ul>\n<li>Text Analytics Toolbox: <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/htmltree.findelement.html\"><tt>findElement<\/tt><\/a>, <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/htmltree.getattribute.html\"><tt>getAttribute<\/tt><\/a>, and <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/htmltree.extracthtmltext.html\"><tt>extractHTMLText<\/tt><\/a><\/li>\n<li>Misc: <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/webread.html\"><tt>webread<\/tt><\/a>, <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexprep.html\"><tt>regexprep<\/tt><\/a><\/li>\n<\/ul>\n<p><div class=\"row\"><div class=\"col-xs-12 containing-block\"><div class=\"bc-outer-container add_margin_20\"><videoplayer><div class=\"video-js-container\"><video data-video-id=\"6204270874001\" data-video-category=\"blog\" data-autostart=\"false\" data-account=\"62009828001\" data-omniture-account=\"mathwgbl\" data-player=\"rJ9XCz2Sx\" data-embed=\"default\" id=\"mathworks-brightcove-player\" class=\"video-js\" controls><\/video><script src=\"\/\/players.brightcove.net\/62009828001\/rJ9XCz2Sx_default\/index.min.js\"><\/script><script>if (typeof(playerLoaded) === 'undefined') {var playerLoaded = false;}(function isVideojsDefined() {if (typeof(videojs) !== 'undefined') {videojs(\"mathworks-brightcove-player\").on('loadedmetadata', function() {playerLoaded = true;});} else {setTimeout(isVideojsDefined, 10);}})();<\/script><\/div><\/videoplayer><\/div><\/div><\/div><\/p>\n<p>Play the video in full screen mode for a better viewing 
experience.\u00a0<\/p>\n","protected":false},"excerpt":{"rendered":"<div class=\"thumbnail thumbnail_asset asset_overlay video\"><a href=\"https:\/\/blogs.mathworks.com\/videos\/2020\/10\/23\/scraping-links-from-a-set-of-matlab-documentation-pages\/?dir=autoplay\"><img decoding=\"async\" src=\"https:\/\/cf-images.us-east-1.prod.boltdns.net\/v1\/static\/62009828001\/a20c801e-d174-4395-ba11-92b4c006323c\/3ed34bec-2837-47ae-b1da-58b22c2c9e79\/1280x720\/match\/image.jpg\" onError=\"this.style.display ='none';\"\/><\/p>\n<div class=\"overlay_container\">\n      <span class=\"icon-video icon_color_null\"><time class=\"video_length\">58:36<\/time><\/span>\n      <\/div>\n<p>      <\/a><\/div>\n<p>My colleague Sam asked if I could help him try and understand how a set of documentation pages were linked together and perhaps visualize them as a graph.<br \/>\nNow, I do have a good idea of all the pages&#8230; <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/videos\/2020\/10\/23\/scraping-links-from-a-set-of-matlab-documentation-pages\/\">read more 
>><\/a><\/p>\n","protected":false},"author":133,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[27,4],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/posts\/4721"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/users\/133"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/comments?post=4721"}],"version-history":[{"count":10,"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/posts\/4721\/revisions"}],"predecessor-version":[{"id":4741,"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/posts\/4721\/revisions\/4741"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/media?parent=4721"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/categories?post=4721"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/videos\/wp-json\/wp\/v2\/tags?post=4721"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}