{"id":5095,"date":"2014-01-17T09:00:55","date_gmt":"2014-01-17T14:00:55","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/?p=5095"},"modified":"2017-03-27T20:32:20","modified_gmt":"2017-03-28T00:32:20","slug":"scraping-data-from-the-web","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2014\/01\/17\/scraping-data-from-the-web\/","title":{"rendered":"Scraping data from the web"},"content":{"rendered":"<div class=\"content\"><p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/15007\">Jiro<\/a>'s pick this week is <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/44751-url-filter\"><tt>urlfilter<\/tt><\/a> by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/3050\">Ned Gulley<\/a>.<\/p><p>Many of you may know Ned from various parts of <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/\">MATLAB Central<\/a>, such as the community blog <a href=\"https:\/\/blogs.mathworks.com\/community\/\">\"MATLAB Spoken Here\"<\/a>. If you're a frequent visitor of MATLAB Central, you may have also visited Trendy, which allows you to quickly query and plot trends from the web. One of the utility functions provided within Trendy has been <tt>urlfilter<\/tt>, and it's a convenient function that allows you to easily scrape data from a web page. Now, you can use <tt>urlfilter<\/tt> outside of Trendy!<\/p><p>To see how it works, take a look at the Trendy tutorial or the published example script included with Ned's entry. But here's a quick example of how it could be used.<\/p><p>Let's say that I want to grab and plot the high and low temperatures in Natick, MA for the next 10 days. I will grab data from <a href=\"http:\/\/www.wunderground.com\/q\/zmw:01760.1.99999\">this URL<\/a> at <a href=\"http:\/\/www.wunderground.com\">http:\/\/www.wunderground.com<\/a>. As you can see from the web page, the 10-day forecast is displayed about halfway down the page in a table. 
Each day has a header in the format of \"day of week, day\", e.g. \"Friday, 17\".<\/p><p>First, I calculate the days I'm interested in, which run from today through 9 days from today (10 days in total). I also extract the day of the month with <tt>datevec<\/tt> and determine the day of the week using the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/weekday.html\"><tt>weekday<\/tt><\/a> function. I need this information because <tt>urlfilter<\/tt> uses it as the search term when scraping the data.<\/p><pre class=\"codeinput\">days = floor(now):floor(now)+9;\r\n[~, ~, dayval] = datevec(days);\r\n[~, weekdaystr] = weekday(days, <span class=\"string\">'long'<\/span>);\r\n<\/pre><p>Now, I simply call <tt>urlfilter<\/tt> in a loop, using each day's header as the search term.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Pre-allocate variables<\/span>\r\nlow = nan(1,length(days));\r\nhigh = nan(1,length(days));\r\n\r\nurl = <span class=\"string\">'http:\/\/www.wunderground.com\/q\/zmw:01760.1.99999'<\/span>;\r\n<span class=\"keyword\">for<\/span> iD = 1:length(days)\r\n    <span class=\"comment\">% Search term<\/span>\r\n    str = [strtrim(weekdaystr(iD,:)), <span class=\"string\">', '<\/span>, num2str(dayval(iD))];\r\n    disp([<span class=\"string\">'Scraping temperatures for \"'<\/span>, str, <span class=\"string\">'\"...'<\/span>])\r\n\r\n    <span class=\"comment\">% Fetch 2 values (high and low)<\/span>\r\n    vals = urlfilter(url,str,2);\r\n\r\n    high(iD) = vals(1);\r\n    low(iD) = vals(2);\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">Scraping temperatures for \"Friday, 17\"...\r\nScraping temperatures for \"Saturday, 18\"...\r\nScraping temperatures for \"Sunday, 19\"...\r\nScraping temperatures for \"Monday, 20\"...\r\nScraping temperatures for \"Tuesday, 21\"...\r\nScraping temperatures for \"Wednesday, 22\"...\r\nScraping temperatures for \"Thursday, 23\"...\r\nScraping temperatures for \"Friday, 24\"...\r\nScraping temperatures for \"Saturday, 25\"...\r\nScraping temperatures for \"Sunday, 
26\"...\r\n<\/pre><p>Let's plot the results. To show the temperature in two different units, I'm using my <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/7426-plot2axes\"><tt>plot2axes<\/tt><\/a> (shameless plug).<\/p><pre class=\"codeinput\">ax = plot2axes(days,high,<span class=\"string\">'r.-'<\/span>,days,low,<span class=\"string\">'b.-'<\/span>, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'YScale'<\/span>,@(x)5\/9*(x-32));\r\nylabel(ax(1),<span class=\"string\">'Temperature (\\circF)'<\/span>)\r\nylabel(ax(2),<span class=\"string\">'Temperature (\\circC)'<\/span>)\r\ndatetick(<span class=\"string\">'x'<\/span>,<span class=\"string\">'mmm dd'<\/span>,<span class=\"string\">'keepticks'<\/span>)\r\nlegend(<span class=\"string\">'High'<\/span>,<span class=\"string\">'Low'<\/span>)\r\ntitle(<span class=\"string\">'10-day Forecast for Natick, MA, U.S.A'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/pick\/jiro\/potw_urlfilter\/potw_urlfilter_01.png\" alt=\"\"> <p>Note that I could have done this more efficiently with a single call to <tt>urlfilter<\/tt>, extracting about 40 numbers at once, and then parsing the numbers to get the necessary high and low temperatures. I used the above approach to make it easier to understand.<\/p><p><b>Comments<\/b><\/p><p>Wasn't that easy? Give this a try, and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=5095#respond\">here<\/a> or leave a <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/44751-url-filter#comments\">comment<\/a> for Ned. 
If you find interesting data, consider tracking the trend using <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/trendy\/\">Trendy<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_252d2d06f95d431ab1193afd19dd2605() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='252d2d06f95d431ab1193afd19dd2605 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 252d2d06f95d431ab1193afd19dd2605';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2014 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_252d2d06f95d431ab1193afd19dd2605()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2013b<br><\/p><\/div><!--\r\n252d2d06f95d431ab1193afd19dd2605 ##### SOURCE BEGIN #####\r\n%%\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/15007\r\n% Jiro>'s pick this week is\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/44751-url-filter |urlfilter|>\r\n% by <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/3050 Ned\r\n% Gulley>.\r\n%\r\n% Many of you may know Ned from various parts of\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/ MATLAB Central>, such as the\r\n% community blog <https:\/\/blogs.mathworks.com\/community\/ \"MATLAB Spoken\r\n% Here\">. If you're a frequent visitor of MATLAB Central, you may have also\r\n% visited <https:\/\/www.mathworks.com\/matlabcentral\/trendy\/ Trendy>, which\r\n% allows you to quickly query and plot trends from the web. 
One of the\r\n% utility functions provided within Trendy has been |urlfilter|, and it's a\r\n% convenient function that allows you to easily scrape data from a web\r\n% page. Now, you can use |urlfilter| outside of Trendy!\r\n%\r\n% To see how it works, take a look at the\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/trendy\/Tutorial\/trendco.html\r\n% Trendy tutorial> or the published\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/44751-url-filter\/content\/html\/urlfilter_demo.html\r\n% example script> included with Ned's entry. But here's a quick example of\r\n% how it could be used.\r\n%\r\n% Let's say that I want to grab and plot the high and low temperatures in\r\n% Natick, MA for the next 10 days. I will grab data from\r\n% <http:\/\/www.wunderground.com\/q\/zmw:01760.1.99999 this URL> at\r\n% <http:\/\/www.wunderground.com>. As you can see from the web page, the 10-day\r\n% forecast is displayed about halfway down the page in a table. Each day\r\n% has a header in the format of \"day of week, day\", e.g. \"Friday, 17\".\r\n%\r\n% First, I calculate the days I'm interested in, which run from today\r\n% through 9 days from today (10 days in total). I also extract the day of\r\n% the month with |datevec| and determine the day of the week using the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/weekday.html |weekday|>\r\n% function. 
I need this information, because |urlfilter| will use this to\r\n% scrape the necessary data.\r\n\r\ndays = floor(now):floor(now)+9;\r\n[~, ~, dayval] = datevec(days);\r\n[~, weekdaystr] = weekday(days, 'long');\r\n\r\n%%\r\n% Now, I simply use |urlfilter| to iterate through each day as the search\r\n% term.\r\n\r\n% Pre-allocate variables\r\nlow = nan(1,length(days));\r\nhigh = nan(1,length(days));\r\n\r\nurl = 'http:\/\/www.wunderground.com\/q\/zmw:01760.1.99999';\r\nfor iD = 1:length(days)\r\n    % Search term\r\n    str = [strtrim(weekdaystr(iD,:)), ', ', num2str(dayval(iD))];\r\n    disp(['Scraping temperatures for \"', str, '\"...'])\r\n    \r\n    % Fetch 2 values (high and low)\r\n    vals = urlfilter(url,str,2);\r\n    \r\n    high(iD) = vals(1);\r\n    low(iD) = vals(2);\r\nend\r\n\r\n%%\r\n% Let's plot the results. To show the temperature in two different units,\r\n% I'm using my <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/7426-plot2axes\r\n% |plot2axes|> (shameless plug).\r\n\r\nax = plot2axes(days,high,'r.-',days,low,'b.-', ...\r\n    'YScale',@(x)5\/9*(x-32));\r\nylabel(ax(1),'Temperature (\\circF)')\r\nylabel(ax(2),'Temperature (\\circC)')\r\ndatetick('x','mmm dd','keepticks')\r\nlegend('High','Low')\r\ntitle('10-day Forecast for Natick, MA, U.S.A')\r\n\r\n%%\r\n% Note that I could have done this more efficiently with a single call to\r\n% |urlfilter|, extracting about 40 numbers at once, and then parsing the\r\n% numbers to get the necessary high and low temperatures. I used the above\r\n% approach to make it easier to understand.\r\n%\r\n% *Comments*\r\n%\r\n% Wasn't that easy? Give this a try, and let us know what you think\r\n% <https:\/\/blogs.mathworks.com\/pick\/?p=5095#respond here> or leave a\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/44751-url-filter#comments\r\n% comment> for Ned. 
If you find interesting data, consider tracking the\r\n% trend using <https:\/\/www.mathworks.com\/matlabcentral\/trendy\/ Trendy>!\r\n\r\n##### SOURCE END ##### 252d2d06f95d431ab1193afd19dd2605\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/pick\/jiro\/potw_urlfilter\/potw_urlfilter_01.png\" onError=\"this.style.display ='none';\" \/><\/div><p>Jiro's pick this week is urlfilter by Ned Gulley. Many of you may know Ned from various parts of MATLAB Central, such as the community blog \"MATLAB Spoken Here\". If you're a frequent visitor of MATLAB... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2014\/01\/17\/scraping-data-from-the-web\/\">read more >><\/a><\/p>","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/5095"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=5095"}],"version-history":[{"count":8,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/5095\/revisions"}],"predecessor-version":[{"id":8490,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/5095\/revisions\/8490"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=5095"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=5095"},{"taxonomy":"post_tag","embed
dable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=5095"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}