{"id":7186,"date":"2016-05-13T17:05:52","date_gmt":"2016-05-13T21:05:52","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/?p=7186"},"modified":"2016-07-01T13:30:02","modified_gmt":"2016-07-01T17:30:02","slug":"flatten-nested-cell-arrays","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2016\/05\/13\/flatten-nested-cell-arrays\/","title":{"rendered":"Flatten (Nested) Cell Arrays"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <introduction>\r\n      <p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/3208495\">Sean<\/a>'s pick this week is <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/50502-flatten--nested--cell-arrays\">Flatten (Nested) Cell Arrays<\/a> by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/1100783\">Yung-Yeh<\/a>.\r\n      <\/p>\r\n   <\/introduction>\r\n   <h3>Background<a name=\"1\"><\/a><\/h3>\r\n   <p>I recently have been using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/webread.html\"><tt>webread<\/tt><\/a> to scan websites and mine information to do data analysis with.\r\n   <\/p>\r\n   <p>This requires a lot of <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html\">regular expressions<\/a>.  Regular expressions are one of those things that are incredibly frustrating but fun at the same time.  I'm usually looking\r\n      for some text pattern inside of html tags which means I'm going to be grabbing <i>'tokens'<\/i> or the unknown part that matches the expression.\r\n   <\/p>\r\n   <p>Let's do a simple example where we grab the list of MathWorks' products from the website <a href=\"https:\/\/www.mathworks.com\/products.html\">https:\/\/www.mathworks.com\/products.html<\/a>.\r\n   <\/p>\r\n   <p>First, let's identify the pattern we'll be looking for.  
I like to do this in a web browser thanks to the syntax highlighting\r\n      and other editor features.  Here's a first view:\r\n   <\/p>\r\n   <p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/pick\/files\/mathworksproductshtml.png\"> <\/p>\r\n   <p>There are two patterns we need to parameterize.   The first, in yellow, is the product reference.  The second, in green, is\r\n      the product name, the token we want to capture.\r\n   <\/p>\r\n   <p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/pick\/files\/mathworksproductshighlight.png\"> <\/p>\r\n   <p>Now let's code this up.  First, we'll grab the HTML.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">html = webread(<span style=\"color: #A020F0\">'https:\/\/www.mathworks.com\/products.html'<\/span>);<\/pre><p>Next, we'll build the regular expression.  It's always nice to keep the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html#input_argument_expression\">doc page<\/a> open for this.\r\n   <\/p>\r\n   <div>\r\n      <ol>\r\n         <li>Match the string literal \"\/products\/\"<\/li>\r\n         <li>Match any of the characters in the set \"[]\": word characters \"\\w\" or hyphens \"\\-\"<\/li>\r\n         <li>As many times as possible \"*\"<\/li>\r\n         <li>Match the next forward slash, closing double quote, and greater-than sign \"\/\"&gt;\"<\/li>\r\n         <li>Start the token with an opening parenthesis \"(\"<\/li>\r\n         <li>Match any word characters, hyphens, or whitespace \"\\s\", as many times as possible<\/li>\r\n         <li>Close the token with a closing parenthesis \")\"<\/li>\r\n      <\/ol>\r\n   <\/div><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">expr = <span style=\"color: #A020F0\">'\/products\/[\\w\\-]*\/\"&gt;([\\w\\-\\s]*)'<\/span>;<\/pre><p>Run the regular expression capturing tokens.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">tokens = regexp(html, expr, 
<span style=\"color: #A020F0\">'tokens'<\/span>);<\/pre><p>This is where Yung-Yeh's file comes in.  The output from <tt>regexp<\/tt> with tokens is a nested cell array: each match gets its own cell, which in turn holds a cell array of that match's tokens.  <tt>cellflat<\/tt> allows me to flatten it, through as many levels as necessary, into a cell string.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">tokens = cellflat(tokens);<\/pre><p>Now we can look at the unique products with trailing white space removed.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">disp(unique(deblank(tokens))')<\/pre><pre style=\"font-style:oblique\">    '...'\r\n    'Aerospace Blockset'\r\n    'Aerospace Toolbox'\r\n    'Antenna Toolbox'\r\n    'Audio System Toolbox'\r\n    'Bioinformatics Toolbox'\r\n    'Communications System Toolbox'\r\n    'Computer Vision System Toolbox'\r\n    'Control System Toolbox'\r\n    'Curve Fitting Toolbox'\r\n    'DO Qualification Kit'\r\n    'DSP System Toolbox'\r\n    'Data Acquisition Toolbox'\r\n    'Database Toolbox'\r\n    'Datafeed Toolbox'\r\n    'Econometrics Toolbox'\r\n    'Embedded Coder'\r\n    'Filter Design HDL Coder'\r\n    'Financial Instruments Toolbox'\r\n    'Financial Toolbox'\r\n    'Fixed-Point Designer'\r\n    'Fuzzy Logic Toolbox'\r\n    'Global Optimization Toolbox'\r\n    'HDL Coder'\r\n    'HDL Verifier'\r\n    'IEC Certification Kit'\r\n    'Image Acquisition Toolbox'\r\n    'Image Processing Toolbox'\r\n    'Instrument Control Toolbox'\r\n    'LTE System Toolbox'\r\n    'MATLAB'\r\n    'MATLAB Coder'\r\n    'MATLAB Compiler'\r\n    'MATLAB Compiler SDK'\r\n    'MATLAB Distributed Computing Server'\r\n    'MATLAB Mobile'\r\n    'MATLAB Production Server'\r\n    'MATLAB Report Generator'\r\n    'MATLAB for Home Use'\r\n    'Mapping Toolbox'\r\n    'Model Predictive Control Toolbox'\r\n    'Model-Based Calibration Toolbox'\r\n    'Neural Network 
Toolbox'\r\n    'OPC Toolbox'\r\n    'Optimization Toolbox'\r\n    'Parallel Computing Toolbox'\r\n    'Partial Differential Equation Toolbox'\r\n    'Phased Array System Toolbox'\r\n    'Polyspace Bug Finder'\r\n    'Polyspace Code Prover'\r\n    'RF Toolbox'\r\n    'Robotics System Toolbox'\r\n    'Robust Control Toolbox'\r\n    'Signal Processing Toolbox'\r\n    'SimBiology'\r\n    'SimEvents'\r\n    'SimRF'\r\n    'Simscape'\r\n    'Simscape Driveline'\r\n    'Simscape Electronics'\r\n    'Simscape Fluids'\r\n    'Simscape Multibody'\r\n    'Simscape Power Systems'\r\n    'Simulink'\r\n    'Simulink 3D Animation'\r\n    'Simulink Code Inspector'\r\n    'Simulink Coder'\r\n    'Simulink Control Design'\r\n    'Simulink Design Optimization'\r\n    'Simulink Design Verifier'\r\n    'Simulink Desktop Real-Time'\r\n    'Simulink PLC Coder'\r\n    'Simulink Real-Time'\r\n    'Simulink Report Generator'\r\n    'Simulink Test'\r\n    'Simulink Verification and Validation'\r\n    'Spreadsheet Link'\r\n    'Stateflow'\r\n    'Statistics and Machine Learning Toolbox'\r\n    'Symbolic Math Toolbox'\r\n    'System Identification Toolbox'\r\n    'Trading Toolbox'\r\n    'Vehicle Network Toolbox'\r\n    'Vision HDL Toolbox'\r\n    'WLAN System Toolbox'\r\n    'Wavelet Toolbox'\r\n<\/pre><h3>Comments<a name=\"6\"><\/a><\/h3>\r\n   <p>Give it a try and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=7186#respond\">here<\/a> or leave a <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/50502-flatten--nested--cell-arrays#comments\">comment<\/a> for Yung-Yeh.\r\n   <\/p><script language=\"JavaScript\">\r\n<!--\r\n\r\n    function grabCode_143de7754d0f48698c03c6ec6c835ce4() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB 
code.\r\n        t1='143de7754d0f48698c03c6ec6c835ce4 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 143de7754d0f48698c03c6ec6c835ce4';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Sean de Wolski';\r\n        copyright = 'Copyright 2016 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }   \r\n      \r\n-->\r\n<\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_143de7754d0f48698c03c6ec6c835ce4()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n            the MATLAB code \r\n            <noscript>(requires 
JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2016a<br><\/p>\r\n<\/div>\r\n<!--\r\n143de7754d0f48698c03c6ec6c835ce4 ##### SOURCE BEGIN #####\r\n%% Flatten (Nested) Cell Arrays\r\n%\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/3208495 Sean>'s pick this week is\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/50502-flatten--nested--cell-arrays Flatten (Nested) Cell Arrays > by\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/1100783 Yung-Yeh>.\r\n% \r\n\r\n%% Background\r\n%\r\n% I recently have been using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/webread.html |webread|> to scan\r\n% websites and mine information to do data analysis with.\r\n%\r\n% This requires a lot of\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html\r\n% regular expressions>.  Regular expressions are one of those things that\r\n% are incredibly frustrating but fun at the same time.  I'm usually looking\r\n% for some text pattern inside of html tags which means I'm going to be\r\n% grabbing _'tokens'_ or the unknown part that matches the expression.\r\n%\r\n% Let's do a simple example where we grab the list of MathWorks' products\r\n% from the website <www.mathworks.com\/products>.\r\n%\r\n% First, let's identify the pattern we'll be looking for.  I like to do\r\n% this in a web browser thanks to the syntax highlighting and other editor\r\n% features.  Here's a first view:\r\n%\r\n% <<mathworksproductshtml.png>>\r\n%\r\n% There are two patterns we need to parameterize.   The first, in yellow is\r\n% the product reference.  The second, in green, is the product name, the\r\n% token we want to capture.\r\n%\r\n% <<mathworksproductshighlight.png>>\r\n%\r\n% Now let's code this up.  First, we'll grab the html.\r\n\r\nhtml = webread('https:\/\/www.mathworks.com\/products.html');\r\n\r\n%%\r\n% Next, we'll build the regular expression.  
It's always nice to keep the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html#input_argument_expression\r\n% doc page> open for this.\r\n%\r\n% # Match the string literal \"\/products\/\"\r\n% # Match any of the characters in the set \"[]\": word characters \"\\w\" or hyphens \"\\-\"\r\n% # As many times as possible \"*\"\r\n% # Match the next forward slash, closing double quote, and greater-than sign \"\/\">\"\r\n% # Start the token with an opening parenthesis \"(\"\r\n% # Match any word characters, hyphens, or whitespace \"\\s\", as many times as possible\r\n% # Close the token with a closing parenthesis \")\"\r\n\r\nexpr = '\/products\/[\\w\\-]*\/\">([\\w\\-\\s]*)';\r\n\r\n%%\r\n% Run the regular expression capturing tokens.\r\n\r\ntokens = regexp(html, expr, 'tokens');\r\n\r\n%%\r\n% This is where Yung-Yeh's file comes in.  The output from |regexp| with\r\n% tokens is a nested cell that can have many nesting levels depending on\r\n% number of tokens and token nesting level.  |cellflat| allows me to\r\n% flatten it as many levels as necessary into a cell string.\r\n\r\ntokens = cellflat(tokens);\r\n\r\n%%\r\n% Now we can look at the unique products with white space at the end\r\n% removed.\r\n\r\ndisp(unique(deblank(tokens))')\r\n\r\n%% Comments\r\n% \r\n% Give it a try and let us know what you think\r\n% <https:\/\/blogs.mathworks.com\/pick\/?p=7186#respond here> or leave a\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/50502-flatten--nested--cell-arrays#comments\r\n% comment> for Yung-Yeh.\r\n \r\n\r\n##### SOURCE END ##### 143de7754d0f48698c03c6ec6c835ce4\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/pick\/files\/mathworksproductshtml.png\" onError=\"this.style.display ='none';\" \/><\/div><p>\r\n   \r\n      Sean's pick this week is Flatten (Nested) Cell Arrays by Yung-Yeh.\r\n      \r\n   \r\n   Background\r\n   I recently have been using webread to scan websites and mine information to do data... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2016\/05\/13\/flatten-nested-cell-arrays\/\">read more >><\/a><\/p>","protected":false},"author":87,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/7186"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/87"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=7186"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/7186\/revisions"}],"predecessor-version":[{"id":7554,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/7186\/revisions\/7554"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=7186"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=7186"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=7186"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}