{"id":2650,"date":"2017-08-28T12:00:41","date_gmt":"2017-08-28T17:00:41","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=2650"},"modified":"2017-08-27T11:10:44","modified_gmt":"2017-08-27T16:10:44","slug":"c5-cleves-corner-collection-card-catalog","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2017\/08\/28\/c5-cleves-corner-collection-card-catalog\/","title":{"rendered":"C^5, Cleve&#8217;s Corner Collection Card Catalog"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>I have been writing books, programs, newsletter columns and blogs since 1990.  I have now collected all of this material into one repository.  Cleve's Corner Collection consists of 458 \"documents\", all available on the internet.  There are<\/p><div><ul><li>150 posts from Cleve's Corner blog.<\/li><li>43 columns from Cleve's Corner News and Notes edition.<\/li><li>33 chapters from two books, Experiments with MATLAB and Numerical   Computing with MATLAB.<\/li><li>218 programs from Cleve's Laboratory, EXM and NCM.<\/li><li>14 video transcripts from MIT Open Courseware and elsewhere.<\/li><\/ul><\/div><p>C^5 is an app, a search tool that acts like a traditional library card catalog.  It allows you to do keyword based searches through the collection and follow links to the material on the net.  
Responses to queries are ordered by the scores generated by Latent Semantic Indexing, LSI, which employs the singular value decomposition of a term-document matrix of key word counts.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#83cdc878-2a18-4bfd-a772-2a38afb51d74\">Opening figure<\/a><\/li><li><a href=\"#f981bcd2-fb80-4ef4-a3c2-dea5b1c2b470\">Don Knuth<\/a><\/li><li><a href=\"#1ebd759c-e3e2-4295-8369-9eb3c937bca3\">c5setup<\/a><\/li><li><a href=\"#a1d8a84f-c227-494d-9ee6-86a3610374b2\">c5database<\/a><\/li><li><a href=\"#93ad3887-5935-4a7e-91a7-76f6dcdebc69\">Sparsity<\/a><\/li><li><a href=\"#47bdf775-4121-4044-b032-832e82706679\">Spy<\/a><\/li><li><a href=\"#dcdc00c3-3431-4e8f-88ad-cdf56fe89a92\">Most frequent terms<\/a><\/li><li><a href=\"#82e8cd69-b496-475b-a593-a53a7be8c48a\">Singular values<\/a><\/li><li><a href=\"#39f74a82-e4e4-4813-bfb9-76ffb399a83b\">Reduced rank approximation<\/a><\/li><li><a href=\"#b9cf2e40-9a6f-46c8-909c-69dcad58e5b8\">Arrow keys<\/a><\/li><li><a href=\"#fd6b84be-f958-44cf-820e-06064ea0b547\">Lothar Collatz<\/a><\/li><li><a href=\"#a30da276-0c84-4a8d-a513-56b120fb6622\">Blackjack<\/a><\/li><li><a href=\"#1c3d5137-141f-4402-850f-593536e838dc\">Levenshtein distance<\/a><\/li><li><a href=\"#5ff5f5de-c2d8-42bf-8b24-7e7453404645\">Multi-word queries<\/a><\/li><li><a href=\"#4a3ed897-bfc0-4df2-90f2-a3aabef170ac\">Stemming<\/a><\/li><li><a href=\"#da7f9fd8-6c56-4251-91f9-42b43b2a0c32\">Parsing queries<\/a><\/li><li><a href=\"#b6601962-0de7-46ef-a2ca-d6b74eba7ac3\">Limitations<\/a><\/li><li><a href=\"#508af496-8985-492e-a065-f85d4c2d7099\">Software<\/a><\/li><\/ul><\/div><h4>Opening figure<a name=\"83cdc878-2a18-4bfd-a772-2a38afb51d74\"><\/a><\/h4><p>Here is the opening window for C^5.<\/p><pre class=\"codeinput\">   c5\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/c5_blog_01.png\" alt=\"\"> <p>Enter a query, usually just a single key word, in the edit box 
at the top.  This is a <i>term<\/i>.  The names of the various <i>documents<\/i> that are relevant to the term are then displayed, one at a time, in the document box.<\/p><p>The arrow keys allow the document list to be scanned and changed. The LSI score determines the ordering of the list.  The term count is the number of times, if any, that the query term appears in the document.  The web button accesses a copy of the document on the internet.<\/p><h4>Don Knuth<a name=\"f981bcd2-fb80-4ef4-a3c2-dea5b1c2b470\"><\/a><\/h4><p>For my first example, let's search for material I have written that mentions Stanford Emeritus Professor of Computer Science, Donald Knuth. Enter \"knuth\" in the query box or on the command line.<\/p><pre class=\"codeinput\">   c5 <span class=\"string\">knuth<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/c5_blog_02.png\" alt=\"\"> <p>The first document name \"blog\/c5_blog.m\" refers to this blog post, so there is a bit of self reference happening here. The document suffix is <tt>.m<\/tt> because the source texts for my blog are MATLAB programs processed by the <tt>publish<\/tt> command.<\/p><p>The term count \"10, 10\/29\" indicates that \"knuth\" appears 10 times in this document and that so far we have seen 10 out of the 29 times that \"knuth\" appears in the entire collection.<\/p><p>Click on the log button and then click a dozen or so times on the right arrow.  This log of the successive displays is printed in the command window.  
Document names, dates, and term counts are displayed in decreasing order of LSI score.<\/p><pre class=\"codeinput\"><span class=\"comment\">% knuth<\/span>\r\n\r\n<span class=\"comment\">% arrow                     document     term counts     lsi   date<\/span>\r\n<span class=\"comment\">%                        blog\/c5_blog.m   10   10\/29    0.594  28-Aug-2017<\/span>\r\n<span class=\"comment\">%  &gt;                 blog\/easter_blog.m    5   15\/29    0.553  18-Mar-2013<\/span>\r\n<span class=\"comment\">%  &gt;               blog\/lambertw_blog.m    3   18\/29    0.183  02-Sep-2013<\/span>\r\n<span class=\"comment\">%  &gt;           news\/stiff_equations.txt    2   20\/29    0.182  01-May-2003<\/span>\r\n<span class=\"comment\">%  &gt;                blog\/hilbert_blog.m    2   22\/29    0.139  02-Feb-2013<\/span>\r\n<span class=\"comment\">%  &gt;                 blog\/random_blog.m    2   24\/29    0.139  17-Apr-2015<\/span>\r\n<span class=\"comment\">%  &gt;                      exmm\/easter.m    1   25\/29    0.112  2016<\/span>\r\n<span class=\"comment\">%  &gt;                         blog\/gef.m    3   28\/29    0.100  07-Jan-2013<\/span>\r\n<span class=\"comment\">%  &gt;                     ncmm\/ode23tx.m    0   28\/29    0.086  2016<\/span>\r\n<span class=\"comment\">%  &gt;           news\/normal_behavior.txt    0   28\/29    0.070  01-May-2001<\/span>\r\n<span class=\"comment\">%  &gt;                     blog\/magic_2.m    0   28\/29    0.059  05-Nov-2012<\/span>\r\n<span class=\"comment\">%                                         ..........<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                    blog\/denorms.m    1   29\/29    0.010  21-Jul-2014<\/span>\r\n<\/pre><p>The second most relevant document, \"easter_blog.m\", is a post from 2013 that describes an algorithm, popularized by Knuth, for computing the date each year, in the Western or Gregorian calendar, on which Easter Sunday is 
celebrated.  The term count is \"5, 15\/29\", so the first two documents account for slightly over half of the total appearances of the search term.<\/p><p>The next six lines tell us that \"knuth\" appears in blog posts about the Lambert W function, Hilbert matrices, random numbers, and George Forsythe (gef), as well as a MATLAB News and Notes column in 2003 about stiff differential equations, and the actual MATLAB program from EXM for computing the date of Easter.<\/p><p>The following results with term counts of zero are blog posts that do not contain \"knuth\", but which have LSI scores indicating they might be relevant.  Finally, the blog post named \"denorms\" is about denormal floating point numbers.  It is reached by right-clicking on the right arrow to skip over documents with term counts of zero.<\/p><h4>c5setup<a name=\"1ebd759c-e3e2-4295-8369-9eb3c937bca3\"><\/a><\/h4><p>I don't know how to parse <tt>.html<\/tt> or <tt>.pdf<\/tt> files, so I have collected the original source material for everything that I have written that is now available on the web.  There are <tt>.m<\/tt> files for the blog and MATLAB programs, <tt>.tex<\/tt> files for the LaTeX of the book chapters, and <tt>.txt<\/tt> files for the newsletter columns and transcripts of the videos. There are 458 files totaling about 3.24 megabytes of text.<\/p><p>I have a program, <tt>c5setup<\/tt>, that I run on my own laptop to extract all the individual words and produce the term-document matrix. This is a sparse matrix whose <tt>(k,j)<\/tt> -th entry is the number of times that the <tt>k<\/tt> -th term appears in the <tt>j<\/tt> -th document.  
It is saved in <tt>c5database.mat<\/tt> for use by the <tt>c^5<\/tt> app.<\/p><p>This setup processing eliminates frequently occurring English language words, like \"the\", on a list of <tt>stopwords<\/tt>.<\/p><pre class=\"codeinput\">   length(stopwords)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n   177\r\n<\/pre><h4>c5database<a name=\"a1d8a84f-c227-494d-9ee6-86a3610374b2\"><\/a><\/h4><pre class=\"codeinput\">   clear\r\n   load <span class=\"string\">c5database<\/span>\r\n   whos\r\n<\/pre><pre class=\"codeoutput\">  Name          Size               Bytes  Class     Attributes\r\n\r\n  A         16315x458            1552776  double    sparse    \r\n  D           458x1                40628  string              \r\n  L             1x1               120156  struct              \r\n  T         16315x1              1132930  string              \r\n\r\n<\/pre><div><ul><li><tt>A<\/tt> is the term-document matrix.<\/li><li><tt>D<\/tt> is a string array of the file names in my personal repository   of the source documents.<\/li><li><tt>L<\/tt> is a struct containing string arrays used to   generate URLs of the documents on the web.<\/li><li><tt>T<\/tt> is a string array of key words or terms.<\/li><\/ul><\/div><h4>Sparsity<a name=\"93ad3887-5935-4a7e-91a7-76f6dcdebc69\"><\/a><\/h4><p>The sparsity of the term-document matrix is a little over one percent.<\/p><pre class=\"codeinput\">   sparsity = nnz(A)\/numel(A)\r\n<\/pre><pre class=\"codeoutput\">sparsity =\r\n    0.0130\r\n<\/pre><h4>Spy<a name=\"47bdf775-4121-4044-b032-832e82706679\"><\/a><\/h4><p>Spy plot of the first 1000 rows of the term-document matrix.<\/p><pre class=\"codeinput\">   clf\r\n   spy(A(1:1000,:))\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/c5_blog_03.png\" alt=\"\"> <h4>Most frequent terms<a name=\"dcdc00c3-3431-4e8f-88ad-cdf56fe89a92\"><\/a><\/h4><p>The row sums are the total term counts.<\/p><pre class=\"codeinput\">   ttc = 
sum(A,2);\r\n<\/pre><p>Find the terms that occur at least 1000 times.<\/p><pre class=\"codeinput\">   k = find(ttc &gt;= 1000);\r\n   fprintf(<span class=\"string\">'%-10s %6s\\n'<\/span>,[T(k) num2str(ttc(k))]')\r\n<\/pre><pre class=\"codeoutput\">function     1806\r\nmatlab       1407\r\nmatrix       1499\r\none          1262\r\ntwo          1090\r\n<\/pre><p>Surprise.  I write a lot about MATLAB and matrices.<\/p><h4>Singular values<a name=\"82e8cd69-b496-475b-a593-a53a7be8c48a\"><\/a><\/h4><p>We might as well compute all the singular values of the full matrix. It takes less than a second.  It's important to use the economical version of the SVD that produces a <tt>U<\/tt> the same size as <tt>A<\/tt>. Otherwise we'd have a 16,315-by-16,315 <tt>U<\/tt>.<\/p><pre class=\"codeinput\">   tic\r\n   [U,S,V] = svd(full(A),<span class=\"string\">'econ'<\/span>);\r\n   toc\r\n<\/pre><pre class=\"codeoutput\">Elapsed time is 0.882556 seconds.\r\n<\/pre><p>A logarithmic plot of the singular values shows that they do not decrease very rapidly.<\/p><pre class=\"codeinput\">   clf\r\n   semilogy(diag(S),<span class=\"string\">'.'<\/span>,<span class=\"string\">'markersize'<\/span>,10)\r\n   axis([-10 450 1 1000])\r\n   title(<span class=\"string\">'singular values'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/c5_blog_04.png\" alt=\"\"> <h4>Reduced rank approximation<a name=\"39f74a82-e4e4-4813-bfb9-76ffb399a83b\"><\/a><\/h4><p>I wrote a post about <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/07\/31\/latent-semantic-indexing-svd-and-zipfs-law\/\">Latent Semantic Indexing<\/a> a month ago.  LSI employs a reduced rank approximation to the term-document matrix.  <tt>c^5<\/tt> has a slider for choosing the rank.  The plot of the singular values shows that the accuracy of the approximation is pretty much independent of the chosen value. 
Any value except very small values or large values near full rank gives an approximation good to between one and ten percent.  The power of LSI does not derive from the approximation accuracy.  I usually take the rank to be about half the number of columns.<\/p><pre class=\"codeinput\">   n = size(A,2);\r\n   k = n\/2;\r\n   Uk = U(:,1:k);\r\n   Sk = S(1:k,1:k);\r\n   Vk = V(:,1:k);\r\n   relerr = norm(Uk*Sk*Vk'-A)\/S(1,1)\r\n<\/pre><pre class=\"codeoutput\">relerr =\r\n    0.0357\r\n<\/pre><h4>Arrow keys<a name=\"b9cf2e40-9a6f-46c8-909c-69dcad58e5b8\"><\/a><\/h4><p>The three arrow keys in the <tt>c^5<\/tt> app can be clicked with either the left or right mouse button (or control-click on a one-button mouse).<\/p><div><ul><li>left &gt;: next document, any term count.<\/li><li>right &gt;: next document with nonzero term count.<\/li><li>left &lt;: previous document, any term count.<\/li><li>right &lt;: previous document with nonzero term count.<\/li><li>left ^: use the root of the current document for the query.<\/li><li>right ^: use a random term for the query.<\/li><\/ul><\/div><p>Repeatedly clicking the up arrow with the right button (an alt click) is a good way to browse the entire collection.<\/p><h4>Lothar Collatz<a name=\"fd6b84be-f958-44cf-820e-06064ea0b547\"><\/a><\/h4><p>Let's see the logs for two more examples.  
Lothar Collatz has a short log.<\/p><pre class=\"codeinput\">   c5 <span class=\"string\">Collatz<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/c5_blog_05.png\" alt=\"\"> <pre class=\"codeinput\"><span class=\"comment\">% collatz<\/span>\r\n<span class=\"comment\">%<\/span>\r\n<span class=\"comment\">% arrow                     document     term counts     lsi   date<\/span>\r\n<span class=\"comment\">%               blog\/threenplus1_blog.m    9    9\/19    0.904  19-Jan-2015<\/span>\r\n<span class=\"comment\">%  &gt;&gt;         blog\/collatz_inequality.m    4   13\/19    0.108  16-Mar-2015<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                    blog\/c5_blog.m    5   18\/19    0.075  28-Aug-2017<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                     ncm\/intro.tex    1   19\/19   -0.003  2004<\/span>\r\n<\/pre><p>Collatz appears in two posts from 2015, one on his <tt>3n+1<\/tt> problem and one on an elegant inequality that produces a surprising graphic, and in the section of this blog post about <tt>c^5<\/tt> that you are now reading.  He is also mentioned in the introduction to the NCM book, but the LSI value is very small.  
The double arrow at the beginning of each line signifies a right click, skipping over documents that do not mention him.<\/p><h4>Blackjack<a name=\"a30da276-0c84-4a8d-a513-56b120fb6622\"><\/a><\/h4><p>I have written a lot about the card game Blackjack.<\/p><pre class=\"codeinput\">   c5 <span class=\"string\">blackjack<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/c5_blog_06.png\" alt=\"\"> <pre class=\"codeinput\"><span class=\"comment\">% blackjack<\/span>\r\n\r\n<span class=\"comment\">% arrow                     document     term counts     lsi   date<\/span>\r\n<span class=\"comment\">%         news\/simulating_blackjack.txt   19   19\/68    0.536  01-Oct-2012<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                     ncmm\/ncmgui.m    4   23\/68    0.372  2016<\/span>\r\n<span class=\"comment\">%  &gt;&gt;               blog\/random_blog2.m    4   27\/68    0.266  04-May-2015<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                   ncmm\/Contents.m    2   29\/68    0.244  2016<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                    blog\/c5_blog.m    5   34\/68    0.206  28-Aug-2017<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                    ncm\/random.tex   13   47\/68    0.148  2004<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                 lab\/thumbnails2.m    2   49\/68    0.088  2017<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                        lab\/lab2.m    1   50\/68    0.061  2017<\/span>\r\n<span class=\"comment\">%  &gt;&gt;      news\/numerical_computing.txt    1   51\/68    0.025  01-Jun-2004<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                   blog\/lab_blog.m    1   52\/68    0.004  31-Oct-2016<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                  ncmm\/blackjack.m    8   60\/68   -0.023  2016<\/span>\r\n<span class=\"comment\">%  &gt;&gt;                   lab\/blackjack.m    8   68\/68   -0.026  
2017<\/span>\r\n<\/pre><p>We can see two newsletter columns, three blogs, a portion of a book chapter, several code segments, and two copies of the blackjack app. Again, I am using right clicks.<\/p><h4>Levenshtein distance<a name=\"1c3d5137-141f-4402-850f-593536e838dc\"><\/a><\/h4><p>I recently wrote a blog post about <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/08\/14\/levenshtein-edit-distance-between-strings\/\">Levenshtein Edit Distance Between Strings<\/a>.  If <tt>c^5<\/tt> does not recognize the key word in a query, it uses Levenshtein distance to find the closest match in the term list to the unrecognized query.  This easily corrects simple spelling mistakes, like missing letters.  For example the missing \"i\" in \"polynomal\" is corrected to become \"polynomial\". And \"Levenstein\" becomes \"levenshtein\".<\/p><p>I received a pleasant surprise when I entered \"Molar\", expecting it to become \"moler\".  Instead, I got \"polar\" because only one substitution is required to convert \"Molar\" to \"polar\", but two substitutions are required to turn \"Molar\" into \"moler\".  (Microsoft Word spelling correction used to turn \"MATLAB\" into \"Meatball\".)<\/p><h4>Multi-word queries<a name=\"5ff5f5de-c2d8-42bf-8b24-7e7453404645\"><\/a><\/h4><p>I'm not quite sure what to do with queries consisting of more than one term.  What is the expected response to a query of \"Wilkinson polynomial\", for example?  Is it documents that contain <i>either<\/i> \"Wilkinson\" <i>or<\/i> \"polynomial\"?  This is what LSI would provide. But it is probably better to look for documents that contain <i>both<\/i> \"Wilkinson\" <i>and<\/i> \"polynomial\".  I'm not sure how to do this.<\/p><p>Worse yet, I can't look for an exact match to the two-word string \"Wilkinson polynomial\" because the first thing the setup program does is to break text into individual words.<\/p><h4>Stemming<a name=\"4a3ed897-bfc0-4df2-90f2-a3aabef170ac\"><\/a><\/h4><p>This project is not finished.  
If I work on it any more, I am going to have to learn about <i>scraping<\/i>, <i>stemming<\/i> and <i>lemmatization<\/i> of the source texts.  This involves relatively simple tasks like removing possessives and plurals and more complicated tasks like combining all the words with the same root or <i>lemma<\/i>.  The sentence<\/p><pre class=\"language-matlab\"><span class=\"string\">\"the quick brown fox jumped over the lazy dog's back\"<\/span>\r\n<\/pre><p>becomes<\/p><pre class=\"language-matlab\"><span class=\"string\">\"the quick brown fox jump over the lazi dog' back\"<\/span>\r\n<\/pre><p>Loren's guest blogger Toshi Takeuchi <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\">posted an article<\/a> in 2015 about Latent Semantic Analysis with MATLAB. He references <a href=\"http:\/\/tartarus.org\/martin\/PorterStemmer\/matlab.txt\">MATLAB code<\/a> for stemming.<\/p><h4>Parsing queries<a name=\"da7f9fd8-6c56-4251-91f9-42b43b2a0c32\"><\/a><\/h4><p>I can imagine doing a better job of parsing queries, although I could never approach the sophistication of a system like Google or Siri.<\/p><h4>Limitations<a name=\"b6601962-0de7-46ef-a2ca-d6b74eba7ac3\"><\/a><\/h4><p>A significant fraction of what I have written is not prose -- it is mathematics or code.  It cannot be parsed with the techniques of text analytics.  For example, the source texts for the books NCM and EXM have hundreds of snippets of LaTeX like<\/p><pre>\\begin{eqnarray*}\r\n A V \\eqs U \\Sigma , \\\\\r\n A^H U \\eqs V \\Sigma^H .\r\n\\end{eqnarray*}<\/pre><p>And earlier in this blog post I had<\/p><pre>tic\r\n[U,S,V] = svd(full(A),'econ');\r\ntoc<\/pre><p>My <tt>c5setup<\/tt> program now has to skip over everything like this. 
In doing so, it misses much of the message.<\/p><h4>Software<a name=\"508af496-8985-492e-a065-f85d4c2d7099\"><\/a><\/h4><p>I have updated <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/59085-cleve-laboratory\">Cleve's Laboratory<\/a> in the Central File Exchange to include <tt>c5.m<\/tt> and <tt>c5database.mat<\/tt>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_e3bb1eb330494ccb89b42865eb518eee() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='e3bb1eb330494ccb89b42865eb518eee ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' e3bb1eb330494ccb89b42865eb518eee';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_e3bb1eb330494ccb89b42865eb518eee()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\ne3bb1eb330494ccb89b42865eb518eee ##### SOURCE BEGIN #####\r\n%% C^5, Cleve's Corner Collection Card Catalog\r\n% I have been writing books, programs, newsletter columns and blogs\r\n% since 1990.  I have now collected all of this material into one\r\n% repository.  Cleve's Corner Collection consists of 458 \"documents\",\r\n% all available on the internet.  There are\r\n%\r\n% * 150 posts from Cleve's Corner blog.\r\n% * 43 columns from Cleve's Corner News and Notes edition.\r\n% * 33 chapters from two books, Experiments with MATLAB and Numerical\r\n%   Computing with MATLAB.\r\n% * 218 programs from Cleve's Laboratory, EXM and NCM.\r\n% * 14 video transcripts from MIT Open Courseware and elsewhere.\r\n%\r\n% C^5 is an app, a search tool that acts like a traditional library card\r\n% catalog.  
It allows you to do keyword based searches through the\r\n% collection and follow links to the material on the net.  Responses to\r\n% queries are ordered by the scores generated by Latent Semantic Indexing,\r\n% LSI, which employs the singular value decomposition of a term-document\r\n% matrix of key word counts.\r\n\r\n%% Opening figure\r\n% Here is the opening window for C^5.\r\n\r\n   c5\r\n   \r\n%%\r\n% Enter a query, usually just a single key word, in the edit box\r\n% at the top.  This is a _term_.  The names of the various _documents_\r\n% that are relevant to the term are then displayed, one at a time,\r\n% in the document box.\r\n\r\n%% \r\n% The arrow keys allow the document list to be scanned and changed.\r\n% The LSI score determines the ordering of the list.  The term count\r\n% is the number of times, if any, that the query term appears in the\r\n% document.  The web button accesses a copy of the document on the\r\n% internet.\r\n   \r\n%% Don Knuth\r\n% For my first example, let's search for material I have written that\r\n% mentions Stanford Emeritus Professor of Computer Science, Donald Knuth.\r\n% Enter \"knuth\" in the query box or on the command line.\r\n\r\n   c5 knuth\r\n   \r\n%%\r\n% The first document name \"blog\/c5_blog.m\" refers to this blog post,\r\n% so there is a bit of self reference happening here.\r\n% The document suffix is |.m| because the source texts\r\n% for my blog are MATLAB programs processed by the |publish| command.\r\n\r\n%%\r\n% The term count \"10, 10\/29\" indicates that \"knuth\" appears 10 times in\r\n% this document and that so far we have seen 10 out of the 29 times that\r\n% \"knuth\" appears in the entire collection.\r\n\r\n%%\r\n% Click on the log button and then click a dozen or so times on the\r\n% right arrow.  This log of the successive displays is printed in the\r\n% command window.  
Document names, dates, and term counts are displayed\r\n% in decreasing order of LSI score.\r\n\r\n% knuth\r\n\r\n% arrow                     document     term counts     lsi   date\r\n%                        blog\/c5_blog.m   10   10\/29    0.594  28-Aug-2017\r\n%  >                 blog\/easter_blog.m    5   15\/29    0.553  18-Mar-2013\r\n%  >               blog\/lambertw_blog.m    3   18\/29    0.183  02-Sep-2013\r\n%  >           news\/stiff_equations.txt    2   20\/29    0.182  01-May-2003\r\n%  >                blog\/hilbert_blog.m    2   22\/29    0.139  02-Feb-2013\r\n%  >                 blog\/random_blog.m    2   24\/29    0.139  17-Apr-2015\r\n%  >                      exmm\/easter.m    1   25\/29    0.112  2016\r\n%  >                         blog\/gef.m    3   28\/29    0.100  07-Jan-2013\r\n%  >                     ncmm\/ode23tx.m    0   28\/29    0.086  2016\r\n%  >           news\/normal_behavior.txt    0   28\/29    0.070  01-May-2001\r\n%  >                     blog\/magic_2.m    0   28\/29    0.059  05-Nov-2012\r\n%                                         ..........\r\n%  >>                    blog\/denorms.m    1   29\/29    0.010  21-Jul-2014\r\n\r\n%%\r\n% The second most relevant document, \"easter_blog.m\", is a post from 2013\r\n% that describes an algorithm, popularized by Knuth, for computing the date\r\n% each year, in the Western or Gregorian calendar, on which Easter Sunday is\r\n% celebrated.  
The term count is \"5, 15\/29\", so the first two documents\r\n% account for slightly over half of the total appearances of the\r\n% search term.\r\n\r\n%%\r\n% The next six lines tell us that \"knuth\" appears in blog posts about\r\n% the Lambert W function, Hilbert matrices, random numbers, and\r\n% George Forsythe (gef), as well as a MATLAB News and Notes column in 2003\r\n% about stiff differential equations, and the actual MATLAB program from\r\n% EXM for computing the date of Easter.\r\n\r\n%%\r\n% The following results with term counts of zero are blog posts that do\r\n% not contain \"knuth\", but which have LSI scores indicating they might be\r\n% relevant.  Finally, the blog post named \"denorms\" is about denormal\r\n% floating point numbers.  It is reached by right-clicking on the right\r\n% arrow to skip over documents with term counts of zero.\r\n\r\n%% c5setup\r\n% I don't know how to parse |.html| or |.pdf| files, so I have collected\r\n% the original source material for everything that I have written\r\n% that is now available on the web.  There are |.m| files for the blog\r\n% and MATLAB programs, |.tex| files for the LaTeX of the book chapters,\r\n% and |.txt| files for the newsletter columns and transcripts of the \r\n% videos.\r\n% There are 458 files totaling about 3.24 megabytes of text.\r\n\r\n%%\r\n% I have a program, |c5setup|, that I run on my own laptop to extract\r\n% all the individual words and produce the term-document matrix.\r\n% This is a sparse matrix whose |(k,j)| -th entry is the number of times\r\n% that the |k| -th term appears in the |j| -th document.  
It is saved\r\n% in |c5database.mat| for use by the |c^5| app.\r\n\r\n%%\r\n% This setup processing eliminates frequently occurring English language\r\n% words, like \"the\", on a list of |stopwords|.\r\n\r\n   length(stopwords)  \r\n\r\n%% c5database\r\n\r\n   clear\r\n   load c5database\r\n   whos\r\n   \r\n%%\r\n%\r\n% * |A| is the term-document matrix.\r\n% * |D| is a string array of the file names in my personal repository\r\n%   of the source documents.\r\n% * |L| is a struct containing string arrays used to \r\n%   generate URLs of the documents on the web.\r\n% * |T| is a string array of key words or terms.\r\n\r\n%% Sparsity\r\n% The sparsity of the term-document matrix is a little over one percent.\r\n\r\n   sparsity = nnz(A)\/numel(A)\r\n   \r\n%% Spy\r\n% Spy plot of the first 1000 rows of the term-document matrix.\r\n\r\n   clf\r\n   spy(A(1:1000,:))\r\n   \r\n    \r\n%% Most frequent terms\r\n% The row sums are the total term counts.\r\n\r\n   ttc = sum(A,2);\r\n   \r\n%%\r\n% Find the terms that occur at least 1000 times.\r\n\r\n   k = find(ttc >= 1000);\r\n   fprintf('%-10s %6s\\n',[T(k) num2str(ttc(k))]')\r\n   \r\n%%\r\n% Surprise.  I write a lot about MATLAB and matrices.\r\n\r\n%% Singular values\r\n% We might as well compute all the singular values of the full matrix.\r\n% It takes less than a second.  
It's important to use the economical\r\n% version of the SVD that produces a |U| the same size as |A|.\r\n% Otherwise we'd have a 16,315-by-16,315 |U|.\r\n\r\n   tic\r\n   [U,S,V] = svd(full(A),'econ'); \r\n   toc\r\n   \r\n%%\r\n% A logarithmic plot of the singular values shows that they do not\r\n% decrease very rapidly.\r\n\r\n   clf\r\n   semilogy(diag(S),'.','markersize',10)\r\n   axis([-10 450 1 1000])\r\n   title('singular values')\r\n   \r\n%% Reduced rank approximation\r\n% I wrote a post about\r\n% <https:\/\/blogs.mathworks.com\/cleve\/2017\/07\/31\/latent-semantic-indexing-svd-and-zipfs-law\/\r\n% Latent Semantic Indexing> a month ago.  LSI employs a reduced rank\r\n% approximation to the term-document matrix.  |c^5| has a slider for\r\n% choosing the rank.  The plot of the singular values shows that the\r\n% accuracy of the approximation is pretty much independent of the\r\n% chosen value.\r\n% Any value except very small values or large values near full rank gives\r\n% an approximation good to between one and ten percent.  The power of LSI\r\n% does not derive from the approximation accuracy.  
I usually take the\r\n% rank to be about half the number of columns.\r\n\r\n   n = size(A,2);\r\n   k = n\/2;\r\n   Uk = U(:,1:k);\r\n   Sk = S(1:k,1:k);\r\n   Vk = V(:,1:k);\r\n   relerr = norm(Uk*Sk*Vk'-A)\/S(1,1)\r\n   \r\n%% Arrow keys\r\n% The three arrow keys in the |c^5| app can be clicked with either the\r\n% left or right mouse button (or control-click on a one-button mouse).\r\n%\r\n% * left >: next document, any term count.\r\n% * right >: next document with nonzero term count.\r\n% * left <: previous document, any term count.\r\n% * right <: previous document with nonzero term count.\r\n% * left ^: use the root of the current document for the query.\r\n% * right ^: use a random term for the query.\r\n\r\n%%\r\n% Repeatedly clicking the up arrow with the right button (an alt click)\r\n% is a good way to browse the entire collection.\r\n\r\n%% Lothar Collatz\r\n% Let's see the logs for two more examples.  Lothar Collatz has a\r\n% short log.\r\n\r\n   c5 Collatz\r\n\r\n%%\r\n\r\n% collatz\r\n%\r\n% arrow                     document     term counts     lsi   date\r\n%               blog\/threenplus1_blog.m    9    9\/19    0.904  19-Jan-2015\r\n%  >>         blog\/collatz_inequality.m    4   13\/19    0.108  16-Mar-2015\r\n%  >>                    blog\/c5_blog.m    5   18\/19    0.075  28-Aug-2017\r\n%  >>                     ncm\/intro.tex    1   19\/19   -0.003  2004\r\n\r\n%%\r\n% Collatz appears in two posts from 2015, one on his |3n+1| problem and\r\n% one on an elegant inequality that produces a surprising graphic,\r\n% and in the section of this blog post about |c^5| that you\r\n% are now reading.  He is also mentioned in the introduction to\r\n% the NCM book, but the LSI value is very small.  
The double arrow at\r\n% the beginning of each line signifies a right click, skipping over\r\n% documents that do not mention him.\r\n\r\n%% Blackjack\r\n% I have written a lot about the card game Blackjack.\r\n\r\n   c5 blackjack\r\n\r\n%%\r\n\r\n% blackjack\r\n\r\n% arrow                     document     term counts     lsi   date\r\n%         news\/simulating_blackjack.txt   19   19\/68    0.536  01-Oct-2012\r\n%  >>                     ncmm\/ncmgui.m    4   23\/68    0.372  2016\r\n%  >>               blog\/random_blog2.m    4   27\/68    0.266  04-May-2015\r\n%  >>                   ncmm\/Contents.m    2   29\/68    0.244  2016\r\n%  >>                    blog\/c5_blog.m    5   34\/68    0.206  28-Aug-2017\r\n%  >>                    ncm\/random.tex   13   47\/68    0.148  2004\r\n%  >>                 lab\/thumbnails2.m    2   49\/68    0.088  2017\r\n%  >>                        lab\/lab2.m    1   50\/68    0.061  2017\r\n%  >>      news\/numerical_computing.txt    1   51\/68    0.025  01-Jun-2004\r\n%  >>                   blog\/lab_blog.m    1   52\/68    0.004  31-Oct-2016\r\n%  >>                  ncmm\/blackjack.m    8   60\/68   -0.023  2016\r\n%  >>                   lab\/blackjack.m    8   68\/68   -0.026  2017\r\n\r\n%%\r\n% We can see two newsletter columns, three blogs, a portion of a book\r\n% chapter, several code segments, and two copies of the blackjack app.\r\n% Again, I am using right clicks.\r\n\r\n%% Levenshtein distance\r\n% I recently wrote a blog post about\r\n% <https:\/\/blogs.mathworks.com\/cleve\/2017\/08\/14\/levenshtein-edit-distance-between-strings\/\r\n% Levenshtein Edit Distance Between Strings>.  If |c^5| does not recognize\r\n% the key word in a query, it uses Levenshtein distance to find the \r\n% closest match in the term list to the\r\n% unrecognized query.  This easily corrects simple spelling mistakes,\r\n% like missing letters.  
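\r\n%\r\n%%\r\n% Here is a hypothetical sketch of the underlying dynamic programming\r\n% computation of Levenshtein distance.  It is an illustration, not\r\n% necessarily the code inside |c5|.\r\n\r\n   s = 'polynomal';\r\n   t = 'polynomial';\r\n   m = length(s);\r\n   n = length(t);\r\n   D = zeros(m+1,n+1);\r\n   D(:,1) = (0:m)';             % cost of deleting a prefix of s\r\n   D(1,:) = 0:n;                % cost of inserting a prefix of t\r\n   for i = 1:m\r\n      for j = 1:n\r\n         c = (s(i) ~= t(j));   % substitution costs 0 if characters match\r\n         D(i+1,j+1) = min([D(i,j)+c, D(i,j+1)+1, D(i+1,j)+1]);\r\n      end\r\n   end\r\n   d = D(m+1,n+1)              % d = 1, inserting the missing 'i'\r\n\r\n%%\r\n% 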
For example, the missing \"i\" in \"polynomal\" is\r\n% corrected to become \"polynomial\".\r\n% And \"Levenstein\" becomes \"levenshtein\".\r\n\r\n%%\r\n% I received a pleasant surprise when I entered \"Molar\", expecting it\r\n% to become \"moler\".  Instead, I got \"polar\" because only one substitution\r\n% is required to convert \"Molar\" to \"polar\", but two substitutions are\r\n% required to turn \"Molar\" into \"moler\".  (Microsoft Word spelling\r\n% correction used to turn \"MATLAB\" into \"Meatball\".)\r\n\r\n%% Multi-word queries\r\n% I'm not quite sure what to do with queries consisting of more than\r\n% one term.  What is the expected response to a query of \"Wilkinson\r\n% polynomial\", for example?  Is it documents that contain _either_\r\n% \"Wilkinson\" _or_ \"polynomial\"?  This is what LSI would provide.\r\n% But it is probably better to look for documents that contain _both_\r\n% \"Wilkinson\" _and_ \"polynomial\".  I'm not sure how to do this.\r\n\r\n%%\r\n% Worse yet, I can't look for an exact match to the two-word string\r\n% \"Wilkinson polynomial\" because the first thing the setup program\r\n% does is to break text into individual words.\r\n\r\n%% Stemming\r\n% This project is not finished.  If I work on it any more, I am going to\r\n% have to learn about _scraping_, _stemming_ and _lemmatization_ of the\r\n% source texts.  This involves relatively simple tasks like removing\r\n% possessives and plurals and more complicated tasks like combining\r\n% all the words with the same root or _lemma_.  
The sentence\r\n%\r\n%   \"the quick brown fox jumped over the lazy dog's back\"\r\n%\r\n% becomes\r\n%\r\n%   \"the quick brown fox jump over the lazi dog' back\"\r\n%\r\n\r\n%%\r\n% Loren's guest blogger Toshi Takeuchi\r\n% <https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\r\n% posted an article> in 2015 about Latent Semantic Analysis with MATLAB.\r\n% He references\r\n% <http:\/\/tartarus.org\/martin\/PorterStemmer\/matlab.txt MATLAB code>\r\n% for stemming.\r\n\r\n%% Parsing queries\r\n% I can imagine doing a better job of parsing queries, although I could\r\n% never approach the sophistication of a system like Google or Siri.\r\n\r\n%% Limitations\r\n% A significant fraction of what I have written is not prose -- it is\r\n% mathematics or code.  It cannot be parsed with the techniques of\r\n% text analytics.  For example, the source texts for the books NCM and EXM\r\n% have hundreds of snippets of LaTeX like\r\n%\r\n%  \\begin{eqnarray*}\r\n%   A V \\eqs U \\Sigma , \\\\\r\n%   A^H U \\eqs V \\Sigma^H .\r\n%  \\end{eqnarray*}\r\n\r\n%%\r\n% And earlier in this blog post I had\r\n%\r\n%  tic\r\n%  [U,S,V] = svd(full(A),'econ'); \r\n%  toc\r\n\r\n%%\r\n% My |c5setup| program now has to skip over everything like this.\r\n% In doing so, it misses much of the message.\r\n\r\n%% Software\r\n% I have updated\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/59085-cleve-laboratory\r\n% Cleve's Laboratory> in the Central File Exchange to include |c5.m|\r\n% and |c5database.mat|.\r\n\r\n\r\n##### SOURCE END ##### e3bb1eb330494ccb89b42865eb518eee\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/c5_blog_01.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>I have been writing books, programs, newsletter columns and blogs since 1990.  
I have now collected all of this material into one repository.  Cleve's Corner Collection consists of 458 \"documents\", all available on the internet.  There are... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/08\/28\/c5-cleves-corner-collection-card-catalog\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[5,4,6,30,31,1],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2650"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=2650"}],"version-history":[{"count":1,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2650\/revisions"}],"predecessor-version":[{"id":2658,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2650\/revisions\/2658"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=2650"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=2650"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=2650"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}