{"id":2624,"date":"2017-07-31T12:00:02","date_gmt":"2017-07-31T17:00:02","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=2624"},"modified":"2017-08-01T06:40:38","modified_gmt":"2017-08-01T11:40:38","slug":"latent-semantic-indexing-svd-and-zipfs-law","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2017\/07\/31\/latent-semantic-indexing-svd-and-zipfs-law\/","title":{"rendered":"Latent Semantic Indexing, SVD, and Zipf&#8217;s Law"},"content":{"rendered":"\r\n\r\n<div class=\"content\"><!--introduction--><p>Latent Semantic Indexing, LSI, uses the Singular Value Decomposition of a term-by-document matrix to represent the information in the documents in a manner that facilitates responding to queries and other information retrieval tasks.  I set out to learn for myself how LSI is implemented.  I am finding that it is harder than I thought.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#1b6831e0-8b01-4949-8a22-ddf1043fadaf\">Background<\/a><\/li><li><a href=\"#926e0e76-9098-4a6d-bd96-8df1e6b13817\">Strings<\/a><\/li><li><a href=\"#96ada1db-0e12-428e-b583-6291256e8e05\">Cleve's Corner<\/a><\/li><li><a href=\"#361cddcb-561a-4a50-824d-d001cb6ffaa3\">Read posts<\/a><\/li><li><a href=\"#70ab1e38-1be7-4385-be93-8abdee85ae35\">Stop words<\/a><\/li><li><a href=\"#16e7320e-bd4a-442e-823f-c07d1a710e7c\">Find terms<\/a><\/li><li><a href=\"#a04ab702-e12d-4b9b-a003-f9d12af5c012\">No stop words<\/a><\/li><li><a href=\"#092a89be-3fc8-41e9-ac85-063e91b48baa\">Three characters or less<\/a><\/li><li><a href=\"#bca84cef-56a8-41b1-860b-cf1f0c62d7d1\">Five characters or less<\/a><\/li><li><a href=\"#03708ed7-cd8a-43b9-a78d-43ef2e7fd0f2\">Zipf's Law<\/a><\/li><li><a href=\"#0edd4c36-15ea-4bfc-9494-821877e6a2b0\">SVD<\/a><\/li><li><a href=\"#37a2cd29-d2ac-4e0f-9d31-813c77ca9653\">Query<\/a><\/li><li><a href=\"#662ea089-9fea-4a03-908a-accb6bc74dcb\">Try it<\/a><\/li><li><a 
href=\"#38ace8cc-cca2-48ec-aa4a-9f1fb830353b\">Code<\/a><\/li><\/ul><\/div><h4>Background<a name=\"1b6831e0-8b01-4949-8a22-ddf1043fadaf\"><\/a><\/h4><p>Latent Semantic Indexing was first described in 1990 by Susan Dumais and several colleagues at Bellcore, the descendant of the famous A.T.&amp;T. Bell Labs.  (Dumais is now at Microsoft Research.) I first heard about LSI a couple of years later from Mike Berry at the University of Tennessee.  Berry, Dumais and Gavin O'Brien have a paper, \"Using Linear Algebra for Intelligent Information Retrieval\" in <a href=\"http:\/\/epubs.siam.org\/doi\/10.1137\/1037127\">SIAM Review<\/a> in 1995. <a href=\"http:\/\/www2.denizyuret.com\/ref\/berry\/berry95using.pdf\">Here is a link<\/a> to a preprint of that paper.<\/p><p>An important problem in information retrieval involves synonyms. Different words can refer to the same, or very similar, topics. How can you retrieve a relevant document that does not actually use a key word in a query?  For example, consider the terms<\/p><div><ul><li>FFT (Fast Fourier Transform)<\/li><li>SVD (Singular Value Decomposition)<\/li><li>ODE (Ordinary Differential Equation)<\/li><\/ul><\/div><p>Someone looking for information about PCA (Principal Component Analysis) would be more interested in documents about SVD than those about the other two topics.  It is possible to discover this connection because documents about PCA and SVD are more likely to share terms like \"rank\" and \"subspace\" that are not present in the others. For information retrieval purposes, PCA and SVD are synonyms. Latent Semantic Indexing can reveal such connections.<\/p><h4>Strings<a name=\"926e0e76-9098-4a6d-bd96-8df1e6b13817\"><\/a><\/h4><p>I will make use of the new <tt>string<\/tt> object, introduced in recent versions of MATLAB. Until recently, the double quote was an illegal character in MATLAB. 
But now it delineates strings.<\/p><pre class=\"codeinput\">   s = <span class=\"string\">\"% Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_.\"<\/span>\r\n<\/pre><pre class=\"codeoutput\">s = \r\n    \"% Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_.\"\r\n<\/pre><p>To erase the leading percent sign and the LaTeX between the dollar signs.<\/p><pre class=\"codeinput\">   s = erase(s,<span class=\"string\">'%'<\/span>)\r\n   s = eraseBetween(s,<span class=\"string\">'$'<\/span>,<span class=\"string\">'$'<\/span>,<span class=\"string\">'Boundaries'<\/span>,<span class=\"string\">'inclusive'<\/span>)\r\n<\/pre><pre class=\"codeoutput\">s = \r\n    \" Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_.\"\r\ns = \r\n    \" Let  be the 4-by-4 magic square from Durer's _Melancholia_.\"\r\n<\/pre><h4>Cleve's Corner<a name=\"96ada1db-0e12-428e-b583-6291256e8e05\"><\/a><\/h4><p>The documents for this project are the MATLAB files that constitute the source text for this blog.  I started writing the blog edition of Cleve's Corner a little over five years ago, in June 2012.<\/p><p>Each post comes from a MATLAB file processed by the <tt>publish<\/tt> command. Most of the lines in the file are comments beginning with a single <tt>%<\/tt>. These become the prose in the post.  Comments beginning with a double <tt>%%<\/tt> generate the title and section headings. Text within '|' characters is rendered with a fixed width font to represent variables and code. Text within '$' characters is LaTeX that is eventually typeset by MathJax.<\/p><p>I have collected five years' worth of source files in one directory, <tt>Blogs<\/tt>, on my computer.  The statement<\/p><pre class=\"codeinput\">   D = doc_list(<span class=\"string\">'Blogs'<\/span>);\r\n<\/pre><p>produces a string array of the file names in <tt>Blogs<\/tt>. 
The first few lines of <tt>D<\/tt> are<\/p><pre class=\"codeinput\">   D(1:5)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  5&times;1 string array\r\n    \"apologies_blog.m\"\r\n    \"arrowhead2.m\"\r\n    \"backslash.m\"\r\n    \"balancing.m\"\r\n    \"biorhythms.m\"\r\n<\/pre><p>The statements<\/p><pre class=\"codeinput\">   n = length(D)\r\n   n\/5\r\n<\/pre><pre class=\"codeoutput\">n =\r\n   135\r\nans =\r\n    27\r\n<\/pre><p>tell me how many posts I've made, and how many posts per year. That's an average of about one every two weeks.<\/p><p>The file that generates today's post is included in the library.<\/p><pre class=\"codeinput\">    <span class=\"keyword\">for<\/span> j = 1:n\r\n        <span class=\"keyword\">if<\/span> contains(D{j},<span class=\"string\">\"lsi\"<\/span>)\r\n            j\r\n            D{j}\r\n        <span class=\"keyword\">end<\/span>\r\n    <span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">j =\r\n    75\r\nans =\r\n    'lsi_blog.m'\r\n<\/pre><h4>Read posts<a name=\"361cddcb-561a-4a50-824d-d001cb6ffaa3\"><\/a><\/h4><p>The following code prepends each file name with the directory name, reads each post into a string <tt>s<\/tt>, counts the total number of lines that I've written, and computes the average number of lines per post.<\/p><pre class=\"codeinput\">    lines = 0;\r\n    <span class=\"keyword\">for<\/span> j = 1:n\r\n        s = read_blog(<span class=\"string\">\"Blogs\/\"<\/span>+D{j});\r\n        lines = lines + length(s);\r\n    <span class=\"keyword\">end<\/span>\r\n    lines\r\n    avg = lines\/n\r\n<\/pre><pre class=\"codeoutput\">lines =\r\n       15578\r\navg =\r\n  115.3926\r\n<\/pre><h4>Stop words<a name=\"70ab1e38-1be7-4385-be93-8abdee85ae35\"><\/a><\/h4><p>The most common word in Cleve's Corner blog is \"the\". By itself, \"the\" accounts for over 8 percent of the total number of words.  
This is roughly the same percentage seen in larger samples of English prose.<\/p><p><i>Stop words<\/i> are short, frequently used words, like \"the\", that can be excluded from frequency counts because they are of little value when comparing documents. <a href=\"http:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/dropping-common-terms-stop-words-1.html\">Here is a list<\/a> of 25 commonly used stop words.<\/p><p>I am going to use a brute-force, easy-to-implement strategy for stop words.  I specify an integer parameter, <tt>stop<\/tt>, and simply ignore all words with <tt>stop<\/tt> or fewer characters.  A value of zero for <tt>stop<\/tt> means do not ignore any words.  After some experiments, which I am about to show, I will decide to take <tt>stop = 5<\/tt>.<\/p><h4>Find terms<a name=\"16e7320e-bd4a-442e-823f-c07d1a710e7c\"><\/a><\/h4><p>Here is the core code of this project.  The input is the list of documents <tt>D<\/tt> and the stop parameter <tt>stop<\/tt>.  The code scans all the documents, finding and counting the individual words. In the process it<\/p><div><ul><li>skips all noncomment lines<\/li><li>skips all embedded LaTeX<\/li><li>skips all code fragments<\/li><li>skips all URLs<\/li><li>ignores all words with <tt>stop<\/tt> or fewer characters<\/li><\/ul><\/div><p>The code prints some intermediate results and then returns three arrays.<\/p><div><ul><li>T, string array of unique words.  
These are the terms.<\/li><li>C, the frequency counts of T.<\/li><li>A, the (sparse) term\/document matrix.<\/li><\/ul><\/div><p>The term\/document matrix is <tt>m<\/tt> -by- <tt>n<\/tt>, where<\/p><div><ul><li><tt>m<\/tt> = length of T is the number of terms,<\/li><li><tt>n<\/tt> = length of D is the number of documents.<\/li><\/ul><\/div><pre class=\"codeinput\">   type <span class=\"string\">find_terms<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction [T,C,A] = find_terms(D,stop)\r\n    % [T,C,A] = find_terms(D,stop)\r\n    % Input\r\n    %     D = document list\r\n    %     stop = length of stop words\r\n    % Output\r\n    %     T = terms, sorted alphabetically\r\n    %     C = counts of terms\r\n    %     A = term\/document matrix\r\n    \r\n    [W,Cw,wt] = words_and_counts(D,stop);\r\n    T = terms(W);\r\n    m = length(T);\r\n    fprintf('\\n\\n   stop = %d\\n   words = %d\\n   m = %d\\n\\n',stop,wt,m)\r\n\r\n    A = term_doc_matrix(W,Cw,T);\r\n\r\n    C = full(sum(A,2));\r\n    [Cp,p] = sort(C,'descend');\r\n    Tp = T(p); % Terms sorted by frequency\r\n    freq = Cp\/wt;\r\n\r\n    ten = 10;\r\n    fprintf('   index         term    count  fraction\\n')\r\n    for k = 1:ten\r\n        fprintf('%8d %12s %8d %9.4f\\n',k,Tp(k),Cp(k),freq(k))\r\n    end\r\n    \r\n    total = sum(freq(1:ten));\r\n    fprintf('%s total = %7.4f\\n',blanks(24),total)\r\nend\r\n<\/pre><h4>No stop words<a name=\"a04ab702-e12d-4b9b-a003-f9d12af5c012\"><\/a><\/h4><p>Let's run that code with <tt>stop<\/tt> set to zero so there are no stop words. The ten most frequently used words account for over one-quarter of all the words and are useless for characterizing documents. 
Note that, by itself, \"the\" accounts for over 8 percent of all the words.<\/p><pre class=\"codeinput\">    stop = 0;\r\n    find_terms(D,stop);\r\n<\/pre><pre class=\"codeoutput\">\r\n\r\n   stop = 0\r\n   words = 114512\r\n   m = 8896\r\n\r\n   index         term    count  fraction\r\n       1          the     9404    0.0821\r\n       2           of     3941    0.0344\r\n       3            a     2885    0.0252\r\n       4          and     2685    0.0234\r\n       5           is     2561    0.0224\r\n       6           to     2358    0.0206\r\n       7           in     2272    0.0198\r\n       8         that     1308    0.0114\r\n       9          for     1128    0.0099\r\n      10         this      969    0.0085\r\n                         total =  0.2577\r\n<\/pre><h4>Three characters or less<a name=\"092a89be-3fc8-41e9-ac85-063e91b48baa\"><\/a><\/h4><p>Skipping all words with three or less characters cuts the total number of words almost in half and eliminates \"the\", but still leaves mostly uninteresting words in the top ten.<\/p><pre class=\"codeinput\">    stop = 3;\r\n    find_terms(D,stop);\r\n<\/pre><pre class=\"codeoutput\">\r\n\r\n   stop = 3\r\n   words = 67310\r\n   m = 8333\r\n\r\n   index         term    count  fraction\r\n       1         that     1308    0.0194\r\n       2         this      969    0.0144\r\n       3         with      907    0.0135\r\n       4       matrix      509    0.0076\r\n       5         from      481    0.0071\r\n       6       matlab      472    0.0070\r\n       7         have      438    0.0065\r\n       8        about      363    0.0054\r\n       9        first      350    0.0052\r\n      10        point      303    0.0045\r\n                         total =  0.0906\r\n<\/pre><h4>Five characters or less<a name=\"bca84cef-56a8-41b1-860b-cf1f0c62d7d1\"><\/a><\/h4><p>Setting <tt>stop<\/tt> to 4 is a reasonable choice, but let's be even more aggressive.  
With <tt>stop<\/tt> equal to 5, the ten most frequent words are now much more characteristic of this blog.  All further calculations use these results.<\/p><pre class=\"codeinput\">    stop = 5;\r\n    [T,C,A] = find_terms(D,stop);\r\n    m = length(T);\r\n<\/pre><pre class=\"codeoutput\">\r\n\r\n   stop = 5\r\n   words = 38856\r\n   m = 6459\r\n\r\n   index         term    count  fraction\r\n       1       matrix      509    0.0131\r\n       2       matlab      472    0.0121\r\n       3     function      302    0.0078\r\n       4       number      224    0.0058\r\n       5     floating      199    0.0051\r\n       6    precision      184    0.0047\r\n       7     computer      167    0.0043\r\n       8    algorithm      164    0.0042\r\n       9       values      160    0.0041\r\n      10    numerical      158    0.0041\r\n                         total =  0.0653\r\n<\/pre><h4>Zipf's Law<a name=\"03708ed7-cd8a-43b9-a78d-43ef2e7fd0f2\"><\/a><\/h4><p>This is just an aside.  <a href=\"http:\/\/en.wikipedia.org\/wiki\/Zipf%27s_law\">Zipf's Law<\/a> is named after George Kingsley Zipf, although he did not claim originality.  The law states that in a sample of natural language, the frequency of any word is inversely proportional to its index in the frequency table.<\/p><p>In other words, if the frequency counts <tt>C<\/tt> are sorted by frequency, then a plot of <tt>log(C(1:k))<\/tt> versus <tt>log(1:k)<\/tt> should be a straight line.  Our frequency counts only vaguely follow this law, but a log-log plot makes many quantities look proportional.  (We get a little better conformance to the Law with smaller values of <tt>stop<\/tt>.)<\/p><pre class=\"codeinput\">   Zipf(C)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/lsi_blog_01.png\" alt=\"\"> <h4>SVD<a name=\"0edd4c36-15ea-4bfc-9494-821877e6a2b0\"><\/a><\/h4><p>Now we're ready to compute the SVD.  
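For readers who think in NumPy rather than MATLAB, the economy-size factorization of a tall, skinny matrix looks like this; <tt>full_matrices=False<\/tt> plays the role of MATLAB's <tt>'econ'<\/tt> flag. The dimensions below mirror the post (about 6459 terms by 135 documents), but the matrix here is random data, just a sketch.

```python
import numpy as np

# Economy-size SVD of a tall, skinny matrix; full_matrices=False is the
# NumPy analogue of MATLAB's svd(A,'econ').  Random data stands in for
# the term/document matrix; shapes mirror the post (m terms, n documents).
m, n = 6459, 135
A = np.random.default_rng(0).random((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U has only n columns, not an m-by-m square factor -- that is the saving.
assert U.shape == (m, n) and s.shape == (n,) and Vt.shape == (n, n)
```

The economy factors still reproduce <tt>A<\/tt> exactly, because a matrix with 135 columns has rank at most 135.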
We might as well make <tt>A<\/tt> full; it has only <tt>n<\/tt> = 135 columns.  But <tt>A<\/tt> is very tall and skinny, so it's important to use the <tt>'econ'<\/tt> flag so that <tt>U<\/tt> also has only 135 columns.  Without this flag, we would be asking for a square <tt>m<\/tt> -by- <tt>m<\/tt> matrix <tt>U<\/tt> with <tt>m<\/tt> over 6000.<\/p><pre class=\"codeinput\">   [U,S,V] = svd(full(A),<span class=\"string\">'econ'<\/span>);\r\n<\/pre><h4>Query<a name=\"37a2cd29-d2ac-4e0f-9d31-813c77ca9653\"><\/a><\/h4><p>Here is a preliminary stab at a function to make queries.<\/p><pre class=\"codeinput\">   type <span class=\"string\">query<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n\r\nfunction posts = query(queries,k,cutoff,T,U,S,V,D)\r\n% posts = query(queries,k,cutoff,T,U,S,V,D) \r\n\r\n    % Rank k approximation to term\/document matrix.\r\n    Uk = U(:,1:k);\r\n    Sk = S(1:k,1:k);\r\n    Vk = V(:,1:k);\r\n    \r\n    % Construct the query vector from the relevant terms.\r\n    m = size(U,1);\r\n    q = zeros(m,1);\r\n    for i = 1:length(queries)\r\n        % Find the index of the query key word in the term list.\r\n        wi = word_index(queries{i},T);\r\n        q(wi) = 1;\r\n    end\r\n    \r\n    % Project the query onto the document space.\r\n    qhat = Sk\\Uk'*q;\r\n    v = Vk*qhat;\r\n    v = v\/norm(qhat);\r\n    \r\n    % Pick off the documents that are close to the query.\r\n    posts = D(v &gt; cutoff);\r\n    \r\n    query_plot(v,cutoff,queries)\r\nend\r\n<\/pre><h4>Try it<a name=\"662ea089-9fea-4a03-908a-accb6bc74dcb\"><\/a><\/h4><p>Set the rank k and the cutoff level.<\/p><pre class=\"codeinput\">   rank = 40;\r\n   cutoff = .2;\r\n<\/pre><p>Try a few one-word queries.<\/p><pre class=\"codeinput\">   queries = {<span class=\"string\">\"singular\"<\/span>,<span class=\"string\">\"roundoff\"<\/span>,<span class=\"string\">\"wilkinson\"<\/span>};\r\n   <span class=\"keyword\">for<\/span> j = 1:length(queries)\r\n       queryj = queries{j}\r\n     
  posts = query(queryj,rank,cutoff,T,U,S,V,D)\r\n       snapnow\r\n   <span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">queryj = \r\n    \"singular\"\r\nposts = \r\n  4&times;1 string array\r\n    \"eigshowp_w3.m\"\r\n    \"four_spaces_blog.m\"\r\n    \"parter_blog.m\"\r\n    \"svdshow_blog.m\"\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/lsi_blog_02.png\" alt=\"\"> <pre class=\"codeoutput\">queryj = \r\n    \"roundoff\"\r\nposts = \r\n  5&times;1 string array\r\n    \"condition_blog.m\"\r\n    \"floats.m\"\r\n    \"partial_pivot_blog.m\"\r\n    \"refine_blog.m\"\r\n    \"sanderson_blog.m\"\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/lsi_blog_03.png\" alt=\"\"> <pre class=\"codeoutput\">queryj = \r\n    \"wilkinson\"\r\nposts = \r\n  2&times;1 string array\r\n    \"dubrulle.m\"\r\n    \"jhw_1.m\"\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/lsi_blog_04.png\" alt=\"\"> <p>Note that the query about Wilkinson found the post about him and also the post about Dubrulle, who improved a Wilkinson algorithm.<\/p><p>For the finale, merge all the queries into one.<\/p><pre class=\"codeinput\">   queries\r\n   posts = query(queries,rank,cutoff,T,U,S,V,D)\r\n<\/pre><pre class=\"codeoutput\">queries =\r\n  1&times;3 cell array\r\n    [\"singular\"]    [\"roundoff\"]    [\"wilkinson\"]\r\nposts = \r\n  5&times;1 string array\r\n    \"four_spaces_blog.m\"\r\n    \"jhw_1.m\"\r\n    \"parter_blog.m\"\r\n    \"refine_blog.m\"\r\n    \"svdshow_blog.m\"\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/lsi_blog_05.png\" alt=\"\"> <p>------------------------------------------------------------------<\/p><h4>Code<a name=\"38ace8cc-cca2-48ec-aa4a-9f1fb830353b\"><\/a><\/h4><p><b>doc_list<\/b><\/p><pre 
class=\"codeinput\">   type <span class=\"string\">doc_list<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction D = doc_list(dir)\r\n    % D = doc_list(dir) is a string array of the file names in 'dir'.\r\n    D = \"\";\r\n    [status,L] = system(['ls ' dir]);\r\n    if status ~= 0\r\n        error(['Cannot find ' dir])\r\n    end\r\n    k1 = 1;\r\n    i = 1;\r\n    for k2 = find(L == newline)\r\n        D(i,:) = L(k1:k2-1);\r\n        i = i+1;\r\n        k1 = k2+1;\r\n    end      \r\nend\r\n<\/pre><p><b>erase_stuff<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">erase_stuff<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction sout = erase_stuff(s)\r\n   % s = erase_stuff(s)\r\n   % erases noncomments, !system, LaTeX, URLs, Courier, and punctuation.\r\n   j = 1;\r\n   k = 1;\r\n   while k &lt;= length(s)\r\n       sk = lower(char(s(k)));\r\n       if length(sk) &gt;= 4\r\n           if sk(1) ~= '%'\r\n               % Skip non comment\r\n               break\r\n           elseif any(sk=='!') &amp;&amp; any(sk=='|')\r\n               % Skip !system | ...\r\n               break\r\n           elseif all(sk(3:4) == '$')\r\n               sk(3:4) = '  ';\r\n               % Skip display latex\r\n               while all(sk~='$')\r\n                   k = k+1;\r\n                   sk = char(s(k));\r\n               end\r\n           else\r\n               % Embedded latex\r\n               sk = eraseBetween(sk,'$','$','Boundaries','inclusive');\r\n               \r\n               % URL\r\n               if any(sk == '&lt;')\r\n                   if ~any(sk == '&gt;')\r\n                       k = k+1;\r\n                       sk = [sk lower(char(s(k)))];\r\n                   end\r\n                   sk = eraseBetween(sk,'&lt;','&gt;','Boundaries','inclusive');\r\n                   sk = erase(sk,'&gt;');\r\n               end\r\n               if contains(string(sk),\"http\")\r\n                   break\r\n               end\r\n           
    \r\n               % Courier\r\n               f = find(sk == '|');\r\n               assert(length(f)==1 | mod(length(f),2)==0);\r\n               for i = 1:2:length(f)-1\r\n                   w = sk(f(i)+1:f(i+1)-1);\r\n                   if length(w)&gt;2 &amp;&amp; all(w&gt;='a') &amp;&amp; all(w&lt;='z')\r\n                       sk(f(i)) = ' ';\r\n                       sk(f(i+1)) = ' ';\r\n                   else\r\n                       sk(f(i):f(i+1)) = ' ';\r\n                   end\r\n               end\r\n               \r\n               % Punctuation\r\n               sk((sk&lt;'a' | sk&gt;'z') &amp; (sk~=' ')) = [];\r\n               \r\n               sout(j,:) = string(sk);\r\n               j = j+1;\r\n           end\r\n       end\r\n       k = k+1;\r\n   end\r\n   skip = k-j;\r\nend\r\n<\/pre><p><b>find_words<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">find_words<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction w = find_words(s,stop)\r\n% find_words(s,stop)   Find the words with length &gt; stop in the text string.\r\n    w = \"\";\r\n    i = 1;\r\n    for k = 1:length(s)\r\n        sk = [' ' char(s{k}) ' '];\r\n        f = strfind(sk,' ');\r\n        for j = 1:length(f)-1\r\n            t = strtrim(sk(f(j):f(j+1)));\r\n            % Skip stop words\r\n            if length(t) &lt;= stop\r\n                continue\r\n            end\r\n            if ~isempty(t)\r\n                w{i,1} = t;\r\n                i = i+1;\r\n            end\r\n        end\r\n    end   \r\n<\/pre><p><b>query_plot<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">query_plot<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction query_plot(v,cutoff,queries) \r\n    clf\r\n    shg\r\n    plot(v,'.','markersize',12)\r\n    ax = axis;\r\n    yax = 1.1*max(abs(ax(3:4)));\r\n    axis([ax(1:2) -yax yax])\r\n    line(ax(1:2),[cutoff cutoff],'color','k')\r\n    title(sprintf('%s, 
',queries{:}))\r\nend\r\n<\/pre><p><b>read_blog<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">read_blog<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction s = read_blog(filename)\r\n% read_blog(filename)\r\n% skip over lines that do not begin with '%'.\r\n   fid = fopen(filename);\r\n   line = fgetl(fid);\r\n   k = 1;\r\n   while ~isequal(line,-1)  % -1 signals eof\r\n       if length(line) &gt; 2 &amp;&amp; line(1) == '%'\r\n           s(k,:) = string(line);\r\n           k = k+1;\r\n       end\r\n       line = fgetl(fid);\r\n   end\r\n   fclose(fid);\r\nend\r\n<\/pre><p><b>term_doc_matrix<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">term_doc_matrix<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction A = term_doc_matrix(W,C,T)\r\n% A = term_doc_matrix(W,C,T)\r\n    m = length(T);\r\n    n = length(W);\r\n    A = sparse(m,n);\r\n    for j = 1:n\r\n        nzj = length(W{j});\r\n        i = zeros(nzj,1);\r\n        s = zeros(nzj,1);\r\n        for k = 1:nzj\r\n            i(k) = word_index(W{j}(k),T);\r\n            s(k) = C{j}(k);\r\n        end\r\n        A(i,j) = s;\r\n    end\r\nend\r\n<\/pre><p><b>word_count<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">word_count<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction [wout,count] = word_count(w)\r\n% [w,count] = word_count(w)   Unique words with counts.\r\n    w = sort(w);\r\n    wout = \"\";\r\n    count = [];\r\n    i = 1;\r\n    k = 1;\r\n    while k &lt;= length(w)\r\n        c = 1;\r\n        while k &lt; length(w) &amp;&amp; isequal(w{k},w{k+1})\r\n            c = c+1;\r\n            k = k+1;\r\n        end\r\n        wout{i,1} = w{k};\r\n        count(i,1) = c;\r\n        i = i+1;\r\n        k = k+1;\r\n    end\r\n    [count,p] = sort(count,'descend');\r\n    wout = wout(p);\r\nend\r\n<\/pre><p><b>word_index<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">word_index<\/span>\r\n<\/pre><pre 
class=\"codeoutput\">\r\nfunction p = word_index(w,list)\r\n    % Index of w in list.\r\n    % Returns empty if w is not in list.\r\n    m = length(list);\r\n    p = fix(m\/2);\r\n    q = ceil(p\/2);\r\n    t = 0;\r\n    tmax = ceil(log2(m));\r\n    % Binary search\r\n    while list(p) ~= w\r\n        if list(p) &gt; w\r\n            p = max(p-q,1);\r\n        else\r\n            p = min(p+q,m);\r\n        end\r\n        q = ceil(q\/2);\r\n        t = t+1;\r\n        if t == tmax\r\n            p = [];\r\n            break\r\n        end\r\n    end\r\nend\r\n<\/pre><p><b>words_and_counts<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">words_and_counts<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction [W,Cw,wt] = words_and_counts(D,stop)\r\n% [W,Cw,wt] = words_and_counts(D,stop)\r\n    W = [];  % Words\r\n    Cw = [];  % Counts \r\n    wt = 0;\r\n    n = length(D);\r\n    for j = 1:n\r\n        s = read_blog(\"Blogs\/\"+D{j});\r\n        s = erase_stuff(s);\r\n        w = find_words(s,stop);\r\n        [W{j,1},Cw{j,1}] = word_count(w);\r\n        wt = wt + length(w);\r\n    end\r\nend\r\n<\/pre><p><b>Zipf<\/b><\/p><pre class=\"codeinput\">   type <span class=\"string\">Zipf<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction Zipf(C)\r\n% Zipf(C) plots log2(C(1:k)) vs. 
log2(1:k)\r\n    C = sort(C,'descend');  % Sort counts by frequency\r\n    figure(1)\r\n    clf\r\n    shg\r\n    k = 256;\r\n    plot(log2(1:k),log2(C(1:k)),'.','markersize',12)\r\n    axis([-1 8 4 10])\r\n    xlabel('log2(index)')\r\n    ylabel('log2(count)')\r\n    title('Zipf''s Law')\r\n    drawnow\r\nend\r\n<\/pre><script language=\"JavaScript\"> <!-- \r\n    function grabCode_2ca2ead78f7d42a7918016e210d4f0dc() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='2ca2ead78f7d42a7918016e210d4f0dc ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 2ca2ead78f7d42a7918016e210d4f0dc';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_2ca2ead78f7d42a7918016e210d4f0dc()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\n2ca2ead78f7d42a7918016e210d4f0dc ##### SOURCE BEGIN #####\r\n%% Latent Semantic Indexing, SVD, and Zipf's Law\r\n% Latent Semantic Indexing, LSI, uses the Singular Value Decomposition\r\n% of a term-by-document matrix to represent the information in the \r\n% documents in a manner that facilitates responding to queries and other\r\n% information retrieval tasks.  I set out to learn for myself how\r\n% LSI is implemented.  I am finding that it is harder than I thought.\r\n\r\n%% Background\r\n% Latent Semantic Indexing was first described in 1990 by Susan Dumais\r\n% and several colleagues at Bellcore, the descendant of the famous\r\n% A.T.&T. Bell Labs.  (Dumais is now at Microsoft Research.)\r\n% I first heard about LSI a couple of years later from Mike Berry at\r\n% the University of Tennessee.  
Berry, Dumais and Gavin O'Brien have a\r\n% paper, \"Using Linear Algebra for Intelligent Information Retrieval\"\r\n% in <http:\/\/epubs.siam.org\/doi\/10.1137\/1037127 SIAM Review> in 1995.\r\n% <http:\/\/www2.denizyuret.com\/ref\/berry\/berry95using.pdf Here is a link>\r\n% to a preprint of that paper.\r\n\r\n%%\r\n% An important problem in information retrieval involves synonyms.\r\n% Different words can refer to the same, or very similar, topics.\r\n% How can you retrieve a relevant document that does not actually\r\n% use a key word in a query?  For example, consider the terms\r\n%\r\n% * FFT (Fast Fourier Transform)\r\n% * SVD (Singular Value Decomposition)\r\n% * ODE (Ordinary Differential Equation)\r\n%\r\n% Someone looking for information about PCA (Principal Component\r\n% Analysis) would be more interested in documents about SVD than those\r\n% about the other two topics.  It is possible to discover this connection\r\n% because documents about PCA and SVD are more likely to share terms\r\n% like \"rank\" and \"subspace\" that are not present in the others.\r\n% For information retrieval purposes, PCA and SVD are synonyms.\r\n% Latent Semantic Indexing can reveal such connections.\r\n\r\n%% Strings\r\n% I will make use of the new |string| object, introduced in recent versions\r\n% of MATLAB. Until recently, the double quote was an illegal character\r\n% in MATLAB. But now it delineates strings.\r\n\r\n   s = \"% Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_.\"\r\n   \r\n%%\r\n% To erase the leading percent sign and the LaTeX between the dollar signs.\r\n\r\n   s = erase(s,'%')\r\n   s = eraseBetween(s,'$','$','Boundaries','inclusive')\r\n\r\n%% Cleve's Corner\r\n% The documents for this project are the MATLAB files that constitute the\r\n% source text for this blog.  
I started writing the blog edition of\r\n% Cleve's Corner a little over five years ago, in June 2012.\r\n\r\n%% \r\n% Each post comes from a MATLAB file processed by the |publish| command.\r\n% Most of the lines in the file are comments beginning with a single |%|.\r\n% These become the prose in the post.  Comments beginning with a double\r\n% |%%| generate the title and section headings.  \r\n% Text within '|' characters\r\n% is rendered with a fixed width font to represent variables and code.\r\n% Text within '$' characters is LaTeX that is eventually typeset by\r\n% MathJax.\r\n\r\n%%\r\n% I have collected five years' worth of source files in one directory,\r\n% |Blogs|, on my computer.  The statement\r\n\r\n   D = doc_list('Blogs');\r\n\r\n%%\r\n% produces a string array of the file names in |Blogs|.\r\n% The first few lines of |D| are\r\n\r\n   D(1:5)\r\n   \r\n%%\r\n% The statements\r\n\r\n   n = length(D)\r\n   n\/5\r\n   \r\n%%\r\n% tell me how many posts I've made, and how many posts per year.\r\n% That's an average of about one every two weeks.\r\n\r\n%%\r\n% The file that generates today's post is included in the library.\r\n\r\n    for j = 1:n\r\n        if contains(D{j},\"lsi\")\r\n            j\r\n            D{j}\r\n        end\r\n    end\r\n\r\n%% Read posts\r\n% The following code prepends each file name with the directory name, \r\n% reads each post into a string |s|, counts the total number of lines\r\n% that I've written, and computes the average number of lines per post.\r\n\r\n    lines = 0;\r\n    for j = 1:n\r\n        s = read_blog(\"Blogs\/\"+D{j});\r\n        lines = lines + length(s);\r\n    end\r\n    lines\r\n    avg = lines\/n\r\n    \r\n%% Stop words\r\n% The most common word in Cleve's Corner blog is \"the\".\r\n% By itself, \"the\" accounts for over 8 percent of the total number of\r\n% words.  
This is roughly the same percentage seen in larger samples\r\n% of English prose.\r\n\r\n%%\r\n% _Stop words_ are short, frequently used words, like \"the\", that can be\r\n% excluded from frequency counts because they are of little value when\r\n% comparing documents.\r\n% <http:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/dropping-common-terms-stop-words-1.html\r\n% Here is a list> of 25 commonly used stop words.\r\n\r\n%%\r\n% I am going to use a brute-force, easy-to-implement strategy for stop\r\n% words.  I specify an integer parameter, |stop|, and simply ignore\r\n% all words with |stop| or fewer characters.  A value of zero for |stop|\r\n% means do not ignore any words.  After some experiments, which I am\r\n% about to show, I will take |stop = 5|.\r\n    \r\n%% Find terms\r\n% Here is the core code of this project.  The input is the list of\r\n% documents |D| and the stop parameter |stop|.  The code scans all\r\n% the documents, finding and counting the individual words.\r\n% In the process it\r\n%\r\n% * skips all noncomment lines\r\n% * skips all embedded LaTeX\r\n% * skips all code fragments\r\n% * skips all URLs\r\n% * ignores all words with |stop| or fewer characters\r\n\r\n%%\r\n% The code prints some intermediate results and then returns three arrays.\r\n% \r\n% * T, string array of unique words.  
These are the terms.\r\n% * C, the frequency counts of T.\r\n% * A, the (sparse) term\/document matrix.\r\n\r\n%% \r\n% The term\/document matrix is |m| -by- |n|, where\r\n%\r\n% * |m| = length of T is the number of terms,\r\n% * |n| = length of D is the number of documents.\r\n\r\n   type find_terms\r\n   \r\n%% No stop words\r\n% Let's run that code with |stop| set to zero so there are no stop words.\r\n% The ten most frequently used words account for over one-quarter of\r\n% all the words and are useless for characterizing documents.\r\n% Note that, by itself, \"the\" accounts for over 8 percent of all the words.\r\n\r\n    stop = 0;\r\n    find_terms(D,stop);\r\n    \r\n%% Three characters or fewer\r\n% Skipping all words with three or fewer characters cuts the total number\r\n% of words almost in half and eliminates \"the\",\r\n% but still leaves mostly uninteresting words in the top ten.\r\n\r\n    stop = 3;\r\n    find_terms(D,stop);\r\n\r\n%% Five characters or fewer\r\n% Setting |stop| to 4 is a reasonable choice, but let's be even more\r\n% aggressive.  With |stop| equal to 5, the ten most frequent words are \r\n% now much more characteristic of this blog.  All further calculations\r\n% use these results.\r\n \r\n    stop = 5;\r\n    [T,C,A] = find_terms(D,stop);\r\n    m = length(T);\r\n\r\n%% Zipf's Law\r\n% This is just an aside.  <http:\/\/en.wikipedia.org\/wiki\/Zipf%27s_law\r\n% Zipf's Law> is named after George Kingsley Zipf, although he did\r\n% not claim originality.  The law states that in a sample of natural\r\n% language, the frequency of any word is inversely proportional to\r\n% its rank in the frequency table.\r\n\r\n%%\r\n% In other words, if the frequency counts |C| are sorted by\r\n% frequency, then a plot of |log(C(1:k))| versus |log(1:k)|\r\n% should be a straight line with slope -1.  Our frequency counts\r\n% only vaguely follow this law, but a log-log plot makes many\r\n% quantities look proportional.  
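One way to quantify the\r\n% agreement is a least-squares fit to the log-log data; Zipf's Law\r\n% predicts a slope of -1.  This fit is my own aside, not part of the\r\n% retrieval machinery.\r\n\r\n   c = sort(C(:),'descend');     % counts in decreasing order\r\n   k = length(c);\r\n   p = polyfit(log((1:k)'),log(c),1);\r\n   slope = p(1)\r\n\r\n%%\r\n% 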
(We get a little better conformance\r\n% to the Law with smaller values of |stop|.)\r\n\r\n   Zipf(C)\r\n\r\n%% SVD\r\n% Now we're ready to compute the SVD.  We might as well make |A| full;\r\n% it has only |n| = 135 columns.  But |A| is very tall and skinny,\r\n% so it's important to use the |'econ'| flag so that |U| also has only\r\n% 135 columns.  Without this flag, we would be asking for a square\r\n% |m| -by- |m| matrix |U| with |m| over 6000.\r\n\r\n   [U,S,V] = svd(full(A),'econ');\r\n   \r\n%% Query\r\n% Here is a preliminary stab at a function to make queries.\r\n\r\n   type query\r\n\r\n%% Try it\r\n\r\n%%\r\n% Set the rank and the cutoff level.\r\n\r\n   rank = 40;\r\n   cutoff = .2;\r\n   \r\n%%\r\n% Try a few one-word queries.\r\n   \r\n   queries = {\"singular\",\"roundoff\",\"wilkinson\"};\r\n   for j = 1:length(queries)\r\n       queryj = queries{j}\r\n       posts = query(queryj,rank,cutoff,T,U,S,V,D)\r\n       snapnow\r\n   end\r\n   \r\n%%\r\n% Note that the query about Wilkinson found the post about him\r\n% and also the post about Dubrulle, who improved a Wilkinson algorithm.\r\n\r\n%%\r\n% For the finale, merge all the queries into one.\r\n   \r\n   queries\r\n   posts = query(queries,rank,cutoff,T,U,S,V,D)\r\n   \r\n%%\r\n% 
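To summarize the machinery, here is a hedged sketch of the conventional\r\n% rank-k LSI scoring from the Berry, Dumais and O'Brien paper.  It is not\r\n% the |query| listing above; the intermediate names |Uk|, |Sk|, |Vk| and\r\n% |B| are mine.\r\n\r\n   q = double(T(:) == \"singular\");   % raw query vector over the terms\r\n   k = 40;                           % rank of the approximation\r\n   Uk = U(:,1:k);  Sk = S(1:k,1:k);  Vk = V(:,1:k);\r\n   qk = Sk\\(Uk'*q);                  % query projected into k-space\r\n   B = Vk';                          % column j is document j in k-space\r\n   scores = (qk'*B)' .\/ (norm(qk)*sqrt(sum(B.^2))');   % cosines\r\n   hits = find(scores > cutoff)      % documents scoring above the cutoff\r\n\r\n%%\r\n% 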
------------------------------------------------------------------\r\n\r\n%% Code\r\n% *doc_list*\r\n   type doc_list\r\n  \r\n%%\r\n% *erase_stuff*\r\n   type erase_stuff\r\n   \r\n%%\r\n% *find_words*\r\n   type find_words\r\n   \r\n%%\r\n% *query_plot*\r\n   type query_plot\r\n   \r\n%%\r\n% *read_blog*\r\n   type read_blog\r\n   \r\n%%\r\n% *term_doc_matrix*\r\n   type term_doc_matrix\r\n   \r\n%%\r\n% *word_count*\r\n   type word_count\r\n   \r\n%%\r\n% *word_index*\r\n   type word_index\r\n   \r\n%%\r\n% *words_and_counts*\r\n   type words_and_counts\r\n   \r\n%%\r\n% *Zipf*\r\n   type Zipf\r\n   \r\n##### SOURCE END ##### 2ca2ead78f7d42a7918016e210d4f0dc\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img src=\"https:\/\/blogs.mathworks.com\/cleve\/files\/lsi_blog_01.png\" class=\"img-responsive attachment-post-thumbnail size-post-thumbnail wp-post-image\" alt=\"\" decoding=\"async\" loading=\"lazy\" \/><\/div><!--introduction--><p>Latent Semantic Indexing, LSI, uses the Singular Value Decomposition of a term-by-document matrix to represent the information in the documents in a manner that facilitates responding to queries and other information retrieval tasks.  I set out to learn for myself how LSI is implemented.  I am finding that it is harder than I thought.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2017\/07\/31\/latent-semantic-indexing-svd-and-zipfs-law\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":2626,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16,30],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2624"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=2624"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2624\/revisions"}],"predecessor-version":[{"id":2634,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/2624\/revisions\/2634"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media\/2626"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=2624"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=2624"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=2624"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}