Latent Semantic Indexing, SVD, and Zipf’s Law

Posted by Cleve Moler,

Latent Semantic Indexing, LSI, uses the Singular Value Decomposition of a term-by-document matrix to represent the information in the documents in a manner that facilitates responding to queries and other information retrieval tasks. I set out to learn for myself how LSI is implemented. I am finding that it is harder than I thought.

Background

Latent Semantic Indexing was first described in 1990 by Susan Dumais and several colleagues at Bellcore, the descendant of the famous AT&T Bell Labs. (Dumais is now at Microsoft Research.) I first heard about LSI a couple of years later from Mike Berry at the University of Tennessee. Berry, Dumais and Gavin O'Brien published a paper, "Using Linear Algebra for Intelligent Information Retrieval", in SIAM Review in 1995. Here is a link to a preprint of that paper.

An important problem in information retrieval involves synonyms. Different words can refer to the same, or very similar, topics. How can you retrieve a relevant document that does not actually use a key word in a query? For example, consider the terms

  • FFT (Fast Fourier Transform)
  • SVD (Singular Value Decomposition)
  • ODE (Ordinary Differential Equation)

Someone looking for information about PCA (Principal Component Analysis) would be more interested in documents about SVD than those about the other two topics. It is possible to discover this connection because documents about PCA and SVD are more likely to share terms like "rank" and "subspace" that are not present in the others. For information retrieval purposes, PCA and SVD are synonyms. Latent Semantic Indexing can reveal such connections.

Strings

I will make use of the new string object, introduced in recent versions of MATLAB. The double quote had long been an illegal character in MATLAB, but now it delimits strings.

   s = "% Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_."
s = 
    "% Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_."

The functions erase and eraseBetween remove the leading percent sign and the LaTeX between the dollar signs.

   s = erase(s,'%')
   s = eraseBetween(s,'$','$','Boundaries','inclusive')
s = 
    " Let $A$ be the 4-by-4 magic square from Durer's _Melancholia_."
s = 
    " Let  be the 4-by-4 magic square from Durer's _Melancholia_."

Cleve's Corner

The documents for this project are the MATLAB files that constitute the source text for this blog. I started writing the blog edition of Cleve's Corner a little over five years ago, in June 2012.

Each post comes from a MATLAB file processed by the publish command. Most of the lines in the file are comments beginning with a single %. These become the prose in the post. Comments beginning with a double %% generate the title and section headings. Text within '|' characters is rendered with a fixed width font to represent variables and code. Text within '$' characters is LaTeX that is eventually typeset by MathJax.
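As a hypothetical illustration (in Python, not part of the blog's own code), the line conventions described above can be classified like this:

```python
# Illustrative sketch of the publish-file conventions: a line beginning
# with %% is a heading, a line beginning with a single % is prose, and
# anything else is code.
def classify(line):
    if line.startswith("%%"):
        return "heading"
    if line.startswith("%"):
        return "prose"
    return "code"

print(classify("%% My Title"))     # heading
print(classify("% Some prose."))   # prose
print(classify("x = magic(4);"))   # code
```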

I have collected five years' worth of source files in one directory, Blogs, on my computer. The statement

   D = doc_list('Blogs');

produces a string array of the file names in Blogs. The first few lines of D are

   D(1:5)
ans = 
  5×1 string array
    "apologies_blog.m"
    "arrowhead2.m"
    "backslash.m"
    "balancing.m"
    "biorhythms.m"

The statements

   n = length(D)
   n/5
n =
   135
ans =
    27

tell me how many posts I've made, and how many posts per year. That's an average of about one every two weeks.

The file that generates today's post is included in the library.

    for j = 1:n
        if contains(D{j},"lsi")
            j
            D{j}
        end
    end
j =
    75
ans =
    'lsi_blog.m'

Read posts

The following code prepends each file name with the directory name, reads each post into a string s, counts the total number of lines that I've written, and computes the average number of lines per post.

    lines = 0;
    for j = 1:n
        s = read_blog("Blogs/"+D{j});
        lines = lines + length(s);
    end
    lines
    avg = lines/n
lines =
       15578
avg =
  115.3926

Stop words

The most common word in Cleve's Corner blog is "the". By itself, "the" accounts for over 8 percent of the total number of words. This is roughly the same percentage seen in larger samples of English prose.

Stop words are short, frequently used words, like "the", that can be excluded from frequency counts because they are of little value when comparing documents.

I am going to use a brute-force, easy-to-implement strategy for stop words. I specify an integer parameter, stop, and simply ignore all words with stop or fewer characters. A value of zero for stop means do not ignore any words. After the experiments shown below, I will settle on stop = 5.
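A minimal Python sketch of this length-based strategy (illustrative only; the blog's actual filtering is done by the MATLAB find_words listed below):

```python
# Keep only words longer than `stop` characters.
# stop = 0 keeps everything; stop = 3 drops "the", "svd", "of", "a".
def filter_stop(words, stop):
    return [w for w in words if len(w) > stop]

words = ["the", "svd", "matrix", "of", "matlab", "a"]
print(filter_stop(words, 0))   # keeps all six words
print(filter_stop(words, 3))   # ['matrix', 'matlab']
```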

Find terms

Here is the core code of this project. The input is the list of documents D and the stop parameter stop. The code scans all the documents, finding and counting the individual words. In the process it

  • skips all noncomment lines
  • skips all embedded LaTeX
  • skips all code fragments
  • skips all URLs
  • ignores all words with stop or fewer characters

The code prints some intermediate results and then returns three arrays.

  • T, string array of unique words. These are the terms.
  • C, the frequency counts of T.
  • A, the (sparse) term/document matrix.

The term/document matrix is m-by-n, where

  • m = length of T is the number of terms,
  • n = length of D is the number of documents.

   type find_terms
function [T,C,A] = find_terms(D,stop)
    % [T,C,A] = find_terms(D,stop)
    % Input
    %     D = document list
    %     stop = length of stop words
    % Output
    %     T = terms, sorted alphabetically
    %     C = counts of terms
    %     A = term/document matrix
    
    [W,Cw,wt] = words_and_counts(D,stop);
    T = terms(W);
    m = length(T);
    fprintf('\n\n   stop = %d\n   words = %d\n   m = %d\n\n',stop,wt,m)

    A = term_doc_matrix(W,Cw,T);

    C = full(sum(A,2));
    [Cp,p] = sort(C,'descend');
    Tp = T(p); % Terms sorted by frequency
    freq = Cp/wt;

    ten = 10;
    fprintf('   index         term    count  fraction\n')
    for k = 1:ten
        fprintf('%8d %12s %8d %9.4f\n',k,Tp(k),Cp(k),freq(k))
    end
    
    total = sum(freq(1:ten));
    fprintf('%s total = %7.4f\n',blanks(24),total)
end
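To make the term/document matrix concrete, here is a tiny Python sketch with two invented documents (a hypothetical example, not the blog's data or its MATLAB code): A(i,j) counts how often term i appears in document j.

```python
# Build a small term/document count matrix from tokenized documents.
from collections import Counter

docs = [["matrix", "svd", "matrix"],   # document 1
        ["matlab", "svd"]]             # document 2

terms = sorted({w for d in docs for w in d})   # unique terms, alphabetical
A = [[Counter(d)[t] for d in docs] for t in terms]

print(terms)   # ['matlab', 'matrix', 'svd']
print(A)       # [[0, 1], [2, 0], [1, 1]]
```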

No stop words

Let's run that code with stop set to zero so there are no stop words. The ten most frequently used words account for over one-quarter of all the words and are useless for characterizing documents. Note that, by itself, "the" accounts for over 8 percent of all the words.

    stop = 0;
    find_terms(D,stop);

   stop = 0
   words = 114512
   m = 8896

   index         term    count  fraction
       1          the     9404    0.0821
       2           of     3941    0.0344
       3            a     2885    0.0252
       4          and     2685    0.0234
       5           is     2561    0.0224
       6           to     2358    0.0206
       7           in     2272    0.0198
       8         that     1308    0.0114
       9          for     1128    0.0099
      10         this      969    0.0085
                         total =  0.2577

Three characters or fewer

Skipping all words with three or fewer characters cuts the total number of words almost in half and eliminates "the", but still leaves mostly uninteresting words in the top ten.

    stop = 3;
    find_terms(D,stop);

   stop = 3
   words = 67310
   m = 8333

   index         term    count  fraction
       1         that     1308    0.0194
       2         this      969    0.0144
       3         with      907    0.0135
       4       matrix      509    0.0076
       5         from      481    0.0071
       6       matlab      472    0.0070
       7         have      438    0.0065
       8        about      363    0.0054
       9        first      350    0.0052
      10        point      303    0.0045
                         total =  0.0906

Five characters or fewer

Setting stop to 4 is a reasonable choice, but let's be even more aggressive. With stop equal to 5, the ten most frequent words are now much more characteristic of this blog. All further calculations use these results.

    stop = 5;
    [T,C,A] = find_terms(D,stop);
    m = length(T);

   stop = 5
   words = 38856
   m = 6459

   index         term    count  fraction
       1       matrix      509    0.0131
       2       matlab      472    0.0121
       3     function      302    0.0078
       4       number      224    0.0058
       5     floating      199    0.0051
       6    precision      184    0.0047
       7     computer      167    0.0043
       8    algorithm      164    0.0042
       9       values      160    0.0041
      10    numerical      158    0.0041
                         total =  0.0653

Zipf's Law

This is just an aside. Zipf's Law is named after George Kingsley Zipf, although he did not claim originality. The law states that in a sample of natural language, the frequency of any word is inversely proportional to its rank in the frequency table.

In other words, if the frequency counts C are sorted in decreasing order, then a plot of log(C(1:k)) versus log(1:k) should be a straight line with slope -1. Our frequency counts only vaguely follow this law, but a log-log plot makes many quantities look proportional. (We get a little better conformance to the law with smaller values of stop.)

   Zipf(C)
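To see why the straight-line claim holds, here is a quick numerical check in Python (illustrative; the Zipf function above does the actual MATLAB plotting): with counts exactly proportional to 1/k, every chord of the log-log plot has slope -1.

```python
# Ideal Zipf counts: C(k) proportional to 1/k.  The slope between
# consecutive points on a log-log plot should be -1 everywhere.
import math

C = [1000 / k for k in range(1, 257)]
slopes = [(math.log(C[k]) - math.log(C[k - 1])) /
          (math.log(k + 1) - math.log(k))
          for k in range(1, len(C))]
print(min(slopes), max(slopes))   # both very close to -1
```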

SVD

Now we're ready to compute the SVD. We might as well make A full; it has only n = 135 columns. But A is very tall and skinny, so it's important to use the 'econ' flag so that U also has only 135 columns. Without this flag, we would be asking for a square m-by-m matrix U with m over 6000.

   [U,S,V] = svd(full(A),'econ');
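A back-of-the-envelope count, using the m and n reported above, shows what 'econ' saves (a Python arithmetic sketch, not part of the post):

```python
# Storage for U: a full SVD returns a square m-by-m U,
# while the economy SVD returns only m-by-n.
m, n = 6459, 135          # terms by documents, from the output above
full_U = m * m            # entries in the square U
econ_U = m * n            # entries with 'econ'
print(full_U // econ_U)   # 47: roughly 48x fewer entries
```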

Query

Here is a preliminary stab at a function to make queries.

   type query

function posts = query(queries,k,cutoff,T,U,S,V,D)
% posts = query(queries,k,cutoff,T,U,S,V,D) 

    % Rank k approximation to term/document matrix.
    Uk = U(:,1:k);
    Sk = S(1:k,1:k);
    Vk = V(:,1:k);
    
    % Construct the query vector from the relevant terms.
    m = size(U,1);
    q = zeros(m,1);
    for i = 1:length(queries)
        % Find the index of the query key word in the term list.
        wi = word_index(queries{i},T);
        q(wi) = 1;
    end
    
    % Project the query onto the document space.
    qhat = Sk\Uk'*q;
    v = Vk*qhat;
    v = v/norm(qhat);
    
    % Pick off the documents that are close to the query.
    posts = D(v > cutoff);
    
    query_plot(v,cutoff,queries)
end
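In matrix terms, the function above uses the rank-k approximation $A \approx U_k S_k V_k^T$ and projects the query vector $q$ into document space:

$$\hat q = S_k^{-1} U_k^T q, \qquad v = \frac{V_k \hat q}{\|\hat q\|}.$$

Document $j$ is returned whenever $v(j)$ exceeds the cutoff.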

Try it

Set the rank k and the cutoff level.

   rank = 40;
   cutoff = .2;

Try a few one-word queries.

   queries = {"singular","roundoff","wilkinson"};
   for j = 1:length(queries)
       queryj = queries{j}
       posts = query(queryj,rank,cutoff,T,U,S,V,D)
       snapnow
   end
queryj = 
    "singular"
posts = 
  4×1 string array
    "eigshowp_w3.m"
    "four_spaces_blog.m"
    "parter_blog.m"
    "svdshow_blog.m"
queryj = 
    "roundoff"
posts = 
  5×1 string array
    "condition_blog.m"
    "floats.m"
    "partial_pivot_blog.m"
    "refine_blog.m"
    "sanderson_blog.m"
queryj = 
    "wilkinson"
posts = 
  2×1 string array
    "dubrulle.m"
    "jhw_1.m"

Note that the query about Wilkinson found the post about him and also the post about Dubrulle, who improved a Wilkinson algorithm.

For the finale, merge all the queries into one.

   queries
   posts = query(queries,rank,cutoff,T,U,S,V,D)
queries =
  1×3 cell array
    ["singular"]    ["roundoff"]    ["wilkinson"]
posts = 
  5×1 string array
    "four_spaces_blog.m"
    "jhw_1.m"
    "parter_blog.m"
    "refine_blog.m"
    "svdshow_blog.m"

------------------------------------------------------------------

Code

doc_list

   type doc_list
function D = doc_list(dir)
    % D = doc_list(dir) is a string array of the file names in 'dir'.
    D = "";
    [status,L] = system(['ls ' dir]);
    if status ~= 0
        error(['Cannot find ' dir])
    end
    k1 = 1;
    i = 1;
    for k2 = find(L == newline)
        D(i,:) = L(k1:k2-1);
        i = i+1;
        k1 = k2+1;
    end      
end

erase_stuff

   type erase_stuff
function sout = erase_stuff(s)
   % s = erase_stuff(s)
   % erases noncomments, !system, LaTeX, URLs, Courier, and punctuation.
   j = 1;
   k = 1;
   while k <= length(s)
       sk = lower(char(s(k)));
       if length(sk) >= 4
           if sk(1) ~= '%'
               % Skip non comment
               break
           elseif any(sk=='!') && any(sk=='|')
               % Skip !system | ...
               break
           elseif all(sk(3:4) == '$')
               sk(3:4) = '  ';
               % Skip display latex
               while all(sk~='$')
                   k = k+1;
                   sk = char(s(k));
               end
           else
               % Embedded latex
               sk = eraseBetween(sk,'$','$','Boundaries','inclusive');
               
               % URL
               if any(sk == '<')
                   if ~any(sk == '>')
                       k = k+1;
                       sk = [sk lower(char(s(k)))];
                   end
                   sk = eraseBetween(sk,'<','>','Boundaries','inclusive');
                   sk = erase(sk,'>');
               end
               if contains(string(sk),"http")
                   break
               end
               
               % Courier
               f = find(sk == '|');
               assert(length(f)==1 | mod(length(f),2)==0);
               for i = 1:2:length(f)-1
                   w = sk(f(i)+1:f(i+1)-1);
                   if length(w)>2 && all(w>='a') && all(w<='z')
                       sk(f(i)) = ' ';
                       sk(f(i+1)) = ' ';
                   else
                       sk(f(i):f(i+1)) = ' ';
                   end
               end
               
               % Punctuation
               sk((sk<'a' | sk>'z') & (sk~=' ')) = [];
               
               sout(j,:) = string(sk);
               j = j+1;
           end
       end
       k = k+1;
   end
   skip = k-j;
end

find_words

   type find_words
function w = find_words(s,stop)
% w = find_words(s,stop)   Find the words with length > stop in the text string.
    w = "";
    i = 1;
    for k = 1:length(s)
        sk = [' ' char(s{k}) ' '];
        f = strfind(sk,' ');
        for j = 1:length(f)-1
            t = strtrim(sk(f(j):f(j+1)));
            % Skip stop words
            if length(t) <= stop
                continue
            end
            if ~isempty(t)
                w{i,1} = t;
                i = i+1;
            end
        end
    end
end

query_plot

   type query_plot
function query_plot(v,cutoff,queries) 
    clf
    shg
    plot(v,'.','markersize',12)
    ax = axis;
    yax = 1.1*max(abs(ax(3:4)));
    axis([ax(1:2) -yax yax])
    line(ax(1:2),[cutoff cutoff],'color','k')
    title(sprintf('%s, ',queries{:}))
end

read_blog

   type read_blog
function s = read_blog(filename)
% read_blog(filename)
% skip over lines that do not begin with '%'.
   fid = fopen(filename);
   line = fgetl(fid);
   k = 1;
   while ~isequal(line,-1)  % -1 signals eof
       if length(line) > 2 && line(1) == '%'
           s(k,:) = string(line);
           k = k+1;
       end
       line = fgetl(fid);
   end
   fclose(fid);
end

term_doc_matrix

   type term_doc_matrix
function A = term_doc_matrix(W,C,T)
% A = term_doc_matrix(W,C,T)
    m = length(T);
    n = length(W);
    A = sparse(m,n);
    for j = 1:n
        nzj = length(W{j});
        i = zeros(nzj,1);
        s = zeros(nzj,1);
        for k = 1:nzj
            i(k) = word_index(W{j}(k),T);
            s(k) = C{j}(k);
        end
        A(i,j) = s;
    end
end

word_count

   type word_count
function [wout,count] = word_count(w)
% [w,count] = word_count(w)   Unique words with counts.
    w = sort(w);
    wout = "";
    count = [];
    i = 1;
    k = 1;
    while k <= length(w)
        c = 1;
        while k < length(w) && isequal(w{k},w{k+1})
            c = c+1;
            k = k+1;
        end
        wout{i,1} = w{k};
        count(i,1) = c;
        i = i+1;
        k = k+1;
    end
    [count,p] = sort(count,'descend');
    wout = wout(p);
end

word_index

   type word_index
function p = word_index(w,list)
    % Index of w in list.
    % Returns empty if w is not in list.
    m = length(list);
    p = fix(m/2);
    q = ceil(p/2);
    t = 0;
    tmax = ceil(log2(m));
    % Binary search
    while list(p) ~= w
        if list(p) > w
            p = max(p-q,1);
        else
            p = min(p+q,m);
        end
        q = ceil(q/2);
        t = t+1;
        if t == tmax
            p = [];
            break
        end
    end
end
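For comparison, the same sorted-list lookup can be sketched with Python's bisect module (an illustration, not the post's code; it returns None rather than MATLAB's empty array when the word is absent):

```python
# Binary search for a word in a sorted term list.
import bisect

terms = ["algorithm", "matlab", "matrix", "svd"]   # sorted, like T

def word_index(w, lst):
    i = bisect.bisect_left(lst, w)
    return i if i < len(lst) and lst[i] == w else None

print(word_index("matrix", terms))   # 2
print(word_index("zipf", terms))     # None
```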

words_and_counts

   type words_and_counts
function [W,Cw,wt] = words_and_counts(D,stop)
% [W,Cw,wt] = words_and_counts(D,stop)
    W = [];  % Words
    Cw = [];  % Counts 
    wt = 0;
    n = length(D);
    for j = 1:n
        s = read_blog("Blogs/"+D{j});
        s = erase_stuff(s);
        w = find_words(s,stop);
        [W{j,1},Cw{j,1}] = word_count(w);
        wt = wt + length(w);
    end
end

Zipf

   type Zipf
function Zipf(C)
% Zipf(C) plots log2(C(1:k)) vs. log2(1:k)
    C = sort(C,'descend');  % Sort counts by frequency
    figure(1)
    clf
    shg
    k = 256;
    plot(log2(1:k),log2(C(1:k)),'.','markersize',12)
    axis([-1 8 4 10])
    xlabel('log2(index)')
    ylabel('log2(count)')
    title('Zipf''s Law')
    drawnow
end


Published with MATLAB® R2017a


5 Comments (Oldest to Newest)

Michael Berry replied on : 1 of 5

Well done Cleve! Great illustration of Matlab for document parsing and SVD-based modeling of text.

Tom Holden replied on : 3 of 5

Hi Cleve,
In regards to your article on quadruple precision, on which the comments are now closed, a much better approach is to represent an extended precision number as an unevaluated sum of two or more double precision numbers. This permits the use of existing fast code for double precision math, including vectorisation. A fairly comprehensive implementation (though somewhat sparsely documented!) is available here: https://github.com/tholden/DoubleDouble
Hope this is of some interest.
Best regards,
Tom

Cleve Moler replied:

I did mention double double in my quad precision post, in the paragraph titled “Beyond Double”. One big advantage that quad precision has over double double is the much larger exponent range.

Tom Holden replied on : 5 of 5

I missed that somehow, sorry. But as something that is acceptably fast in MATLAB right now, double double surely wins. (Test my class to see!) And it’s not quite true that the exponent bits of the second double are ignored. Numbers that are 1+something_small are relatively common in numerical computing, and with double double, the exponent bits of the second double are used to ensure that something_small is stored as accurately as it would be without the “1+”. In the kind of problems I work with, the need for big exponent ranges can always be simply fixed by working in logs (being careful with e.g. log(sum(exp(…))) of course).