C^5, Cleve’s Corner Collection Card Catalog

I have been writing books, programs, newsletter columns and blogs since 1990. I have now collected all of this material into one repository. Cleve's Corner Collection consists of 458 "documents", all available on the internet. There are

  • 150 posts from Cleve's Corner blog.
  • 43 columns from Cleve's Corner News and Notes edition.
  • 33 chapters from two books, Experiments with MATLAB and Numerical Computing with MATLAB.
  • 218 programs from Cleve's Laboratory, EXM and NCM.
  • 14 video transcripts from MIT OpenCourseWare and elsewhere.

C^5 is an app, a search tool that acts like a traditional library card catalog. It allows you to do keyword-based searches through the collection and follow links to the material on the net. Responses to queries are ordered by the scores generated by Latent Semantic Indexing, LSI, which employs the singular value decomposition of a term-document matrix of keyword counts.

Opening figure

Here is the opening window for C^5.

   c5

Enter a query, usually just a single key word, in the edit box at the top. This is a term. The names of the various documents that are relevant to the term are then displayed, one at a time, in the document box.

The arrow keys allow the document list to be scanned and changed. The LSI score determines the ordering of the list. The term count is the number of times, if any, that the query term appears in the document. The web button accesses a copy of the document on the internet.

Don Knuth

For my first example, let's search for material I have written that mentions Stanford Emeritus Professor of Computer Science, Donald Knuth. Enter "knuth" in the query box or on the command line.

   c5 knuth

The first document name "blog/c5_blog.m" refers to this blog post, so there is a bit of self reference happening here. The document suffix is .m because the source texts for my blog are MATLAB programs processed by the publish command.

The term count "10, 10/29" indicates that "knuth" appears 10 times in this document and that so far we have seen 10 out of the 29 times that "knuth" appears in the entire collection.

Click on the log button and then click a dozen or so times on the right arrow. This log of the successive displays is printed in the command window. Document names, dates, and term counts are displayed in decreasing order of LSI score.

% knuth

% arrow                     document     term counts     lsi   date
%                        blog/c5_blog.m   10   10/29    0.594  28-Aug-2017
%  >                 blog/easter_blog.m    5   15/29    0.553  18-Mar-2013
%  >               blog/lambertw_blog.m    3   18/29    0.183  02-Sep-2013
%  >           news/stiff_equations.txt    2   20/29    0.182  01-May-2003
%  >                blog/hilbert_blog.m    2   22/29    0.139  02-Feb-2013
%  >                 blog/random_blog.m    2   24/29    0.139  17-Apr-2015
%  >                      exmm/easter.m    1   25/29    0.112  2016
%  >                         blog/gef.m    3   28/29    0.100  07-Jan-2013
%  >                     ncmm/ode23tx.m    0   28/29    0.086  2016
%  >           news/normal_behavior.txt    0   28/29    0.070  01-May-2001
%  >                     blog/magic_2.m    0   28/29    0.059  05-Nov-2012
%                                         ..........
%  >>                    blog/denorms.m    1   29/29    0.010  21-Jul-2014

The second most relevant document, "easter_blog.m", is a post from 2013 that describes an algorithm, popularized by Knuth, for computing the date on which Easter Sunday falls each year in the Western, or Gregorian, calendar. The term count is "5, 15/29", so the first two documents account for slightly over half of the total appearances of the search term.

The next six lines tell us that "knuth" appears in blog posts about the Lambert W function, Hilbert matrices, random numbers, and George Forsythe (gef), as well as a MATLAB News and Notes column in 2003 about stiff differential equations, and the actual MATLAB program from EXM for computing the date of Easter.

The following results with term counts of zero are documents that do not contain "knuth", but which have LSI scores indicating they might be relevant. Finally, the blog post named "denorms" is about denormal floating point numbers. It is reached by right-clicking on the right arrow to skip over documents with term counts of zero.
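The running totals in the term count column are cumulative sums of the individual counts, taken in order of decreasing LSI score. Here is a minimal sketch of that arithmetic, using the counts copied from the first eleven rows of the log above. The variable names are mine, chosen for illustration, and are not taken from c5.m.

```matlab
% Term counts for "knuth", copied from the first eleven rows of the log.
counts = [10 5 3 2 2 2 1 3 0 0 0];
total = 29;                        % total appearances, the denominator in the log

running = cumsum(counts);          % 10, 15, 18, ... as in the "10/29" column
fraction = running/total;          % share of all appearances seen so far

% One row per document: count, running total, and fraction of the total.
fprintf('%3d  %3d/%d  %5.3f\n', ...
    [counts; running; total*ones(size(counts)); fraction])
```

The second entry of fraction is 15/29, which is the "slightly over half" mentioned above.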

c5setup

I don't know how to parse .html or .pdf files, so I have collected the original source material for everything that I have written that is now available on the web. There are .m files for the blog and MATLAB programs, .tex files for the LaTeX of the book chapters, and .txt files for the newsletter columns and transcripts of the videos. There are 458 files totaling about 3.24 megabytes of text.

I have a program, c5setup, that I run on my own laptop to extract all the individual words and produce the term-document matrix. This is a sparse matrix whose (k,j)-th entry is the number of times that the k-th term appears in the j-th document. It is saved in c5database.mat for use by the c^5 app.

This setup processing eliminates frequently occurring English words, like "the", that appear on a list of stopwords.

   length(stopwords)
ans =
   177
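To make the construction concrete, here is a small sketch of how a term-document matrix can be built from a few short texts. It is not the c5setup program. The three sample documents, the four-word stopword list, and the simple split on white space are all assumptions made for illustration.

```matlab
% Three hypothetical documents and a tiny hypothetical stopword list.
docs = ["The matrix is sparse"
        "MATLAB computes the SVD of the matrix"
        "The SVD of a sparse matrix"];
stopwords = ["the" "is" "of" "a"];

words = strings(0,1);              % every retained word, in order
docid = zeros(0,1);                % the document each word came from
for j = 1:numel(docs)
    w = split(lower(docs(j)));     % break the text into individual words
    w = w(~ismember(w,stopwords)); % drop the stopwords
    words = [words; w];
    docid = [docid; j*ones(numel(w),1)];
end

[T,~,k] = unique(words);           % T is the sorted term list, k the row of each word
A = sparse(k,docid,1,numel(T),numel(docs));   % repeated (k,j) pairs are summed
full(A)
```

The sparse function adds up values with repeated subscripts, so each entry of A ends up as a count without any explicit tallying.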

c5database

   clear
   load c5database
   whos
  Name          Size               Bytes  Class     Attributes

  A         16315x458            1552776  double    sparse    
  D           458x1                40628  string              
  L             1x1               120156  struct              
  T         16315x1              1132930  string              

  • A is the term-document matrix.
  • D is a string array of the file names in my personal repository of the source documents.
  • L is a struct containing string arrays used to generate URLs of the documents on the web.
  • T is a string array of key words or terms.

Sparsity

The sparsity of the term-document matrix is a little over one percent.

   sparsity = nnz(A)/numel(A)
sparsity =
    0.0130

Spy

Spy plot of the first 1000 rows of the term-document matrix.

   clf
   spy(A(1:1000,:))

Most frequent terms

The row sums are the total term counts.

   ttc = sum(A,2);

Find the terms that occur at least 1000 times.

   k = find(ttc >= 1000);
   fprintf('%-10s %6s\n',[T(k) num2str(ttc(k))]')
function     1806
matlab       1407
matrix       1499
one          1262
two          1090

Surprise. I write a lot about MATLAB and matrices.

Singular values

We might as well compute all the singular values of the full matrix. It takes less than a second. It's important to use the economical version of the SVD that produces a U the same size as A. Otherwise we'd have a 16,315-by-16,315 U.

   tic
   [U,S,V] = svd(full(A),'econ');
   toc
Elapsed time is 0.882556 seconds.
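The difference between the two forms of the SVD is easy to see on a small example. This sketch uses a hypothetical 200-by-5 random matrix B in place of A, so the names B, Ue, Uf and the sizes are illustrative only.

```matlab
rng(0)                         % fixed seed, only so the example is repeatable
B = rand(200,5);               % hypothetical tall matrix standing in for A

[Ue,Se,Ve] = svd(B,'econ');    % economical version
[Uf,Sf,Vf] = svd(B);           % full version

size(Ue)                       % 200-by-5, the same size as B
size(Uf)                       % 200-by-200
norm(Ue*Se*Ve' - B)            % the economical factors still reproduce B to roundoff
```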

A logarithmic plot of the singular values shows that they do not decrease very rapidly.

   clf
   semilogy(diag(S),'.','markersize',10)
   axis([-10 450 1 1000])
   title('singular values')

Reduced rank approximation

I wrote a post about Latent Semantic Indexing a month ago. LSI employs a reduced rank approximation to the term-document matrix. c^5 has a slider for choosing the rank. The plot of the singular values shows that the accuracy of the approximation is pretty much independent of the chosen value. Any value except very small values or large values near full rank gives an approximation good to between one and ten percent. The power of LSI does not derive from the approximation accuracy. I usually take the rank to be about half the number of columns.

   n = size(A,2);
   k = n/2;
   Uk = U(:,1:k);
   Sk = S(1:k,1:k);
   Vk = V(:,1:k);
   relerr = norm(Uk*Sk*Vk'-A)/S(1,1)
relerr =
    0.0357
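For readers who have not seen LSI before, here is a sketch of one common way to turn the reduced rank approximation into scores: take the cosine of the angle between a query vector and each column of the approximation. This is an assumption about the general technique, not a description of the scoring code inside c5.m. The five terms and the 5-by-4 count matrix are made up.

```matlab
% Hypothetical term list and 5-term, 4-document count matrix.
T = ["easter"; "knuth"; "matrix"; "random"; "svd"];
A = sparse([2 0 0 1
            3 1 0 1
            0 4 2 0
            0 0 3 1
            0 2 1 0]);

[U,S,V] = svd(full(A),'econ');
k = 2;                                    % reduced rank
Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';       % rank-k approximation to A

q = double(T == "knuth");                 % query vector, 1 in the row of the term
colnorms = sqrt(sum(Ak.^2,1));            % 2-norm of each column of Ak
scores = (q'*Ak)./(norm(q)*colnorms);     % cosine between q and each column

[~,order] = sort(scores,'descend')        % documents, most relevant first
```

Because the scores come from Ak rather than A, a document can receive a nonzero score for a term it never contains, which is how the zero-count documents appear in the logs.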

Arrow keys

The three arrow keys in the c^5 app can be clicked with either the left or right mouse button (or control-click on a one-button mouse).

  • left >: next document, any term count.
  • right >: next document with nonzero term count.
  • left <: previous document, any term count.
  • right <: previous document with nonzero term count.
  • left ^: use the root of the current document for the query.
  • right ^: use a random term for the query.

Repeatedly clicking the up arrow with the right button (an alt click) is a good way to browse the entire collection.
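The right-click behavior of the > and < arrows amounts to a search for the nearest nonzero term count. A minimal sketch, with a hypothetical list of counts already sorted by LSI score, and with the assumption that the position stays put when no such document remains:

```matlab
counts = [10 5 0 0 2 0 1];     % hypothetical term counts, in LSI order
current = 2;                   % index of the document now displayed

% Right click on >: next document with a nonzero term count.
step = find(counts(current+1:end) > 0, 1);
if isempty(step)
    next_nonzero = current;    % assumption: do not move if none is left
else
    next_nonzero = current + step;
end

% Right click on <: previous document with a nonzero term count.
prev_nonzero = find(counts(1:current-1) > 0, 1, 'last');
if isempty(prev_nonzero)
    prev_nonzero = current;    % same assumption at the start of the list
end
```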

Lothar Collatz

Let's see the logs for two more examples. Lothar Collatz has a short log.

   c5 Collatz
% collatz
%
% arrow                     document     term counts     lsi   date
%               blog/threenplus1_blog.m    9    9/19    0.904  19-Jan-2015
%  >>         blog/collatz_inequality.m    4   13/19    0.108  16-Mar-2015
%  >>                    blog/c5_blog.m    5   18/19    0.075  28-Aug-2017
%  >>                     ncm/intro.tex    1   19/19   -0.003  2004

Collatz appears in two posts from 2015, one on his 3n+1 problem and one on an elegant inequality that produces a surprising graphic, and in the section of this blog post about c^5 that you are now reading. He is also mentioned in the introduction to the NCM book, but the LSI value is very small. The double arrow at the beginning of each line signifies a right click, skipping over documents that do not mention him.

Blackjack

I have written a lot about the card game Blackjack.

   c5 blackjack
% blackjack

% arrow                     document     term counts     lsi   date
%         news/simulating_blackjack.txt   19   19/68    0.536  01-Oct-2012
%  >>                     ncmm/ncmgui.m    4   23/68    0.372  2016
%  >>               blog/random_blog2.m    4   27/68    0.266  04-May-2015
%  >>                   ncmm/Contents.m    2   29/68    0.244  2016
%  >>                    blog/c5_blog.m    5   34/68    0.206  28-Aug-2017
%  >>                    ncm/random.tex   13   47/68    0.148  2004
%  >>                 lab/thumbnails2.m    2   49/68    0.088  2017
%  >>                        lab/lab2.m    1   50/68    0.061  2017
%  >>      news/numerical_computing.txt    1   51/68    0.025  01-Jun-2004
%  >>                   blog/lab_blog.m    1   52/68    0.004  31-Oct-2016
%  >>                  ncmm/blackjack.m    8   60/68   -0.023  2016
%  >>                   lab/blackjack.m    8   68/68   -0.026  2017

We can see two newsletter columns, three blogs, a portion of a book chapter, several code segments, and two copies of the blackjack app. Again, I am using right clicks.

Levenshtein distance

I recently wrote a blog post about Levenshtein Edit Distance Between Strings. If c^5 does not recognize the keyword in a query, it uses Levenshtein distance to find the closest match in the term list to the unrecognized query. This easily corrects simple spelling mistakes, like missing letters. For example, the missing "i" in "polynomal" is corrected to become "polynomial". And "Levenstein" becomes "levenshtein".

I received a pleasant surprise when I entered "Molar", expecting it to become "moler". Instead, I got "polar" because only one substitution is required to convert "Molar" to "polar", but two substitutions are required to turn "Molar" into "moler". (Microsoft Word spelling correction used to turn "MATLAB" into "Meatball".)
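The "Molar" example can be reproduced with the textbook dynamic programming recurrence for edit distance. This sketch is not the code used in c^5. The four-entry term list is hypothetical, and the comparison is case sensitive, which is what makes the capital "M" cost a substitution.

```matlab
terms = ["polar" "moler" "polynomial" "levenshtein"];   % hypothetical term list
query = 'Molar';

dist = zeros(size(terms));
for p = 1:numel(terms)
    s = query;
    t = char(terms(p));
    m = numel(s);
    n = numel(t);
    D = zeros(m+1,n+1);        % D(i+1,j+1) = distance between s(1:i) and t(1:j)
    D(:,1) = (0:m)';           % delete all of s(1:i)
    D(1,:) = 0:n;              % insert all of t(1:j)
    for i = 1:m
        for j = 1:n
            cost = double(s(i) ~= t(j));              % 0 if the characters match
            D(i+1,j+1) = min([D(i,j+1) + 1, ...       % deletion
                              D(i+1,j) + 1, ...       % insertion
                              D(i,j) + cost]);        % substitution or match
        end
    end
    dist(p) = D(m+1,n+1);
end

[~,best] = min(dist);
closest = terms(best)
```

The distance to "polar" is 1 and the distance to "moler" is 2, so "polar" wins.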

Multi-word queries

I'm not quite sure what to do with queries consisting of more than one term. What is the expected response to a query of "Wilkinson polynomial", for example? Is it documents that contain either "Wilkinson" or "polynomial"? This is what LSI would provide. But it is probably better to look for documents that contain both "Wilkinson" and "polynomial". I'm not sure how to do this.

Worse yet, I can't look for an exact match to the two-word string "Wilkinson polynomial" because the first thing the setup program does is to break text into individual words.
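One possibility for the "both" interpretation is to filter on the term-document matrix itself before, or instead of, ranking by LSI score. This is only a sketch of the idea with a made-up three-term, four-document matrix. It does nothing about the exact phrase problem.

```matlab
% Hypothetical term list and count matrix; columns are documents.
T = ["moler"; "polynomial"; "wilkinson"];
A = sparse([0 1 1 0
            2 0 1 3
            1 0 4 0]);

i = find(T == "wilkinson");
j = find(T == "polynomial");

either = find(A(i,:) > 0 | A(j,:) > 0)    % documents containing at least one term
both   = find(A(i,:) > 0 & A(j,:) > 0)    % documents containing both terms
```

The documents in both could then be ordered by, say, the sum of their two LSI scores, but that choice is a guess on my part.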

Stemming

This project is not finished. If I work on it any more, I am going to have to learn about scraping, stemming and lemmatization of the source texts. This involves relatively simple tasks like removing possessives and plurals and more complicated tasks like combining all the words with the same root or lemma. The sentence

"the quick brown fox jumped over the lazy dog's back"

becomes

"the quick brown fox jump over the lazi dog' back"

Loren's guest blogger Toshi Takeuchi posted an article in 2015 about Latent Semantic Analysis with MATLAB. He references MATLAB code for stemming.
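The simple end of this, removing possessives and plurals, can be approximated with two regular expressions. This is a deliberately crude sketch, not a real stemmer and not the code that produced the example sentence above. The sample words are hypothetical, and irregular plurals such as "matrices" are not handled.

```matlab
words = ["dog's" "polynomials" "columns" "fox"];   % hypothetical sample words

w = regexprep(words,"'s$","");        % drop a trailing possessive 's
w = regexprep(w,"(?<=[^s])s$","")     % drop a final s, but leave "ss" endings alone
```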

Parsing queries

I can imagine doing a better job of parsing queries, although I could never approach the sophistication of a system like Google or Siri.

Limitations

A significant fraction of what I have written is not prose -- it is mathematics or code. It cannot be parsed with the techniques of text analytics. For example, the source texts for the books NCM and EXM have hundreds of snippets of LaTeX like

\begin{eqnarray*}
 A V \eqs U \Sigma , \\
 A^H U \eqs V \Sigma^H .
\end{eqnarray*}

And earlier in this blog post I had

tic
[U,S,V] = svd(full(A),'econ');
toc

My c5setup program now has to skip over everything like this. In doing so, it misses much of the message.

Software

I have updated Cleve's Laboratory in the MATLAB Central File Exchange to include c5.m and c5database.mat.




Published with MATLAB® R2017a
