C^5, Cleve’s Corner Collection Card Catalog
I have been writing books, programs, newsletter columns and blogs since 1990. I have now collected all of this material into one repository. Cleve's Corner Collection consists of 458 "documents", all available on the internet. There are
- 150 posts from Cleve's Corner blog.
- 43 columns from Cleve's Corner News and Notes edition.
- 33 chapters from two books, Experiments with MATLAB and Numerical Computing with MATLAB.
- 218 programs from Cleve's Laboratory, EXM and NCM.
- 14 video transcripts from MIT Open Courseware and elsewhere.
C^5 is an app, a search tool that acts like a traditional library card catalog. It allows you to do keyword based searches through the collection and follow links to the material on the net. Responses to queries are ordered by the scores generated by Latent Semantic Indexing, LSI, which employs the singular value decomposition of a term-document matrix of key word counts.
Contents
Opening figure
Here is the opening window for C^5.
c5
Enter a query, usually just a single key word, in the edit box at the top. This is a term. The names of the various documents that are relevant to the term are then displayed, one at a time, in the document box.
The arrow keys allow the document list to be scanned and changed. The LSI score determines the ordering of the list. The term count is the number of times, if any, that the query term appears in the document. The web button accesses a copy of the document on the internet.
Don Knuth
For my first example, let's search for material I have written that mentions Stanford Emeritus Professor of Computer Science, Donald Knuth. Enter "knuth" in the query box or on the command line.
c5 knuth
The first document name "blog/c5_blog.m" refers to this blog post, so there is a bit of self reference happening here. The document suffix is .m because the source texts for my blog are MATLAB programs processed by the publish command.
The term count "10, 10/29" indicates that "knuth" appears 10 times in this document and that so far we have seen 10 out of the 29 times that "knuth" appears in the entire collection.
Click on the log button and then click a dozen or so times on the right arrow. This log of the successive displays is printed in the command window. Document names, dates, and term counts are displayed in decreasing order of LSI score.
% knuth % arrow document term counts lsi date % blog/c5_blog.m 10 10/29 0.594 28-Aug-2017 % > blog/easter_blog.m 5 15/29 0.553 18-Mar-2013 % > blog/lambertw_blog.m 3 18/29 0.183 02-Sep-2013 % > news/stiff_equations.txt 2 20/29 0.182 01-May-2003 % > blog/hilbert_blog.m 2 22/29 0.139 02-Feb-2013 % > blog/random_blog.m 2 24/29 0.139 17-Apr-2015 % > exmm/easter.m 1 25/29 0.112 2016 % > blog/gef.m 3 28/29 0.100 07-Jan-2013 % > ncmm/ode23tx.m 0 28/29 0.086 2016 % > news/normal_behavior.txt 0 28/29 0.070 01-May-2001 % > blog/magic_2.m 0 28/29 0.059 05-Nov-2012 % .......... % >> blog/denorms.m 1 29/29 0.010 21-Jul-2014 .........
The second most relevant document, "easter_blog.m", is a post from 2013 that describes an algorithm, popularized by Knuth, for computing the date each year in the Western, or Gregorian Calendar that Easter Sunday is celebrated. The term count is "5, 15/29", so the first two documents account for slightly over half of the total appearances of the search term.
The next six lines tell us that "knuth" appears in blog posts about the Lambert W function, Hilbert matrices, random numbers, and George Forsythe (gef), as well as a MATLAB News and Notes column in 2003 about stiff differential equations, and the actual MATLAB program from EXM for computing the date of Easter.
The following results with term counts of zero are blog posts that do not contain "knuth", but which have LSI scores indicating they might be relevant. Finally, the blog post named "denorms" is about denormal floating point numbers. It is reached by right-clicking on the right arrow to skip over documents with term counts of zero.
c5setup
I don't know how to parse .html or .pdf files, so I have collected the original source material for everything that I have written that is now available on the web. There are .m files for the blog and MATLAB programs, .tex files for the LaTeX of the book chapters, and .txt files for the newsletter columns and transcripts of the videos. There are 458 files totaling about 3.24 megabytes of text.
I have a program, c5setup, that I run on my own laptop to extract all the individual words and produce the term-document matrix. This is a sparse matrix whose (k,j) -th entry is the number of times that the k -th term appears in the j -th document. It is saved in c5database.mat for use by the c^5 app.
This setup processing eliminates frequently occurring English language words, like "the", on a list of stopwords.
length(stopwords)
ans = 177
c5database
clear
load c5database
whos
Name Size Bytes Class Attributes A 16315x458 1552776 double sparse D 458x1 40628 string L 1x1 120156 struct T 16315x1 1132930 string
- A is the term-document matrix.
- D is a string array of the file names in my personal repository of the source documents.
- L is a struct containing string arrays used to generate URLs of the documents on the web.
- T is a string array of key words or terms.
Sparsity
The sparsity of the term-document matrix is a little over one percent.
sparsity = nnz(A)/numel(A)
sparsity = 0.0130
Spy
Spy plot of the first 1000 rows of the term-document matrix.
clf spy(A(1:1000,:))
Most frequent terms
The row sums are the total term counts.
ttc = sum(A,2);
Find the terms that occur at least 1000 times.
k = find(ttc >= 1000);
fprintf('%-10s %6s\n',[T(k) num2str(ttc(k))]')
function 1806 matlab 1407 matrix 1499 one 1262 two 1090
Surprise. I write a lot about MATLAB and matrices.
Singular values
We might as well compute all the singular values of the full matrix. It takes less than a second. It's important to use the economical version of the SVD that produces a U the same size as A. Otherwise we'd have a 16,315-by-16,315 U.
tic
[U,S,V] = svd(full(A),'econ');
toc
Elapsed time is 0.882556 seconds.
A logarithmic plot of the singular values shows that they do not decrease very rapidly.
clf semilogy(diag(S),'.','markersize',10) axis([-10 450 1 1000]) title('singular values')
Reduced rank approximation
I wrote a post about Latent Semantic Indexing a month ago. LSI employs a reduced rank approximation to the term-document matrix. c^5 has a slider for choosing the rank. The plot of the singular values shows that the accuracy of the approximation is pretty much independent of the chosen value. Any value except very small values or large values near full rank gives an approximation good to between one and ten percent. The power of LSI does not derive from the approximation accuracy. I usually take the rank to be about half the number of columns.
n = size(A,2); k = n/2; Uk = U(:,1:k); Sk = S(1:k,1:k); Vk = V(:,1:k); relerr = norm(Uk*Sk*Vk'-A)/S(1,1)
relerr = 0.0357
Arrow keys
The three arrow keys in the c^5 app can be clicked with either the left or right mouse button (or control-click on a one-button mouse).
- left >: next document, any term count.
- right >: next document with nonzero term count.
- left <: previous document, any term count.
- right <: previous document with nonzero term count.
- left ^: use the root of the current document for the query.
- right ^: use a random term for the query.
Repeatedly clicking the up arrow with the right button (an alt click) is a good way to browse the entire collection.
Lothar Collatz
Let's see the logs for two more examples. Lothar Collatz has a short log.
c5 Collatz
% collatz % % arrow document term counts lsi date % blog/threenplus1_blog.m 9 9/19 0.904 19-Jan-2015 % >> blog/collatz_inequality.m 4 13/19 0.108 16-Mar-2015 % >> blog/c5_blog.m 5 18/19 0.075 28-Aug-2017 % >> ncm/intro.tex 1 19/19 -0.003 2004
Collatz appears in two posts from 2015, one on his 3n+1 problem and one on an elegant inequality that produces a surprising graphic, and in the section of this blog post about c^5 that you are now reading, He is also mentioned in the introduction to the NCM book, but the LSI value of very small. The double arrow at the beginning of each line signifies a right click, skipping over documents that do not mention him.
Blackjack
I have written a lot about the card game Blackjack.
c5 blackjack
% blackjack % arrow document term counts lsi date % news/simulating_blackjack.txt 19 19/68 0.536 01-Oct-2012 % >> ncmm/ncmgui.m 4 23/68 0.372 2016 % >> blog/random_blog2.m 4 27/68 0.266 04-May-2015 % >> ncmm/Contents.m 2 29/68 0.244 2016 % >> blog/c5_blog.m 5 34/68 0.206 28-Aug-2017 % >> ncm/random.tex 13 47/68 0.148 2004 % >> lab/thumbnails2.m 2 49/68 0.088 2017 % >> lab/lab2.m 1 50/68 0.061 2017 % >> news/numerical_computing.txt 1 51/68 0.025 01-Jun-2004 % >> blog/lab_blog.m 1 52/68 0.004 31-Oct-2016 % >> ncmm/blackjack.m 8 60/68 -0.023 2016 % >> lab/blackjack.m 8 68/68 -0.026 2017
We can see two newsletter columns, three blogs, a portion of a book chapter, several code segments, and two copies of the blackjack app. Again, I am using right clicks.
Levenshtein distance
I recently wrote a blog post about Levenshtein Edit Distance Between Strings. If c^5 does not recognize the key word in a query, it uses Levenshtein distance to find the closest match in the term list to the unrecognized query. This easily corrects simple spelling mistakes, like missing letters. For example the missing "i" in "polynomal" is corrected to become "polynomial". And "Levenstein" becomes "levenshtein".
I received a pleasant surprise when I entered "Molar", expecting it to become "moler". Instead, I got "polar" because only one substitution is required to convert "Molar" to "polar", but two substitutions are required to turn "Molar" into "moler". (Microsoft Word spelling correction used to turn "MATLAB" into "Meatball".)
Multi-word queries
I'm not quite sure what to do with queries consisting of more than one term. What is the expected response to a query of "Wilkinson polynomial", for example? Is it documents that contain either "Wilkinson" or "polynomial"? This is what LSI would provide. But it is probably better to look for documents that contain both "Wilkinson" and "polynomial". I'm not sure how to do this.
Worse yet, I can't look for an exact match to the two-word string "Wilkinson polynomial" because the first thing the setup program does is to break text into individual words.
Stemming
This project is not finished. If I work on it any more, I am going to have learn about scraping, stemming and lemmatization of the source texts. This involves relatively simple tasks like removing possessives and plurals and more complicated tasks like combining all the words with the same root or lemma. The sentence
"the quick brown fox jumped over the lazy dog's back"
becomes
"the quick brown fox jump over the lazi dog' back"
Loren's guest blogger Toshi Takeuchi posted an article in 2015 about Latent Semantic Analysis with MATLAB. He references MATLAB code for stemming.
Parsing queries
I can imagine doing a better job of parsing queries, although I could never approach the sophistication of a system like Google or Siri.
Limitations
A significant fraction of what I have written is not prose -- it is mathematics or code. It cannot be parsed with the techniques of text analytics. For example, the source texts for the books NCM and EXM have hundreds of snippets of LaTeX like
\begin{eqnarray*} A V \eqs U \Sigma , \\ A^H U \eqs V \Sigma^H . \end{eqnarray*}
And earlier in this blog post I had
tic [U,S,V] = svd(full(A),'econ'); toc
My c5setup program now has to skip over everything like this. In doing so, it misses much the message.
Software
I had updated Cleve's Laboratory in the Central File Exchange to include c5.m and c5database.mat.
コメント
コメントを残すには、ここ をクリックして MathWorks アカウントにサインインするか新しい MathWorks アカウントを作成します。