{"id":964,"date":"2014-04-14T12:00:42","date_gmt":"2014-04-14T17:00:42","guid":{"rendered":"https:\/\/blogs.mathworks.com\/cleve\/?p=964"},"modified":"2016-12-05T13:59:17","modified_gmt":"2016-12-05T18:59:17","slug":"singular-value-analysis-of-cryptograms","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/cleve\/2014\/04\/14\/singular-value-analysis-of-cryptograms\/","title":{"rendered":"Singular Value Analysis of Cryptograms"},"content":{"rendered":"<div class=\"content\">\n\n<!--introduction-->The Singular Value Decomposition of the digram frequency matrix of a text coded with a simple substitution cypher can reveal information about the vowels and consonants in the text.\n\n<!--\/introduction-->\n<h3>Contents<\/h3>\n<div>\n<ul>\n \t<li><a href=\"#1d5da631-0647-4dbf-aead-5b5ae354e514\">Don Morrison<\/a><\/li>\n \t<li><a href=\"#3e7934c2-6699-491c-ab99-a39bd7f0378b\">Digram frequency matrix<\/a><\/li>\n \t<li><a href=\"#e808fd1c-607c-44ae-a7ad-a43c73a7e5e4\">Principal component analysis<\/a><\/li>\n \t<li><a href=\"#fb0e0e4a-fe35-4edd-85b9-9a53f78e878b\">Gettysburg Address<\/a><\/li>\n \t<li><a href=\"#48b7d9d2-d833-4ede-bcca-c53ef5330b18\">Plot<\/a><\/li>\n \t<li><a href=\"#46655b5b-42a6-434d-a54a-94c094b2d607\">Cryptograms<\/a><\/li>\n \t<li><a href=\"#18bab844-509b-4c9c-9d7c-0921be150193\">Reference<\/a><\/li>\n<\/ul>\n<\/div>\n<h4>Don Morrison<a name=\"1d5da631-0647-4dbf-aead-5b5ae354e514\"><\/a><\/h4>\nI have mentioned Don Morrison before in Cleve's Corner, in my post on <a href=\"https:\/\/blogs.mathworks.com\/cleve\/2012\/07\/30\/pythagorean-addition\">Pythagorean Addition<\/a>. He was the mathematician and computer scientist who recruited me to the University of New Mexico in 1972. He was remarkably imaginative and today's post is about another one of his creative ideas. We wrote a paper about it, cited below, in the <i>American Math Monthly<\/i> thirty years ago.\n<h4>Digram frequency matrix<a name=\"3e7934c2-6699-491c-ab99-a39bd7f0378b\"><\/a><\/h4>\nIn English, and in many other languages, vowels are usually followed by consonants and consonants are usually followed by vowels. This fact is reflected by the principal component analysis of what I will call the digram frequency matrix for a sample of text. (I am using the term <i>digram<\/i> to refer to any pair of letters, although in the discipline of orthography the term more precisely refers to a pair of characters that represents one phoneme.)\n\nEnglish text uses 26 letters, so the digram frequency matrix is a 26-by-26 matrix, $A$, with counts of pairs of letters. Blanks and all other punctuation are removed from the text and the entire sample is thought of as circular or periodic, so the first letter follows the last letter. The matrix entry $a_{i,j}$ is the number of times the $i$-th letter is followed by the $j$-th letter. The row and column sums of $A$ are the same; they count the number of times individual letters occur in the sample. The fifth row and fifth column usually have the largest sums because the fifth letter, which is \"E,\" is usually the most frequent.\n<h4>Principal component analysis<a name=\"e808fd1c-607c-44ae-a7ad-a43c73a7e5e4\"><\/a><\/h4>\nA principal component analysis of $A$ produces a first component,\n\n$$ A \\approx \\sigma_1 u_1 v_1^T $$\n\nthat reflects the individual letter frequencies. The first right- and left-singular vectors, $u_1$ and $v_1$, have elements that are all of the same sign and that are roughly proportional to the corresponding frequencies.\n\nWe are primarily interested in the second principal component,\n\n$$ A \\approx \\sigma_1 u_1 v_1^T + \\sigma_2 u_2 v_2^T $$\n\nThe second term has positive entries in vowel-consonant and consonant-vowel positions and negative entries in vowel-vowel and consonant-consonant positions. The sign patterns adjust the frequency counts to reflect the vowel-consonant properties of the language.\n<h4>Gettysburg Address<a name=\"fb0e0e4a-fe35-4edd-85b9-9a53f78e878b\"><\/a><\/h4>\nMy <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/37976-numerical-computing-with-matlab\">Numerical Computing with MATLAB<\/a> toolbox <tt>ncm<\/tt> includes a function, <a href=\"https:\/\/www.mathworks.com\/content\/dam\/mathworks\/mathworks-dot-com\/moler\/ncm\/digraph.m\"><tt>digraph.m<\/tt><\/a>, that carries out this analysis. Let's run the function on Lincoln's Gettysburg Address, <a href=\"https:\/\/www.mathworks.com\/content\/dam\/mathworks\/mathworks-dot-com\/moler\/ncm\/gettysburg.txt\"><tt>gettysburg.txt<\/tt><\/a>.\n<pre class=\"codeinput\">   fid = fopen(<span class=\"string\">'gettysburg.txt'<\/span>);\n   txt = fread(fid);\n   fclose(fid);\n<\/pre>\n<tt>Convert<\/tt> to integers between 1 and 26.\n<pre class=\"codeinput\">   k = upper(char(txt)) - <span class=\"string\">'A'<\/span> + 1;\n   k(k &lt; 1 | k &gt; 26) = [];\n<\/pre>\nUse the counting feature of <tt>sparse<\/tt> to generate the digram frequency matrix.\n<pre class=\"codeinput\">   j = k([2:length(k) 1]);\n   A = full(sparse(k,j,1,26,26));\n   cnt = sum(A);\n<\/pre>\nCompute the <tt>SVD<\/tt>.\n<pre class=\"codeinput\">   [U,S,V] = svd(A);\n<\/pre>\n<h4>Plot<a name=\"48b7d9d2-d833-4ede-bcca-c53ef5330b18\"><\/a><\/h4>\nPlot the second left and right singular vectors. The $i$-th letter of the alphabet is plotted at coordinates $(u_{i,2},v_{i,2})$. The distance of each letter from the origin is roughly proportional to its frequency, and the sign patterns cause the vowels to be plotted in one quadrant and the consonants to be plotted in the opposite quadrant.\n\nThere is more detail. The letter \"N\" is usually preceded by a vowel and often followed by another consonant, like \"D\" or \"G,\" and so it shows up in a quadrant pretty much by itself, sort of a \"left vowel, right consonant\" . On the other hand, \"H\" is often preceded by another consonant, namely \"T,\" and followed by a vowel, \"E,\" so it also gets its own quadrant, a \"right vowel, left consonant\".\n<pre class=\"codeinput\">   <span class=\"keyword\">for<\/span> i = 1:26\n      <span class=\"keyword\">if<\/span> cnt(i) &gt; 0\n         text(U(i,2),V(i,2),char(<span class=\"string\">'A'<\/span>+i-1))\n      <span class=\"keyword\">end<\/span>\n   <span class=\"keyword\">end<\/span>\n\n   s = 4\/3*max(max(abs(U(:,2))),max(abs(V(:,2))));\n   axis(s*[-1 1 -1 1])\n   axis <span class=\"string\">square<\/span>\n   line([0 0],[-s s],<span class=\"string\">'color'<\/span>,<span class=\"string\">'b'<\/span>)\n   line([-s s],[0 0],<span class=\"string\">'color'<\/span>,<span class=\"string\">'b'<\/span>)\n   box\n   title(sprintf(<span class=\"string\">'%d characters'<\/span>,length(k)))\n<\/pre>\n<img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/cleve\/digram_blog_01.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/>\n<h4>Cryptograms<a name=\"46655b5b-42a6-434d-a54a-94c094b2d607\"><\/a><\/h4>\nA simple substitution cipher is obtained by permuting the letters of the alphabet. The digram matrix $A$ is replaced by $PAP^T$ where $P$ is a permutation matrix. The singular vectors are simply multiplied by $P$. Letters in the plot are permuted, but the locations and quadrants remain the same. So it is possible to identify the vowels and consonants, and perhaps -- if the sample is large enough -- the letters standing for \"N\" and \"H\" in the cryptogram.\n\nI admit that simple substitution ciphers are not the world's most difficult codes to crack, but I found this to be an unexpected use of the SVD.\n<h4>Reference<a name=\"18bab844-509b-4c9c-9d7c-0921be150193\"><\/a><\/h4>\nCleve Moler and Donald Morrison, \"Singular Value Analysis of Cryptograms\", <i>The American Mathematical Monthly<\/i>, Vol. 90, No. 2, 1983, pp. 78-87, <a href=\"http:\/\/www.jstor.org\/stable\/2975804\" target=\"_blank\">&lt;http:\/\/www.jstor.org\/stable\/2975804<\/a>&gt;\n\n<script>\/\/ <![CDATA[\nfunction grabCode_7d26256412bd4be7b83f37349e731d9f() {\n        \/\/ Remember the title so we can use it in the new page\n        title = document.title;\n\n        \/\/ Break up these strings so that their presence\n        \/\/ in the Javascript doesn't mess up the search for\n        \/\/ the MATLAB code.\n        t1='7d26256412bd4be7b83f37349e731d9f ' + '##### ' + 'SOURCE BEGIN' + ' #####';\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 7d26256412bd4be7b83f37349e731d9f';\n    \n        b=document.getElementsByTagName('body')[0];\n        i1=b.innerHTML.indexOf(t1)+t1.length;\n        i2=b.innerHTML.indexOf(t2);\n \n        code_string = b.innerHTML.substring(i1, i2);\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\n\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \n        \/\/ in the XML parser.\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\n        \/\/ doesn't go ahead and substitute the less-than character. \n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\n\n        copyright = 'Copyright 2014 The MathWorks, Inc.';\n\n        w = window.open();\n        d = w.document;\n        d.write('\n\n\n\n\n\n\n\n<pre>\\n');\n        d.write(code_string);\n\n        \/\/ Add copyright line at the bottom if specified.\n        if (copyright.length > 0) {\n            d.writeln('');\n            d.writeln('%%');\n            if (copyright.length > 0) {\n                d.writeln('% _' + copyright + '_');\n            }\n        }\n\n        d.write('<\/pre>\n\n\n\n\n\n\n\n\n\\n');\n\n        d.title = title + ' (MATLAB code)';\n        d.close();\n    }\n\/\/ ]]><\/script>\n<p style=\"text-align: right; font-size: xx-small; font-weight: lighter; font-style: italic; color: gray;\"><a><span style=\"font-size: x-small; font-style: italic;\">Get\nthe MATLAB code<noscript>(requires JavaScript)<\/noscript><\/span><\/a><\/p>\nPublished with MATLAB\u00ae R2014a\n\n<\/div>\n<!--\n7d26256412bd4be7b83f37349e731d9f ##### SOURCE BEGIN #####\n%% Singular Value Analysis of Crytograms\n% The Singular Value Decomposition of the digram frequency matrix of a text\n% coded with a simple substitution cypher can reveal information about the\n% vowels and consonants in the text.\n%\n%% Don Morrison\n% I have mentioned Don Morrison before in Cleve's Corner, in my post on\n% <https:\/\/blogs.mathworks.com\/cleve\/2012\/07\/30\/pythagorean-addition % Pythagorean Addition>.  He was the mathematician and computer scientist\n% who recruited me to the University of New Mexico in 1972.\n% He was remarkably imaginative and today's post is about another one of\n% his creative ideas.  We wrote a paper about it, cited below, in the\n% _American Math Monthly_ thirty years ago.\n\n%% Digram frequency matrix\n% In English, and in many other languages, vowels are usually followed\n% by consonants and consonants are usually followed by vowels.  This fact\n% is reflected by the principal component analysis of what I will call\n% the digram frequency matrix for a sample of text.  (I am using the\n% term _digram_ to refer to any pair of letters, although in the discipline\n% of orthography the term more precisely refers to a pair of characters that\n% represents one phoneme.)\n\n%%\n% English text uses 26 letters, so the digram frequency matrix is a\n% 26-by-26 matrix, $A$, with counts of pairs of letters.  Blanks and all\n% other punctuation are removed from the text and the entire sample is\n% thought of as circular or periodic, so the first letter follows the\n% last letter.  The matrix entry $a_{i,j}$ is the number of times the $i$-th\n% letter is followed by the $j$-th letter.\n% The row and column sums of $A$ are the same; they count the number of times\n% individual letters occur in the sample.  The fifth row and fifth column\n% usually have the largest sums because the fifth letter, which is \"E,\"\n% is usually the most frequent.\n\n%% Principal component analysis\n% A principal component analysis of $A$ produces a first component,\n%\n% $$ A \\approx \\sigma_1 u_1 v_1^T $$\n%\n% that reflects the individual letter frequencies.  The first right- and\n% left-singular vectors, $u_1$ and $v_1$, have elements that are all\n% of the same sign and that are roughly proportional to the corresponding\n% frequencies.\n\n%%\n% We are primarily interested in the second principal component,\n%\n% $$ A \\approx \\sigma_1 u_1 v_1^T + \\sigma_2 u_2 v_2^T $$\n%\n% The second term has positive entries in vowel-consonant and consonant-vowel\n% positions and negative entries in vowel-vowel and consonant-consonant\n% positions.  The sign patterns adjust the frequency counts to reflect\n% the vowel-consonant properties of the language.\n\n%% Gettysburg Address\n% My <https:\/\/www.mathworks.com\/moler.htmlncmfilelist.html % Numerical Computing with MATLAB> toolbox |ncm| includes a function,\n% <https:\/\/www.mathworks.com\/moler.htmlncm\/digraph.m |digraph.m|>, that carries\n% out this analysis.  Let's run the function on Lincoln's Gettysburg Address,\n% <https:\/\/www.mathworks.com\/moler.htmlncm\/gettysburg.txt |gettysburg.txt|>.\n\nfid = fopen('gettysburg.txt');\ntxt = fread(fid);\nfclose(fid);\n\n%%\n% Convert to integers between 1 and 26.\n\nk = upper(char(txt)) - 'A' + 1;\nk(k < 1 | k > 26) = [];\n\n%%\n% Use the counting feature of |sparse| to generate the digram frequency matrix.\n\nj = k([2:length(k) 1]);\nA = full(sparse(k,j,1,26,26));\ncnt = sum(A);\n\n%%\n% Compute the SVD.\n\n[U,S,V] = svd(A);\n\n%% Plot\n% Plot the second left and right singular vectors.\n% The $i$-th letter of the alphabet is plotted at coordinates\n% $(u_{i,2},v_{i,2})$.  The distance of each letter from the origin\n% is roughly proportional to its frequency, and the sign patterns cause\n% the vowels to be plotted in one quadrant and the consonants to be\n% plotted in the opposite quadrant.\n\n%%\n% There is more detail.\n% The letter \"N\" is usually preceded by a vowel and often followed\n% by another consonant, like \"D\" or \"G,\" and so it shows up in a quadrant\n% pretty much by itself, sort of a \"left vowel, right consonant\" .\n% On the other hand, \"H\" is often preceded by another consonant, namely \"T,\"\n% and followed by a vowel, \"E,\" so it also gets its own quadrant,\n% a \"right vowel, left consonant\".\n\nfor i = 1:26\nif cnt(i) > 0\ntext(U(i,2),V(i,2),char('A'+i-1))\nend\nend\n\ns = 4\/3*max(max(abs(U(:,2))),max(abs(V(:,2))));\naxis(s*[-1 1 -1 1])\naxis square\nline([0 0],[-s s],'color','b')\nline([-s s],[0 0],'color','b')\nbox\ntitle(sprintf('%d characters',length(k)))\n\n%% Cryptograms\n% A simple substitution cipher is obtained by permuting the letters of the\n% alphabet.  The digram matrix $A$ is replaced by $PAP^T$ where $P$ is a\n% permutation matrix.  The singular vectors are simply multiplied by $P$.\n% Letters in the plot are permuted, but the locations and quadrants remain\n% the same.  So it is possible to identify the vowels and consonants, and\n% perhaps REPLACE_WITH_DASH_DASH if the sample is large enough REPLACE_WITH_DASH_DASH the letters standing for\n% \"N\" and \"H\" in the cryptogram.\n\n%%\n% I admit that simple substitution ciphers are not the world's most\n% difficult codes to crack, but I found this to be an unexpected use of\n% the SVD.\n\n%% Reference\n% Cleve Moler and Donald Morrison, \"Singular Value Analysis of Cryptograms\",\n% _The American Mathematical Monthly_, Vol. 90, No. 2, 1983, pp. 78-87,\n% <http:\/\/www.jstor.org\/stable\/2975804 http:\/\/www.jstor.org\/stable\/2975804>\n\n##### SOURCE END ##### 7d26256412bd4be7b83f37349e731d9f\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/cleve\/digram_blog_01.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction-->The Singular Value Decomposition of the digram frequency matrix of a text coded with a simple substitution cypher can reveal information about the vowels and consonants in the text.\n\n<!--\/introduction-->... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/cleve\/2014\/04\/14\/singular-value-analysis-of-cryptograms\/\">read more >><\/a><\/p>","protected":false},"author":78,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[13,5,4,6,8],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/964"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/comments?post=964"}],"version-history":[{"count":7,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/964\/revisions"}],"predecessor-version":[{"id":2184,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/posts\/964\/revisions\/2184"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/media?parent=964"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/categories?post=964"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/cleve\/wp-json\/wp\/v2\/tags?post=964"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}