{"id":1217,"date":"2015-09-09T06:54:41","date_gmt":"2015-09-09T11:54:41","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1217"},"modified":"2020-07-28T16:37:28","modified_gmt":"2020-07-28T20:37:28","slug":"text-mining-shakespeare-with-matlab","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2015\/09\/09\/text-mining-shakespeare-with-matlab\/","title":{"rendered":"Text Mining Shakespeare with MATLAB"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>Have you ever wondered how Google provides the auto-complete feature in Google Suggest? Or sometimes you see results of hilarious or annoying auto-correct features on your smartphone? Today's guest blogger, Toshi Takeuchi, explains a natural language processing approach through a fun text mining example with Shakespeare.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/wordle.png\" alt=\"\"> <\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#c905b0e4-f1c8-4f25-90a5-e3c0dd0d4203\">Predictive Text Game<\/a><\/li><li><a href=\"#bafdcae5-9130-4499-82ed-8a3ff020f48c\">N-grams<\/a><\/li><li><a href=\"#3e8b57df-489f-40bb-8117-fadbaa30fd20\">Language Model<\/a><\/li><li><a href=\"#296da411-8418-4d88-975e-d8569eddbe94\">Reading and Preprocessing Shakespeare<\/a><\/li><li><a href=\"#52658e0e-b57a-408f-a556-08551fb4869a\">Building a Bigram Language Model<\/a><\/li><li><a href=\"#27946771-1314-4a88-b675-d59852880ace\">Generating Bigram Shakespeare Text<\/a><\/li><li><a href=\"#b4fa1e2a-c45e-4eb1-97a2-c37d3041e1c0\">Generating Trigram Shakespeare Text<\/a><\/li><li><a href=\"#7b46d709-3ba1-410b-95a5-6b4d29800cf5\">Create a Smartphone App<\/a><\/li><li><a href=\"#045d28d9-a216-4d3f-9f86-f2f517bf0099\">Summary<\/a><\/li><\/ul><\/div><h4>Predictive Text Game<a name=\"c905b0e4-f1c8-4f25-90a5-e3c0dd0d4203\"><\/a><\/h4><p>There is a simple but powerful natural language processing approach called <a href=\"https:\/\/en.wikipedia.org\/wiki\/N-gram\">n-gram<\/a>-based <a href=\"https:\/\/en.wikipedia.org\/wiki\/Language_model\">language models<\/a> which you can have a lot of fun with using MATLAB.<\/p><p>To see how it works, we will create a predictive text game that generates random Shakespearean text automatically. You can also specify the first word to generate a random sentence. Here are a couple of auto generated fake Shakespearan quotes:<\/p><pre>didst thou kill my cousin romeo\r\nparting is such sweet sorrow that i ask again\r\nnurse commend me to your daughter\r\nborrow cupid s wings and soar with them\r\no mischief thou art like one of these accidents\r\nlove is a most sharp sauce<\/pre><p>I happen to use <a href=\"https:\/\/www.gutenberg.org\/cache\/epub\/1513\/pg1513.txt\">Romeo and Juliet<\/a> from Project Gutenberg for this example, but you can use any collection of text data. I almost thought about using comedian <a href=\"http:\/\/thoughtcatalog.com\/kim-quindlen\/2015\/04\/26-amy-schumer-quotes-that-will-make-you-laugh-think-and-feel-understood-all-at-the-same-time\/\">Amy Schumer quotes<\/a>. If you have a collection of your own writing, such as emails, SMS and such, this can generate text that sounds like you (check out <a href=\"http:\/\/www.xkcd.com\/1068\/\">this XKCD cartoon<\/a>). If you have collection of pirate talks, you can talk like them. That's going to be fun.<\/p><h4>N-grams<a name=\"bafdcae5-9130-4499-82ed-8a3ff020f48c\"><\/a><\/h4><p>Let's start with the basics. An N-gram is a sequence of words that appear together in a sentence. Commonly word tokens are used, and they are unigrams. You can also use a pair of words, and that's a bigram. Trigrams use three words... all the way to N-grams for N words. Let's try using this <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/ngrams.m\">ngrams<\/a> function.<\/p><pre class=\"codeinput\">ngrams(<span class=\"string\">'a b c d e'<\/span>, 1)                          <span class=\"comment\">% unigrams<\/span>\r\nngrams(<span class=\"string\">'a b c d e'<\/span>, 2)                          <span class=\"comment\">% bigrams<\/span>\r\nngrams(<span class=\"string\">'a b c d e'<\/span>, 3)                          <span class=\"comment\">% trigrams<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nans = \r\n    'a'    'b'    'c'    'd'    'e'\r\n\r\nans = \r\n    'a b'    'b c'    'c d'    'd e'\r\n\r\nans = \r\n    'a b c'    'b c d'    'c d e'\r\n<\/pre><h4>Language Model<a name=\"3e8b57df-489f-40bb-8117-fadbaa30fd20\"><\/a><\/h4><p>N-grams are used to predict a sequence of words in a sentence based on chained conditional probabilities. These probabilities are estimated by mining a collection of text known as a corpus; we will use 'Romeo and Juliet' as our corpus. Language models are made up of such word sequence probabilities.<\/p><p>Here is a bigram-based example of how you would compute such a probability.<\/p><pre>P(word2|word1) = c('word1 word2')\/c(word1)<\/pre><p><tt>P(word2|word1)<\/tt> is a conditional probability of word2 following word1, and you compute it by dividing the count of the igram 'word1 word2' by the count of word1. Here is an example for trigrams.<\/p><pre>P(word3|'word1 word2') =  c('word1 word2 word3')\/c('word1 word2')<\/pre><p>Word sequences are not always determined by the previous words. This is a very simplistic approach (known as a Markov model). However, it is easy to model and works reasonably well. <a href=\"https:\/\/en.wikipedia.org\/wiki\/Language_model\">Wikipedia<\/a> provides an example of how this can be useful in resolving ambiguity in speech recognition applications, where the phrases \"recognize speech\" and \"wreck a nice beach\" are pronounced almost the same in American English but mean very different things. You can probably guess that \"recognize speech\" would have a higher probability than \"wreck a nice beach\". A speech recognition application would adopt the higher probability option as the answer.<\/p><h4>Reading and Preprocessing Shakespeare<a name=\"296da411-8418-4d88-975e-d8569eddbe94\"><\/a><\/h4><p>The Project Gutenberg text file is in a plain vanilla ASCII file format with LFCR line breaks. It comes with a lot of extra header and footer text that we want to remove. I am assuming you have downloaded the text file to your current folder.<\/p><pre class=\"codeinput\">romeo = fileread(<span class=\"string\">'pg1513.txt'<\/span>);                 <span class=\"comment\">% read file content<\/span>\r\nromeo(1:13303) = [];                            <span class=\"comment\">% remove extra header text<\/span>\r\nromeo(end-144:end) = [];                        <span class=\"comment\">% remove extra footer text<\/span>\r\ndisp(romeo(662:866))                            <span class=\"comment\">% preview the text<\/span>\r\n<\/pre><pre class=\"codeoutput\">ACT I.\r\n\r\nScene I. A public place.\r\n\r\n[Enter Sampson and Gregory armed with swords and bucklers.]\r\n\r\nSampson.\r\nGregory, o' my word, we'll not carry coals.\r\n\r\nGregory.\r\nNo, for then we should be colliers.\r\n\r\n<\/pre><p>You need to remove non-dialogue text, such as stage directions. You also need to add sentence markers at the beginning and end of each, such as &lt;s&gt; and &lt;\/s&gt;. We will use sentences with at least 3 words. This procedure is handled in the <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/preprocess.m\">preprocess<\/a> function.<\/p><pre class=\"codeinput\">processed = preprocess(romeo);                  <span class=\"comment\">% preprocess text<\/span>\r\ndisp([processed{6} char(10) processed{7}])      <span class=\"comment\">% preview the result<\/span>\r\nprocessed = lower(processed);                   <span class=\"comment\">% lowercase text<\/span>\r\n<\/pre><pre class=\"codeoutput\">&lt;s&gt; Gregory, o' my word, we'll not carry coals. &lt;\/s&gt;\r\n&lt;s&gt; No, for then we should be colliers. &lt;\/s&gt;\r\n<\/pre><h4>Building a Bigram Language Model<a name=\"52658e0e-b57a-408f-a556-08551fb4869a\"><\/a><\/h4><p>Let's use a simple bigram model with <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/bigramClass.m\"><tt>bigramClass<\/tt><\/a> to build the first Shakespeare text generator.<\/p><pre class=\"codeinput\">delimiters = {<span class=\"string\">' '<\/span>, <span class=\"string\">'!'<\/span>, <span class=\"string\">''''<\/span>, <span class=\"string\">','<\/span>, <span class=\"string\">'-'<\/span>, <span class=\"string\">'.'<\/span>,<span class=\"keyword\">...<\/span><span class=\"comment\"> % word boundary characters<\/span>\r\n    <span class=\"string\">':'<\/span>, <span class=\"string\">';'<\/span>, <span class=\"string\">'?'<\/span>, <span class=\"string\">'\\r'<\/span>, <span class=\"string\">'\\n'<\/span>, <span class=\"string\">'--'<\/span>, <span class=\"string\">'&amp;'<\/span>};\r\nbiMdl = bigramClass(delimiters);                <span class=\"comment\">% instantiate the class<\/span>\r\nbiMdl.build(processed);                         <span class=\"comment\">% build the model<\/span>\r\n<\/pre><pre class=\"codeoutput\">Generating bigrams...\r\n.........................\r\nBuilding a bigram model...\r\n................\r\n<\/pre><p>Here is an example of how you use the bigram model to get the probability of 'thou art'. Rows represents the first word in a bigram, and columns the second.<\/p><pre class=\"codeinput\">row = strcmp(biMdl.unigrams, <span class=\"string\">'thou'<\/span>);           <span class=\"comment\">% select row for 'thou'<\/span>\r\ncol= strcmp(biMdl.unigrams, <span class=\"string\">'art'<\/span>);             <span class=\"comment\">% select col for 'art'<\/span>\r\nbiMdl.mdl(row,col)                              <span class=\"comment\">% probability of 'thou art'<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n      0.10145\r\n<\/pre><h4>Generating Bigram Shakespeare Text<a name=\"27946771-1314-4a88-b675-d59852880ace\"><\/a><\/h4><p>Using this bigram language model, you can now generate random text that hopefully sounds Shakespearean. This works by first randomly selecting a bigram that starts with &lt;s&gt; based on its probability, and then randomly selecting another bigram based on its probability, starting with the second word in the first bigram, and so forth, until we encounter &lt;\/s&gt;. This is implemented in the functions <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/textGen.m\"><tt>textGen<\/tt><\/a> and <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/nextWord.m\"><tt>nextWord<\/tt><\/a>.<\/p><pre class=\"codeinput\">rng(1)                                          <span class=\"comment\">% for reproducibility<\/span>\r\ntextGen(biMdl)                                  <span class=\"comment\">% generate random text<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    'blister d further than groaning for me'\r\n    'this fatal points and every day that shot through all will consents'\r\n    'conceit more most sharp ground of all the wind you and jocund day of t...'\r\n    'alas the measure and put to take thou wast the break at leisure serves...'\r\n    'cast me and said an alderman drawn among these my master and scorn the...'\r\n<\/pre><h4>Generating Trigram Shakespeare Text<a name=\"b4fa1e2a-c45e-4eb1-97a2-c37d3041e1c0\"><\/a><\/h4><p>Bigram sentences sound a bit like Shakespeare but they don't make a lot of sense. Would we do better with a trigram model? Let's try it with <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/trigramClass.m\"><tt>trigramClass<\/tt><\/a>.<\/p><pre class=\"codeinput\">triMdl = trigramClass(delimiters);              <span class=\"comment\">% generate trigrams<\/span>\r\ntriMdl.build(processed, biMdl);                 <span class=\"comment\">% build a trigram model<\/span>\r\nrng(2)                                          <span class=\"comment\">% for reproducibility<\/span>\r\ntextGen(triMdl, <span class=\"string\">'thou'<\/span>)                         <span class=\"comment\">% start with 'thou'<\/span>\r\n<\/pre><pre class=\"codeoutput\">Generating trigrams...\r\n.........................\r\nBuilding a trigram model...\r\n......................\r\nans = \r\n    'thou hither tell me good my friend'\r\n    'thou canst not teach me how i love'\r\n    'thou know st me oft for loving rosaline'\r\n    'thou hither tell me how i love thy wit that ornament to shape and love...'\r\n    'thou cutt st my lodging'\r\n<\/pre><h4>Create a Smartphone App<a name=\"7b46d709-3ba1-410b-95a5-6b4d29800cf5\"><\/a><\/h4><p>If you liked <a href=\"http:\/\/www.xkcd.com\/1068\/\">this XKCD cartoon<\/a> which shows an example of predictive text smartphone app, you may want to create your own. If so, check out this webinar that shows you how to turn MATLAB code into a mobile app through C code generation <a href=\"https:\/\/www.mathworks.com\/videos\/matlab-to-iphone-and-android-made-easy-107779.html\">MATLAB to iPhone and Android Made Easy<\/a><\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/iphoneWebinar.png\" alt=\"\"> <\/p><h4>Summary<a name=\"045d28d9-a216-4d3f-9f86-f2f517bf0099\"><\/a><\/h4><p>You saw that the trigram model worked better than the bigram model, but William Shakespeare would have had nothing to fear about such models taking over his playwright job. We talked about practical uses like auto-completion, auto-correction, speech recognition, etc. We also discusssed how you could go from MATLAB code to a mobile app using C code generation.<\/p><p>In practical natural language processing applications, such as resolving ambiguity between \"recognize speech\" vs. \"wreck a nice beach\" in speech recognition, the model requires further refinements.<\/p><div><ul><li>To score a sentence, you use <a href=\"https:\/\/en.wikipedia.org\/wiki\/Chain_rule\">the chain rule<\/a> to compute a product of a bunch of conditional probabilities. Since they are small numbers, you get even a smaller number by multiplying them, causing <a href=\"https:\/\/en.wikipedia.org\/wiki\/Arithmetic_underflow\">arithmetic underflow<\/a>. We should use log probabilities instead.<\/li><li>How do you deal with new sequences or a new word not seen in the corpus? We need to use smoothing or backoff to account for unseen data.<\/li><\/ul><\/div><p>To learn what you can do with text in MATLAB, check out this awesome introductory book <a title=\"https:\/\/www.mathworks.com\/academia\/books\/book71143.html (link no longer works)\">Text Mining with MATLAB<\/a>.<\/p><p>For a casual predictive text game just for fun, you can play with the simple models I used in this post. Try out the code examples here, and building your own random text generator from any corpus of your interest. Or try to implement the <tt>score<\/tt> method that incorporates the suggested refinements using the code provided here.<\/p><p>If you have an interesting use of language models, please share in the comments <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1217#respond\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_f3a4925b12ca4c82aa5b77b9f60c3edd() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='f3a4925b12ca4c82aa5b77b9f60c3edd ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' f3a4925b12ca4c82aa5b77b9f60c3edd';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2015 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_f3a4925b12ca4c82aa5b77b9f60c3edd()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2015a<br><\/p><\/div><!--\r\nf3a4925b12ca4c82aa5b77b9f60c3edd ##### SOURCE BEGIN #####\r\n%% Text Mining Shakespeare with MATLAB\r\n% Have you ever wondered how Google provides the auto-complete feature in\r\n% Google Suggest? Or sometimes you see results of hilarious or annoying\r\n% auto-correct features on your smartphone? Today's guest blogger, Toshi\r\n% Takeuchi, explains a natural language processing approach through a fun\r\n% text mining example with Shakespeare.\r\n%\r\n% <<wordle.png>>\r\n% \r\n%% Predictive Text Game\r\n% There is a simple but powerful natural language processing approach\r\n% called <https:\/\/en.wikipedia.org\/wiki\/N-gram n-gram>-based\r\n% <https:\/\/en.wikipedia.org\/wiki\/Language_model language models> which you\r\n% can have a lot of fun with using MATLAB.\r\n%\r\n% To see how it works, we will create a predictive text game that generates\r\n% random Shakespearean text automatically. You can also specify the first\r\n% word to generate a random sentence. Here are a couple of auto generated\r\n% fake Shakespearan quotes:\r\n% \r\n%  didst thou kill my cousin romeo\r\n%  parting is such sweet sorrow that i ask again\r\n%  nurse commend me to your daughter\r\n%  borrow cupid s wings and soar with them\r\n%  o mischief thou art like one of these accidents\r\n%  love is a most sharp sauce\r\n%\r\n% I happen to use <https:\/\/www.gutenberg.org\/cache\/epub\/1513\/pg1513.txt\r\n% Romeo and Juliet> from Project Gutenberg for this example, but you can\r\n% use any collection of text data. I almost thought about using comedian\r\n% <http:\/\/thoughtcatalog.com\/kim-quindlen\/2015\/04\/26-amy-schumer-quotes-that-will-make-you-laugh-think-and-feel-understood-all-at-the-same-time\/\r\n% Amy Schumer quotes>. If you have a collection of your own writing, such\r\n% as emails, SMS and such, this can generate text that sounds like you\r\n% (check out <http:\/\/www.xkcd.com\/1068\/ this XKCD cartoon>). If you have\r\n% collection of pirate talks, you can talk like them. That's going to be\r\n% fun.\r\n\r\n%% N-grams\r\n% Let's start with the basics. An N-gram is a sequence of words that appear\r\n% together in a sentence. Commonly word tokens are used, and they are\r\n% unigrams. You can also use a pair of words, and that's a bigram. Trigrams\r\n% use three words... all the way to N-grams for N words. Let's try using\r\n% this <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/ngrams.m ngrams>\r\n% function.\r\n\r\nngrams('a b c d e', 1)                          % unigrams\r\nngrams('a b c d e', 2)                          % bigrams\r\nngrams('a b c d e', 3)                          % trigrams\r\n\r\n%% Language Model\r\n% N-grams are used to predict a sequence of words in a sentence based on\r\n% chained conditional probabilities. These probabilities are estimated by\r\n% mining a collection of text known as a corpus; we will use 'Romeo and\r\n% Juliet' as our corpus. Language models are made up of such word sequence\r\n% probabilities.\r\n% \r\n% Here is a bigram-based example of how you would compute\r\n% such a probability.\r\n%\r\n%  P(word2|word1) = c('word1 word2')\/c(word1)\r\n%\r\n% |P(word2|word1)| is a conditional probability of word2 following word1,\r\n% and you compute it by dividing the count of the igram 'word1 word2' by\r\n% the count of word1. Here is an example for trigrams.\r\n%\r\n%  P(word3|'word1 word2') =  c('word1 word2 word3')\/c('word1 word2') \r\n% \r\n% Word sequences are not always determined by the previous words. This is a\r\n% very simplistic approach (known as a Markov model). However, it is easy\r\n% to model and works reasonably well.\r\n% <https:\/\/en.wikipedia.org\/wiki\/Language_model Wikipedia> provides an\r\n% example of how this can be useful in resolving ambiguity in speech\r\n% recognition applications, where the phrases \"recognize speech\" and \"wreck\r\n% a nice beach\" are pronounced almost the same in American English but mean\r\n% very different things. You can probably guess that \"recognize speech\"\r\n% would have a higher probability than \"wreck a nice beach\". A speech\r\n% recognition application would adopt the higher probability option as the\r\n% answer.\r\n\r\n%% Reading and Preprocessing Shakespeare\r\n% The Project Gutenberg text file is in a plain vanilla ASCII file format\r\n% with LFCR line breaks. It comes with a lot of extra header and footer\r\n% text that we want to remove. I am assuming you have downloaded the text\r\n% file to your current folder. \r\n\r\nromeo = fileread('pg1513.txt');                 % read file content\r\nromeo(1:13303) = [];                            % remove extra header text\r\nromeo(end-144:end) = [];                        % remove extra footer text\r\ndisp(romeo(662:866))                            % preview the text\r\n\r\n%%\r\n% You need to remove non-dialogue text, such as stage directions. You also\r\n% need to add sentence markers at the beginning and end of each, such as\r\n% &lt;s&gt; and &lt;\/s&gt;. We will use sentences with at least 3 words.\r\n% This procedure is handled in the\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/preprocess.m preprocess>\r\n% function.\r\n\r\nprocessed = preprocess(romeo);                  % preprocess text\r\ndisp([processed{6} char(10) processed{7}])      % preview the result \r\nprocessed = lower(processed);                   % lowercase text\r\n\r\n%% Building a Bigram Language Model\r\n% Let's use a simple bigram model with\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/bigramClass.m\r\n% |bigramClass|> to build the first Shakespeare text generator.\r\n\r\ndelimiters = {' ', '!', '''', ',', '-', '.',... % word boundary characters    \r\n    ':', ';', '?', '\\r', '\\n', 'REPLACE_WITH_DASH_DASH', '&'};\r\nbiMdl = bigramClass(delimiters);                % instantiate the class\r\nbiMdl.build(processed);                         % build the model\r\n\r\n%%\r\n% Here is an example of how you use the bigram model to get the probability\r\n% of 'thou art'. Rows represents the first word in a bigram, and columns\r\n% the second.\r\n\r\nrow = strcmp(biMdl.unigrams, 'thou');           % select row for 'thou'\r\ncol= strcmp(biMdl.unigrams, 'art');             % select col for 'art'\r\nbiMdl.mdl(row,col)                              % probability of 'thou art'\r\n\r\n%% Generating Bigram Shakespeare Text\r\n% Using this bigram language model, you can now generate random text that\r\n% hopefully sounds Shakespearean. This works by first randomly selecting a\r\n% bigram that starts with &lt;s&gt; based on its probability, and then\r\n% randomly selecting another bigram based on its probability, starting\r\n% with the second word in the first bigram, and so forth, until we\r\n% encounter &lt;\/s&gt;. This is implemented in the functions\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/textGen.m |textGen|> and\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/nextWord.m |nextWord|>.\r\n\r\nrng(1)                                          % for reproducibility\r\ntextGen(biMdl)                                  % generate random text\r\n\r\n%% Generating Trigram Shakespeare Text\r\n% Bigram sentences sound a bit like Shakespeare but they don't make a lot\r\n% of sense. Would we do better with a trigram model? Let's try it with\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/trigramClass.m\r\n% |trigramClass|>.\r\n\r\ntriMdl = trigramClass(delimiters);              % generate trigrams\r\ntriMdl.build(processed, biMdl);                 % build a trigram model\r\nrng(2)                                          % for reproducibility\r\ntextGen(triMdl, 'thou')                         % start with 'thou'\r\n\r\n%% Create a Smartphone App\r\n% If you liked <http:\/\/www.xkcd.com\/1068\/ this XKCD cartoon> which shows an\r\n% example of predictive text smartphone app, you may want to create your\r\n% own. If so, check out this webinar that shows you how to turn MATLAB\r\n% code into a mobile app through C code generation\r\n% <https:\/\/www.mathworks.com\/videos\/matlab-to-iphone-and-android-made-easy-107779.html\r\n% MATLAB to iPhone and Android Made Easy>\r\n%\r\n% <<https:\/\/blogs.mathworks.com\/images\/loren\/2014\/iphoneWebinar.png>>\r\n%\r\n\r\n%% Summary\r\n% You saw that the trigram model worked better than the bigram model, but\r\n% William Shakespeare would have had nothing to fear about such models\r\n% taking over his playwright job. We talked about practical uses like\r\n% auto-completion, auto-correction, speech recognition, etc. We also\r\n% discusssed how you could go from MATLAB code to a mobile app using C code\r\n% generation.\r\n%\r\n% In practical natural language processing applications, such as resolving\r\n% ambiguity between \"recognize speech\" vs. \"wreck a nice beach\" in speech\r\n% recognition, the model requires further refinements.\r\n% \r\n% * To score a sentence, you use <https:\/\/en.wikipedia.org\/wiki\/Chain_rule\r\n% the chain rule> to compute a product of a bunch of conditional\r\n% probabilities. Since they are small numbers, you get even a smaller\r\n% number by multiplying them, causing\r\n% <https:\/\/en.wikipedia.org\/wiki\/Arithmetic_underflow arithmetic\r\n% underflow>. We should use log probabilities instead.\r\n% * How do you deal with new sequences or a new word not seen in\r\n% the corpus? We need to use smoothing or backoff to account for unseen\r\n% data.\r\n%\r\n% To learn what you can do with text in MATLAB, check out this awesome\r\n% introductory book <https:\/\/www.mathworks.com\/academia\/books\/book71143.html\r\n% Text Mining with MATLAB>.\r\n% \r\n% For a casual predictive text game just for fun, you can play with the\r\n% simple models I used in this post.\r\n% <https:\/\/www.mathworks.com\/programs\/trials\/trial_request.html Try out>\r\n% the code examples here, and building your own random text generator from\r\n% any corpus of your interest. Or try to implement the |score| method that\r\n% incorporates the suggested refinements using the code provided here.\r\n% \r\n% If you have an interesting use of language models, please share in the\r\n% comments <https:\/\/blogs.mathworks.com\/loren\/?p=1217#respond here>.\r\n##### SOURCE END ##### f3a4925b12ca4c82aa5b77b9f60c3edd\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/iphoneWebinar.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>Have you ever wondered how Google provides the auto-complete feature in Google Suggest? Or sometimes you see results of hilarious or annoying auto-correct features on your smartphone? Today's guest blogger, Toshi Takeuchi, explains a natural language processing approach through a fun text mining example with Shakespeare.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/09\/09\/text-mining-shakespeare-with-matlab\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66,43,61,48,1],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1217"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1217"}],"version-history":[{"count":6,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1217\/revisions"}],"predecessor-version":[{"id":3794,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1217\/revisions\/3794"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1217"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1217"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1217"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}