{"id":1134,"date":"2015-04-08T09:03:05","date_gmt":"2015-04-08T14:03:05","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1134"},"modified":"2016-11-17T10:35:20","modified_gmt":"2016-11-17T15:35:20","slug":"can-you-find-love-through-text-analytics","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/","title":{"rendered":"Can You Find Love through Text Analytics?"},"content":{"rendered":"<div class=\"content\"><!--introduction--><a href=\"https:\/\/www.youtube.com\/watch?v=qtsNbxgPngA\">Jimmy Fallon Blew a Chance to Date Nicole Kidman<\/a>, but do you know there is supposedly a way to fall in love with anyone? Today&#8217;s guest blogger, Toshi Takeuchi, would like to talk about finding love with MATLAB.<\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/first_date.jpg\" alt=\"\" hspace=\"5\" vspace=\"5\" \/><\/p>\n<p><!--\/introduction--><\/p>\n<h3>Contents<\/h3>\n<div>\n<ul>\n<li><a href=\"#075824a4-fcd4-4e84-b2fe-e614ef41de48\">Love Experiment<\/a><\/li>\n<li><a href=\"#97231184-c1bb-4752-9786-d753f2051581\">Latent Semantic Analysis with MATLAB<\/a><\/li>\n<li><a href=\"#25e5bf90-ac41-4fd5-8379-34f054527348\">Text Processing Pipeline<\/a><\/li>\n<li><a href=\"#519c6745-5879-4dff-bb28-f0f135989966\">TF-IDF Weighting<\/a><\/li>\n<li><a href=\"#02b28d57-e3f7-45c4-85f5-44a2ea86ed15\">Low-Rank Approximation<\/a><\/li>\n<li><a href=\"#227ab8a0-2b40-4430-832a-5cbf30dd7d8b\">Visualize Online Dating Profiles<\/a><\/li>\n<li><a href=\"#64356d29-b353-4b7d-b0ee-76fb6112babd\">Computing Similarity<\/a><\/li>\n<li><a href=\"#513e9613-3e05-4fc8-a929-507d4fe35a48\">Getting the Ranked Matches<\/a><\/li>\n<li><a href=\"#a17246ff-7b72-4209-97f0-fb3956c1150d\">What about Japanese Text?<\/a><\/li>\n<li><a href=\"#b02563c1-e2cd-4f2b-9ccf-b338d6be27a6\">Call for Action<\/a><\/li>\n<\/ul>\n<\/div>\n<h4>Love Experiment<a 
name=\"075824a4-fcd4-4e84-b2fe-e614ef41de48\"><\/a><\/h4>\n<p>I read a very intriguing New York Times article <a href=\"http:\/\/www.nytimes.com\/2015\/01\/11\/fashion\/modern-love-to-fall-in-love-with-anyone-do-this.html?_r=0\">To Fall in Love With Anyone, Do This<\/a>. It was about an experiment that went like this:<\/p>\n<p><i>&#8220;Two heterosexual strangers sat face to face in a lab and answered a series of 36 increasingly personal questions. Then they stared silently into each other&#8217;s eyes for four minutes. Six months later, they were married.&#8221;<\/i><\/p>\n<p>I wanted to see if someone could try it. Luckily, a friend of mine in Japan was keen to give it a try, but there was one minor issue: she couldn&#8217;t find any male counterpart who was willing to join her in this experiment.<\/p>\n<p>This is a big issue in Japan, where population growth has turned negative. There is even a new word, <a href=\"http:\/\/www.wsj.com\/articles\/SB124623617832566695\">Konkatsu<\/a>, for the intensive effort required to get married. Before we can do this experiment, we need to solve this problem first. A lot of people turn to online dating for that, but that is not so easy, either. Do you need some evidence?<\/p>\n<div>\n<ul>\n<li><a href=\"http:\/\/www.wired.com\/2014\/01\/how-to-hack-okcupid\/\">How a Math Genius Hacked OkCupid to Find True Love<\/a><\/li>\n<li><a href=\"https:\/\/www.youtube.com\/watch?v=d6wG_sAdP0U\">Amy Webb: How I hacked online dating<\/a><\/li>\n<li><a href=\"http:\/\/www.theguardian.com\/lifeandstyle\/2015\/feb\/24\/i-created-a-bot-to-find-love-online-reader-it-worked\">I created a bot to find love online \u2013 reader, it worked<\/a><\/li>\n<\/ul>\n<\/div>\n<h4>Latent Semantic Analysis with MATLAB<a name=\"97231184-c1bb-4752-9786-d753f2051581\"><\/a><\/h4>\n<p>In the online dating world, you need to comb through a mind-numbing volume of profiles just to get started. 
Then came the idea: <b>why not use MATLAB to mine online profiles to find your love?<\/b><\/p>\n<p>We need data to analyze. I don&#8217;t have access to real online dating profiles, but luckily I found <a href=\"http:\/\/laurenhallden.com\/datingipsum\/\">Online Dating Ipsum<\/a> by Lauren Hallden that randomly generates fictitious ones. I used <a href=\"http:\/\/en.wikipedia.org\/wiki\/Latent_semantic_analysis\">Latent Semantic Analysis<\/a> (LSA) to cluster online profiles based on the words they contain. I cooked up a MATLAB class <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/myLSA.m\"><tt>myLSA.m<\/tt><\/a> to implement Latent Semantic Analysis methods. Let&#8217;s initialize it into an object called <tt>LSA<\/tt>, and load the dataset and print one of those.<\/p>\n<pre class=\"codeinput\">LSA = myLSA();\r\nprofiles = readtable(<span class=\"string\">'online_profiles.xlsx'<\/span>);\r\nfprintf(<span class=\"string\">'%s\\n%s\\n%s\\n%s\\n%s\\n'<\/span>,profiles.Profile{1}(1:73),<span class=\"keyword\">...<\/span>\r\n    profiles.Profile{1}(74:145),profiles.Profile{1}(146:219),<span class=\"keyword\">...<\/span>\r\n    profiles.Profile{1}(220:291),profiles.Profile{1}(292:358))\r\n<\/pre>\n<pre class=\"codeoutput\">Working at a coffee shop adventures tacos medical school. Feminism going \r\nto the gym strong and confident Family Guy listening to music, my beard \r\nKurosawa discussing politics trying different restaurants I know I listed \r\nmore than 6 things. Snowboarding no drama outdoor activities discussing \r\npolitics pickles my friends tell me they don't get why I'm single. \r\n<\/pre>\n<p>Not bad for a random word salad, except that they are all male profiles. If you need female profiles, you need to find other sources.<\/p>\n<h4>Text Processing Pipeline<a name=\"25e5bf90-ac41-4fd5-8379-34f054527348\"><\/a><\/h4>\n<p>Before we can analyze text, we need to process it into an appropriate form. 
There is a fairly standard process for English text.<\/p>\n<div>\n<ol>\n<li>Tokenization: split text into word tokens using white space, etc.<\/li>\n<li>Standardization: standardize word forms, e.g., convert everything to lowercase<\/li>\n<li>Stopword removal: remove common words, such as &#8216;the, a, at, to&#8217;<\/li>\n<li>Stemming: reduce words to their root forms by trimming their endings<\/li>\n<li>Indexing: sort the words by document and count word frequencies<\/li>\n<li>Document-Term Frequency Matrix: turn indexed frequency counts into a document x term matrix<\/li>\n<\/ol>\n<\/div>\n<p>The <tt>tokenizer<\/tt> method takes care of the first four steps &#8211; tokenization, standardization, stopword removal, and stemming. Check out the before and after.<\/p>\n<pre class=\"codeinput\">tokenized = LSA.tokenizer(profiles.Profile);\r\nbefore = profiles.Profile(1)\r\nafter = {strjoin(tokenized{1},<span class=\"string\">' '<\/span>)}\r\n<\/pre>\n<pre class=\"codeoutput\">before = \r\n    'Working at a coffee shop adventures tacos medical school. Feminism goi...'\r\nafter = \r\n    'work coffe shop adventur taco medic school femin go gym strong confid ...'\r\n<\/pre>\n<p>Next, the <tt>indexer<\/tt> method creates word lists and word count vectors.<\/p>\n<pre class=\"codeinput\">[word_lists,word_counts] = LSA.indexer(tokenized);\r\n<\/pre>\n<p>Then we create a document-term frequency matrix from these using <tt>docterm<\/tt>. The minimum frequency is set to 2, which drops any words that occur only once across the entire collection of documents.<\/p>\n<pre class=\"codeinput\">docterm = LSA.docterm(word_lists,word_counts,2);\r\n<\/pre>\n<h4>TF-IDF Weighting<a name=\"519c6745-5879-4dff-bb28-f0f135989966\"><\/a><\/h4>\n<p>You could use the document-term frequency matrix directly, but raw word counts are problematic &#8211; they give too much weight to frequent words, and words that appear in many documents are usually not very useful for understanding the differences among those documents. 
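<\/p>\n<p>To make the fix concrete before introducing it formally, here is a toy sketch of one common weighting variant (an illustration only &#8211; the exact formula implemented in <tt>myLSA.m<\/tt> may differ): each term is down-weighted by the log of the ratio of the total document count to the number of documents containing that term.<\/p>\n<pre class=\"language-matlab\">% Toy sketch of a common TF-IDF variant (assumed formula; not from myLSA.m)\r\ncounts = [2 0 1; 1 1 1; 0 0 1];   % 3 documents x 3 terms, raw counts\r\nN   = size(counts,1);             % number of documents\r\ndf  = sum(counts &gt; 0,1);          % documents containing each term: [2 1 3]\r\nidf = log(N .\/ df);               % rarer terms get larger weights\r\nw   = counts .* idf;              % term 3 is in every document, so its idf = 0\r\n<\/pre>\n<p>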
We would like the weights to reflect how relevant each word is to a given document.<\/p>\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Tf%E2%80%93idf\">TF-IDF<\/a> is a common method for frequency weighting. It is made up of TF, which stands for Term Frequency, and IDF, Inverse Document Frequency. TF scales based on the number of times a given term appears in a document, and IDF inversely scales based on how many documents a given term appears in. The more documents a word appears in, the less weight it gets. TF-IDF is just the product of those two metrics. Let&#8217;s use <tt>tfidf<\/tt> to apply this weighting scheme. It also optionally returns TF.<\/p>\n<pre class=\"codeinput\">tfidf = LSA.tfidf(docterm);\r\n<\/pre>\n<p>I went through each step of text processing, but we could instead run <tt>vectorize<\/tt> to turn a raw cell array of online dating profiles into a TF-IDF weighted matrix in one shot.<\/p>\n<pre>  tfidf = LSA.vectorize(profiles.Profile,2);<\/pre>\n<h4>Low-Rank Approximation<a name=\"02b28d57-e3f7-45c4-85f5-44a2ea86ed15\"><\/a><\/h4>\n<p>Once the data is transformed into a matrix, we can apply linear algebra techniques for further analysis. In LSA, you typically apply singular value decomposition (SVD) to find a low-rank approximation.<\/p>\n<p>Let&#8217;s first get the components of the SVD. U is the SVD document matrix, V is the SVD term matrix, and S contains the singular values.<\/p>\n<pre class=\"codeinput\">[U,S,V] = svd(tfidf);\r\n<\/pre>\n<p>If you square <tt>S<\/tt> and divide by the sum of <tt>S<\/tt> squared, you get the percentage of variance explained. 
Let&#8217;s plot the cumulative values.<\/p>\n<pre class=\"codeinput\">explained = cumsum(S.^2\/sum(S.^2));\r\nfigure\r\nplot(1:size(S,1),explained)\r\nxlim([1 30]);ylim([0 1]);\r\nline([5 5],[0 explained(5)],<span class=\"string\">'Color'<\/span>,<span class=\"string\">'r'<\/span>)\r\nline([0 5],[explained(5) explained(5)],<span class=\"string\">'Color'<\/span>,<span class=\"string\">'r'<\/span>)\r\ntitle(<span class=\"string\">'Cumulative sum of S^2 divided by sum of S^2'<\/span>)\r\nxlabel(<span class=\"string\">'Column'<\/span>)\r\nylabel(<span class=\"string\">'% variance explained'<\/span>)\r\n<\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/findingLoveUpdate2_01.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/><\/p>\n<p>You see that the first 5 columns explain 60% of variance. A rank-5 approximation will retain 60% of the information of the original matrix. The <tt>myLSA<\/tt> class also provides <tt>lowrank<\/tt> that performs SVD and returns a low rank approximation based on some criteria, such as number of columns or the percentage of variance explained.<\/p>\n<pre class=\"language-matlab\">[Uk,Sk,Vk] = LSA.lowrank(tfidf,0.6);\r\n<\/pre>\n<h4>Visualize Online Dating Profiles<a name=\"227ab8a0-2b40-4430-832a-5cbf30dd7d8b\"><\/a><\/h4>\n<p>We can also use the first 2 columns to plot the SVD document matrix U and SVD term matrix V in 2D space. 
The blue dots represent online dating profiles and words around them are semantically associated to those profiles.<\/p>\n<pre class=\"codeinput\">figure()\r\nscatter(U(:,1), U(:,2),<span class=\"string\">'filled'<\/span>)\r\ntitle(<span class=\"string\">'Online Dating Profiles and Words'<\/span>)\r\nxlabel(<span class=\"string\">'Dimension 1'<\/span>)\r\nylabel(<span class=\"string\">'Dimension 2'<\/span>)\r\nxlim([-.3 -.03]); ylim([-.2 .45])\r\n<span class=\"keyword\">for<\/span> i = [1,4,9,12,15,16,20,22,23,24,25,27,29,33,34,35,38,47,48,53,57,58,<span class=\"keyword\">...<\/span>\r\n        64,73,75,77,80,82,83,85,88,97,98,103,113,114,116,118,120,125,131,<span class=\"keyword\">...<\/span>\r\n        136,142,143,156,161,162,166,174,181,185,187,199,200,204,206,212,<span class=\"keyword\">...<\/span>\r\n        234,251]\r\n    text(V(i,1).*3, V(i,2).*3, LSA.vocab(i))\r\n<span class=\"keyword\">end<\/span>\r\ntext(-0.25,0.4,<span class=\"string\">'Wholesome\/Sporty'<\/span>,<span class=\"string\">'FontSize'<\/span>, 12, <span class=\"string\">'Color'<\/span>, <span class=\"string\">'b'<\/span>)\r\ntext(-0.15,-0.15,<span class=\"string\">'Bad Boy\/Colorful'<\/span>,<span class=\"string\">'FontSize'<\/span>, 12, <span class=\"string\">'Color'<\/span>, <span class=\"string\">'b'<\/span>)\r\n<\/pre>\n<p><img decoding=\"async\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/findingLoveUpdate2_02.png\" alt=\"\" hspace=\"5\" vspace=\"5\" \/><\/p>\n<p>You can see there are two main clusters &#8211; what I would call the &#8220;Wholesome\/Sporty&#8221; cluster and one called the &#8220;Bad Boy\/Colorful&#8221; cluster, based on the words associated with them. 
This makes sense, because Lauren provides two options in her profile generator:<\/p>\n<div>\n<ul>\n<li>Typical inane jabber<\/li>\n<li>With a side of crazy sauce<\/li>\n<\/ul>\n<\/div>\n<p>Can you guess which cluster belongs to which category?<\/p>\n<p>Now you can cluster a whole bunch of profiles at once and quickly eliminate those that don&#8217;t match your taste. You can also add your own profile to see which cluster you belong to, and, if that puts you in the wrong cluster, you may want to update your profile.<\/p>\n<h4>Computing Similarity<a name=\"64356d29-b353-4b7d-b0ee-76fb6112babd\"><\/a><\/h4>\n<p>Say you find a cluster of profiles you are interested in. Among the profiles you see there, which one is the closest to your taste? To answer this question, we need a way to define the similarity of two documents. If you use the Euclidean distance between vectors, long and short documents can end up far apart even if they share many of the same words. Instead, we can use the angle between the vectors to determine similarity. This is known as the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Vector_space_model\">Vector Space Model<\/a>. For ease of computation, the cosine of that angle is used as the similarity score.<\/p>\n<pre>  cosine = dot(A,B)\/(norm(A)*norm(B))<\/pre>\n<p>The greater the value, the more similar the two documents.<\/p>\n<pre>   Angle     Cosine\r\n___________  ______\r\n  0 degrees     1\r\n 90 degrees     0\r\n180 degrees    -1<\/pre>\n<p>For a practical implementation, you can simply length-normalize the vectors by the L2 norm and compute the dot product.<\/p>\n<pre>  cosine = dot(A\/norm(A),B\/norm(B))<\/pre>\n<p>You can apply length normalization ahead of the similarity computation. 
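<\/p>\n<p>Here is a quick sanity check with two illustrative vectors (not taken from the dataset) showing that the two formulas agree:<\/p>\n<pre class=\"language-matlab\">% Normalizing first and then taking the dot product matches the cosine formula\r\nA = [1 2 0];\r\nB = [2 3 1];\r\ndirect = dot(A,B)\/(norm(A)*norm(B))   % 0.9562\r\nnormed = dot(A\/norm(A),B\/norm(B))     % 0.9562, same value\r\n<\/pre>\n<p>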
We will use the rank-5 approximation of the SVD document matrix to compare online dating profiles using <tt>normalize<\/tt>.<\/p>\n<pre class=\"codeinput\">doc_norm = LSA.normalize(U(:,1:5));\r\n<\/pre>\n<p>Now we can compute cosine similarities between profiles with <tt>score<\/tt>. Let&#8217;s compare the first profile to the first five profiles.<\/p>\n<pre class=\"codeinput\">LSA.score(doc_norm(1:5,:),doc_norm(1,:))\r\n<\/pre>\n<pre class=\"codeoutput\">ans =\r\n            1\r\n      0.20974\r\n      0.55248\r\n      0.97436\r\n      0.72994\r\n<\/pre>\n<p>The first score is 1, which means it is a perfect match, and that&#8217;s because we are comparing the first profile to itself. Other profiles got lower scores depending on how similar they are to the first profile.<\/p>\n<h4>Getting the Ranked Matches<a name=\"513e9613-3e05-4fc8-a929-507d4fe35a48\"><\/a><\/h4>\n<p>It&#8217;s probably useful if you can describe your ideal date and find the profiles that match your description ordered by similarity. It is a bit like a search engine.<\/p>\n<p>To compare the new text string to the pre-computed matrix, we need to apply the same pre-processing steps that we have already seen. <tt>query<\/tt> can take care of the tedious details.<\/p>\n<pre class=\"codeinput\">q = <span class=\"string\">'someone fun to hang out with, good sense of humor, likes sushi,'<\/span>;\r\nq = [q <span class=\"string\">'watches Game of Thrones, sees foreign films, listens to music,'<\/span>];\r\nq = [q <span class=\"string\">'do outdoor activities or fitness'<\/span>];\r\nweighted_q = LSA.query(q);\r\n<\/pre>\n<p>Now we need to transform the query vector into the rank-5 document space. 
This is done by rearranging <tt>M = U*S*V'<\/tt> into <tt>U = M*V*S^-1<\/tt>, then substituting the query vector for <tt>M<\/tt> and the low-rank approximations for <tt>V<\/tt> and <tt>S<\/tt>.<\/p>\n<pre class=\"codeinput\">q_reduced = weighted_q * V(:,1:5) * S(1:5,1:5)^-1;\r\n<\/pre>\n<p>The <tt>myLSA<\/tt> class also provides the <tt>reduce<\/tt> method to perform the same operation.<\/p>\n<pre class=\"language-matlab\">q_reduced = LSA.reduce(weighted_q);\r\n<\/pre>\n<p>Then we can length-normalize the query vector and compute the dot products with the documents. Let&#8217;s sort the cosine similarities in descending order, and check the top 3 results.<\/p>\n<pre class=\"codeinput\">q_norm = LSA.normalize(q_reduced);\r\n[scores,idx] = sort(LSA.score(doc_norm,q_norm),<span class=\"string\">'descend'<\/span>);\r\n\r\ndisp(<span class=\"string\">'Top 3 Profiles'<\/span>)\r\n<span class=\"keyword\">for<\/span> i = 1:3\r\n    profiles.Profile(idx(i))\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre>\n<pre class=\"codeoutput\">Top 3 Profiles\r\nans = \r\n    'Someone who shares my sense of humor fitness my goofy smile Oxford com...'\r\nans = \r\n    'My cats I'm really good at my goofy smile mountain biking. Fixing up m...'\r\nans = \r\n    'My eyes just looking to have some fun if you think we have something i...'\r\n<\/pre>\n<p>Looks pretty reasonable to me!<\/p>\n<p>In this example, we applied TF-IDF weighting to both the document-term frequency matrix and the query vector. However, to save computation, you only need to apply IDF once &#8211; to the query. This approach is known as <i>lnc.ltc<\/i> in the <a href=\"http:\/\/en.wikipedia.org\/wiki\/SMART_Information_Retrieval_System\">SMART notation system<\/a>. We already processed our query in <i>ltc<\/i> format. 
Here is how you do <i>lnc<\/i> for your documents &#8211; you use just TF instead of TF-IDF:<\/p>\n<pre class=\"language-matlab\">[~, tf] = LSA.vectorize(profiles.Profile,2);\r\ndoc_reduced = LSA.lowrank(tf,0.6);\r\ndoc_norm = LSA.normalize(doc_reduced);\r\n<\/pre>\n<h4>What about Japanese Text?<a name=\"a17246ff-7b72-4209-97f0-fb3956c1150d\"><\/a><\/h4>\n<p>Can my Japanese friends benefit from this technique? Yes, definitely. Once you have the document-term frequency matrix, the rest is exactly the same. The hardest part is tokenization, because there is no whitespace between words in Japanese text.<\/p>\n<p>Fortunately, there are free tools that do just that &#8211; they are called Japanese morphological analyzers. One of the most popular analyzers is MeCab. A binary package is available for installation on Windows, but it is 32-bit and doesn&#8217;t work with 64-bit MATLAB. My Japanese colleague, <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/2409625-takuya-otani\">Takuya Otani<\/a>, compiled the source code so that it runs with 64-bit MATLAB on Windows.<\/p>\n<p>MATLAB provides an interface to shared libraries like DLLs, and we can use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/loadlibrary.html\"><tt>loadlibrary<\/tt><\/a> to load them into memory and access their functions. Here is an example of how to call the shared library <tt>libmecab.dll<\/tt> that Takuya compiled.<\/p>\n<p>You may not have any particular need for handling Japanese text, but this gives you a good example of how to load a DLL into MATLAB and call its functions. 
Please note some requirements in case you want to try it:<\/p>\n<div>\n<ul>\n<li>Have a 64-bit Japanese Windows computer with 64-bit MATLAB<\/li>\n<li>Have a <a href=\"https:\/\/www.mathworks.com\/support\/compilers\/\">MATLAB-compatible compiler<\/a> installed and enabled on your computer<\/li>\n<li>Follow <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/mecab_x64_build_procedure_for_MATLAB.pdf\">Takuya&#8217;s instructions<\/a> to compile your own 64-bit DLL and place it in your current folder along with its header file.<\/li>\n<\/ul>\n<\/div>\n<pre class=\"language-matlab\">loadlibrary(<span class=\"string\">'libmecab.dll'<\/span>, <span class=\"string\">'mecab.h'<\/span>);\r\n<\/pre>\n<p>When you run this command, you may get several warnings, but you can ignore them. To check that the library was loaded, use the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/libfunctionsview.html\"><tt>libfunctionsview<\/tt><\/a> function to view the functions available in the DLL.<\/p>\n<pre class=\"language-matlab\">libfunctionsview(<span class=\"string\">'libmecab'<\/span>)\r\n<\/pre>\n<p>To call a function in the DLL, use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/calllib.html\"><tt>calllib<\/tt><\/a>. 
In the case of MeCab, you need to initialize MeCab and obtain a pointer to it first.<\/p>\n<pre class=\"language-matlab\">argv = libpointer(<span class=\"string\">'stringPtrPtr'<\/span>, {<span class=\"string\">'MeCab'<\/span>});\r\nargc = 1;\r\nmecab = calllib(<span class=\"string\">'libmecab'<\/span>, <span class=\"string\">'mecab_new'<\/span>, argc, argv);\r\n<\/pre>\n<p>As an example, let&#8217;s call one of the MeCab functions you can use to analyze Japanese text &#8211; <tt>mecab_sparse_tostr<\/tt>.<\/p>\n<pre class=\"language-matlab\">text = <span class=\"string\">'Some Japanese text'<\/span>;\r\nresult = calllib(<span class=\"string\">'libmecab'<\/span>, <span class=\"string\">'mecab_sparse_tostr'<\/span>, mecab, text);\r\n<\/pre>\n<p>When finished, clear the pointer and unload the DLL from memory using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/unloadlibrary.html\"><tt>unloadlibrary<\/tt><\/a>.<\/p>\n<pre class=\"language-matlab\">clearvars <span class=\"string\">mecab<\/span>\r\nunloadlibrary(<span class=\"string\">'libmecab'<\/span>)\r\n<\/pre>\n<h4>Call for Action<a name=\"b02563c1-e2cd-4f2b-9ccf-b338d6be27a6\"><\/a><\/h4>\n<p>If you happen to be single and are willing to try the experiment described in the New York Times article, please report back <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1134#respond\">here<\/a> with your results. 
The New York Times now provides <a href=\"http:\/\/www.nytimes.com\/2015\/02\/13\/style\/the-36-questions-on-the-way-to-love.html\">a free app<\/a> to generate 36 magical questions!<\/p>\n<p><script>\/\/ <![CDATA[\nfunction grabCode_8fc582e553824247b94d6c74851907b0() {\n        \/\/ Remember the title so we can use it in the new page\n        title = document.title;\n\n        \/\/ Break up these strings so that their presence\n        \/\/ in the Javascript doesn't mess up the search for\n        \/\/ the MATLAB code.\n        t1='8fc582e553824247b94d6c74851907b0 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 8fc582e553824247b94d6c74851907b0';\n    \n        b=document.getElementsByTagName('body')[0];\n        i1=b.innerHTML.indexOf(t1)+t1.length;\n        i2=b.innerHTML.indexOf(t2);\n \n        code_string = b.innerHTML.substring(i1, i2);\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\n\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \n        \/\/ in the XML parser.\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\n        \/\/ doesn't go ahead and substitute the less-than character. 
\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\n\n        copyright = 'Copyright 2015 The MathWorks, Inc.';\n\n        w = window.open();\n        d = w.document;\n        d.write('\n\n\n\n<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\n\n\n\n\n\\n');\n\n        d.title = title + ' (MATLAB code)';\n        d.close();\n    }\n\/\/ ]]><\/script><\/p>\n<p style=\"text-align: right; font-size: xx-small; font-weight: lighter; font-style: italic; color: gray;\">\n<a><span style=\"font-size: x-small; font-style: italic;\">Get<br \/>\nthe MATLAB code<noscript>(requires JavaScript)<\/noscript><\/span><\/a><\/p>\n<p>Published with MATLAB\u00ae R2015a<\/p>\n<\/div>\n<p><!--\n8fc582e553824247b94d6c74851907b0 ##### SOURCE BEGIN #####\n%% Can You Find Love through Text Analytics?\n% <https:\/\/www.youtube.com\/watch?v=qtsNbxgPngA Jimmy Fallon Blew a Chance % to Date Nicole Kidman>, but do you know there is supposedly a way to fall\n% in love with anyone? Today's guest blogger, Toshi Takeuchi, would like to\n% talk about finding love with MATLAB.\n%\n% <<first_date.jpg>>\n%\n%% Love Experiment\n% I read a very intriguing New York Times article\n% <http:\/\/www.nytimes.com\/2015\/01\/11\/fashion\/modern-love-to-fall-in-love-with-anyone-do-this.html % To Fall in Love With Anyone, Do This>. It was about an experiment that\n% went like this:\n%\n% _\"Two heterosexual strangers sat face to face in a lab and\n% answered a series of 36 increasingly personal questions. Then they stared\n% silently into each other's eyes for four minutes. Six months later, they\n% were married.\"_\n%\n% I wanted to see if someone could try it. 
Luckily, a friend of mine in\n% Japan was keen to give it a try, but there was one minor issue:\n% she couldn't find any male counterpart who was willing to join her in\n% this experiment.\n%\n% This is a big issue in Japan where the birthrate went negative. There is\n% even a new word, <http:\/\/www.wsj.com\/articles\/SB124623617832566695 % Konkatsu>, for the intensive effort required to get married. Before we\n% can do this experiment, we need to solve this problem first. A lot of\n% people turn to online dating for that, but that is not so easy, either.\n% Do you need some evidence?\n%\n% * <http:\/\/www.wired.com\/2014\/01\/how-to-hack-okcupid\/ How a Math Genius % Hacked OkCupid to Find True Love>\n% * <https:\/\/www.youtube.com\/watch?v=d6wG_sAdP0U Amy Webb: How I hacked % online dating>\n% * <http:\/\/www.theguardian.com\/lifeandstyle\/2015\/feb\/24\/i-created-a-bot-to-find-love-online-reader-it-worked % I created a bot to find love online \u00e2\u20ac\u201c reader, it worked>\n%\n%% Latent Semantic Analysis with MATLAB\n% In an online dating world you need to comb through a mind-numbing volume\n% of profiles just to get started. Then came the idea: *why not use MATLAB\n% to mine online profiles to find your love?*\n%\n% We need data to analyze. I don't have access to real online dating\n% profiles, but luckily I found <http:\/\/laurenhallden.com\/datingipsum\/ % Online Dating Ipsum> by Lauren Hallden that randomly generates fictitious\n% ones. I used <http:\/\/en.wikipedia.org\/wiki\/Latent_semantic_analysis % Latent Semantic Analysis> (LSA) to cluster online profiles based on the\n% words they contain. I cooked up a MATLAB class\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/myLSA.m |myLSA.m|> to\n% implement Latent Semantic Analysis methods. 
Let's initialize it into an\n% object called |LSA|, and load the dataset and print one of those.\n\nLSA = myLSA();\nprofiles = readtable('online_profiles.xlsx');\nfprintf('%s\\n%s\\n%s\\n%s\\n%s\\n',profiles.Profile{1}(1:73),...\nprofiles.Profile{1}(74:145),profiles.Profile{1}(146:219),...\nprofiles.Profile{1}(220:291),profiles.Profile{1}(292:358))\n\n%%\n% Not bad for a random word salad, except that they are all male profiles.\n% If you need female profiles, you need to find other sources.\n%\n%% Text Processing Pipeline\n% Before we can analyze text, we need to process it into an appropriate\n% form. There is a fairly standard process for English text.\n%\n% # Tokenization: split text into word tokens using white space, etc.\n% # Standardization: standardize word forms, i.e., all lowercase\n% # Stopwords: remove common words, such as 'the, a, at, to'\n% # Stemming: reduce words into their root forms by trimming their endings\n% # Indexing: sort the words by document and count word frequencies\n% # Document-Term Frequency Matrix: turn indexed frequency counts into a\n% document x term matrix\n%\n% The |tokenizer| method takes care of the first four steps - tokenization,\n% normalization, stopwords and stemming. Check out the before and after.\n\ntokenized = LSA.tokenizer(profiles.Profile);\nbefore = profiles.Profile(1)\nafter = {strjoin(tokenized{1},' ')}\n\n%%\n% Next, the |indexer| method creates word lists and word count vectors.\n[word_lists,word_counts] = LSA.indexer(tokenized);\n\n%%\n% Then we create a document-term frequency matrix from these using\n% |docterm|. 
The minimum frequency is set to 2 and that drops any words\n% that only occur once through the entire collection of documents.\ndocterm = LSA.docterm(word_lists,word_counts,2);\n\n%% TF-IDF Weighting\n% You could use the document-term frequency matrix directly, but raw word\n% count is problematic - it gives too much weight to frequent words, and\n% frequent words that appear in many documents are usually not so useful to\n% understand the differences among those documents. We would like to see\n% the weight to represent the relevancy of each word.\n%\n% <http:\/\/en.wikipedia.org\/wiki\/Tf%E2%80%93idf TF-IDF> is a common method\n% for frequency weighting. It is made up of TF, which stands for Term\n% Frequency, and IDF, Inverse Document Frequency. TF scales based on the\n% number of times a given term appears in a document, and IDF inversely\n% scales based on how many document a given term appears in. The more\n% frequently a word appears in documents, the less weight it gets. TF-IDF\n% is just a product of those two metrics. Let's use |tfidf| to apply this\n% weighting scheme. It also optionally returns TF.\n\ntfidf = LSA.tfidf(docterm);\n\n%%\n% I went through each step of text processing, but we could instead run\n% |vectorize| to turn a raw cell array of online dating profiles\n% into a TF-IDF weighted matrix in one shot.\n%\n%    tfidf = LSA.vectorize(profiles.Profile,2);\n\n%% Low-Rank Approximation\n% Once the data is transformed into a matrix, we can apply linear algebra\n% techniques for further analysis. In LSA, you typically apply singular\n% value decomposition (SVD) to find a low-rank approximation.\n%\n% Let's first get the components of SVD. U is the SVD document matrix, V is\n% the SVD term matrix, and S is the singular values.\n\n[U,S,V] = svd(tfidf);\n\n%%\n% If you square |S| and divide it by sum of |S| squared, you get the\n% percentage of variance explained. 
Let's plot the cumulative values.\n\nexplained = cumsum(S.^2\/sum(S.^2));\nfigure\nplot(1:size(S,1),explained)\nxlim([1 30]);ylim([0 1]);\nline([5 5],[0 explained(5)],'Color','r')\nline([0 5],[explained(5) explained(5)],'Color','r')\ntitle('Cumulative sum of S^2 divided by sum of S^2')\nxlabel('Column')\nylabel('% variance explained')\n\n%%\n% You see that the first 5 columns explain 60% of variance. A rank-5\n% approximation will retain 60% of the information of the original matrix.\n% The |myLSA| class also provides |lowrank| that performs SVD and returns a\n% low rank approximation based on some criteria, such as number of columns\n% or the percentage of variance explained.\n%\n%   [Uk,Sk,Vk] = LSA.lowrank(tfidf,0.6);\n\n%% Visualize Online Dating Profiles\n% We can also use the first 2 columns to plot the SVD document matrix U and\n% SVD term matrix V in 2D space. The blue dots represent online dating\n% profiles and words around them are semantically associated to those\n% profiles.\n\nfigure()\nscatter(U(:,1), U(:,2),'filled')\ntitle('Online Dating Profiles and Words')\nxlabel('Dimension 1')\nylabel('Dimension 2')\nxlim([-.3 -.03]); ylim([-.2 .45])\nfor i = [1,4,9,12,15,16,20,22,23,24,25,27,29,33,34,35,38,47,48,53,57,58,...\n64,73,75,77,80,82,83,85,88,97,98,103,113,114,116,118,120,125,131,...\n136,142,143,156,161,162,166,174,181,185,187,199,200,204,206,212,...\n234,251]\ntext(V(i,1).*3, V(i,2).*3, LSA.vocab(i))\nend\ntext(-0.25,0.4,'Wholesome\/Sporty','FontSize', 12, 'Color', 'b')\ntext(-0.15,-0.15,'Bad Boy\/Colorful','FontSize', 12, 'Color', 'b')\n\n%%\n% You can see there are two main clusters - what I would call the\n% \"Wholesome\/Sporty\" cluster and one called the \"Bad Boy\/Colorful\" cluster,\n% based on the words associated with them. 
This makes sense, because Lauren\n% provides two options in her profile generator:\n%\n% * Typical inane jabber\n% * With a side of crazy sauce\n%\n% Can you guess which cluster belongs to which category?\n%\n% Now you can cluster a whole bunch of profiles at once and quickly\n% eliminate those that don't match your taste. You can also add your own\n% profile to see which cluster you belong to, and, if that puts you in the\n% wrong cluster, you may want to update your profile.\n%\n%% Computing Similarity\n% Say you find a cluster of profiles you are interested in. Among the\n% profiles you see there, which one is closest to your taste? To answer\n% this question, we need a way to define the similarity of two documents.\n% If you use the Euclidean distance between vectors, a long document and a\n% short one can be far apart even if they share many of the same words.\n% Instead, we can use the angle between the vectors to determine the\n% similarity. This is known as the\n% <http:\/\/en.wikipedia.org\/wiki\/Vector_space_model Vector Space Model>. 
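To see why the angle behaves better than the distance, here is a small self-contained sketch in Python with hypothetical word-count vectors (these are illustrative, not output of the |myLSA| class):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

short_doc = [1, 2, 0]     # word counts for a short profile
long_doc  = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 0, 5]     # no words in common with the first two

math.dist(short_doc, long_doc)   # large Euclidean distance despite identical content
cosine(short_doc, long_doc)      # ~1.0: same direction, maximal similarity
cosine(short_doc, other_doc)     # 0.0: orthogonal vectors, nothing shared
```

The Euclidean distance punishes the long document for its length, while the cosine sees that it points in exactly the same direction as the short one.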
For\n% ease of computation, the cosine of the angle is used as the similarity\n% measure.\n%\n%    cosine = dot(A,B)\/(norm(A)*norm(B))\n%\n% The greater the value, the more similar the two documents.\n%\n%     Angle     Cosine\n%  ___________  ______\n%   0 degrees      1\n%   90 degrees     0\n%  180 degrees    -1\n%\n% For a practical implementation, you can simply length-normalize the\n% vectors by the L2 norm and compute the dot product.\n%\n%    cosine = dot(A\/norm(A),B\/norm(B))\n%\n% You can apply the length normalization ahead of the similarity\n% computation. We will use the rank-5 approximation of the SVD document\n% matrix to compare online dating profiles using |normalize|.\n\ndoc_norm = LSA.normalize(U(:,1:5));\n\n%%\n% Now we can compute cosine similarities between profiles with |score|.\n% Let's compare the first profile to the first five profiles.\n\nLSA.score(doc_norm(1:5,:),doc_norm(1,:))\n\n%%\n% The first score is 1, which means a perfect match - that's because we are\n% comparing the first profile to itself. The other profiles got lower\n% scores depending on how similar they are to the first profile.\n%\n%% Getting the Ranked Matches\n% It would be useful to describe your ideal date and find the profiles that\n% match your description, ordered by similarity. It is a bit like a search\n% engine.\n%\n% To compare the new text string to the pre-computed matrix, we need to\n% apply the same pre-processing steps that we have already seen. 
|query|\n% can take care of the tedious details.\n\nq = 'someone fun to hang out with, good sense of humor, likes sushi, ';\nq = [q 'watches Game of Thrones, sees foreign films, listens to music, '];\nq = [q 'does outdoor activities or fitness'];\nweighted_q = LSA.query(q);\n\n%%\n% Now we need to transform the query vector into the rank-5 document space.\n% This is done by rearranging |M = U*S*V'| into |U = M*V*S^-1| and\n% substituting |M| with the query vector and |V| and |S| with their\n% low-rank approximations.\n\nq_reduced = weighted_q * V(:,1:5) * S(1:5,1:5)^-1;\n\n%%\n% The |myLSA| class also provides the |reduce| method to perform the same\n% operation.\n%\n%   q_reduced = LSA.reduce(weighted_q);\n\n%%\n% Then we can length-normalize the query vector and compute the dot\n% products with the documents. Let's sort the cosine similarities in\n% descending order and check the top 3 results.\nq_norm = LSA.normalize(q_reduced);\n[scores,idx] = sort(LSA.score(doc_norm,q_norm),'descend');\n\ndisp('Top 3 Profiles')\nfor i = 1:3\nprofiles.Profile(idx(i))\nend\n\n%%\n% Looks pretty reasonable to me!\n%\n% In this example, we applied TF-IDF weighting to both the document-term\n% frequency matrix and the query vector. However, you only need to apply\n% IDF once, to the query, to save computing resources. This approach is\n% known as _lnc.ltc_ in the\n% <http:\/\/en.wikipedia.org\/wiki\/SMART_Information_Retrieval_System SMART\n% notation system>. We already processed our query in _ltc_ format. Here is\n% how you do _lnc_ for your documents - you use just TF instead of TF-IDF:\n%\n%   [~, tf] = LSA.vectorize(profiles.Profile,2);\n%   doc_reduced = LSA.lowrank(tf,0.6);\n%   doc_norm = LSA.normalize(doc_reduced);\n\n%% What about Japanese Text?\n% Can my Japanese friends benefit from this technique? Yes, definitely.\n% Once you have the document-term frequency matrix, the rest is exactly the\n% same. 
The hardest part is tokenization, because there is no whitespace\n% between words in Japanese text.\n%\n% Fortunately, there are free tools to do just that - they are called\n% Japanese Morphological Analyzers. One of the most popular analyzers is\n% <http:\/\/mecab.googlecode.com\/svn\/trunk\/mecab\/doc\/index.html MeCab> (this\n% link goes to a Japanese page). A binary package is available for\n% installation on Windows, but it is 32-bit and doesn't work with 64-bit\n% MATLAB. My Japanese colleague,\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/2409625-takuya-otani\n% Takuya Otani>, compiled the source code so that it runs with 64-bit\n% MATLAB on Windows.\n%\n% MATLAB provides an interface to shared libraries like DLLs, and we can\n% use <https:\/\/www.mathworks.com\/help\/matlab\/ref\/loadlibrary.html\n% |loadlibrary|> to load them into memory and access the functions they\n% export. Here is an example of how to call the shared library\n% |libmecab.dll| that Takuya compiled.\n%\n% You may not have any particular need for handling Japanese text, but this\n% gives you a good example of how to load a DLL into MATLAB and call its\n% functions. Please note some requirements in case you want to try it:\n%\n% * Have a 64-bit Japanese Windows computer with 64-bit MATLAB\n% * Have a <https:\/\/www.mathworks.com\/support\/compilers\/\n% MATLAB-compatible compiler> installed and enabled on your computer\n% * Follow\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2015\/mecab_x64_build_procedure_for_MATLAB.pdf\n% Takuya's instructions> to compile your own 64-bit DLL\n% and place it in your current folder along with its header file.\n%\n%   loadlibrary('libmecab.dll', 'mecab.h');\n\n%%\n% When you run this command, you may get several warnings, but you can\n% ignore them. 
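If the load-then-call pattern is new to you, it may help to see the same idea in another environment. Python's standard ctypes module follows the same two steps - load a shared library, then call into it. The sketch below uses the C runtime library as a stand-in for |libmecab.dll|, since MeCab itself is not assumed to be installed:

```python
import ctypes
import ctypes.util

# Step 1: locate and load a shared library
# (the analogue of loadlibrary('libmecab.dll', 'mecab.h') in MATLAB)
path = ctypes.util.find_library("c")
libc = ctypes.CDLL(path) if path else ctypes.CDLL(None)

# Step 2: declare the function signature, then call it
# (the analogue of calllib('libmecab', 'mecab_sparse_tostr', ...))
libc.abs.restype = ctypes.c_int
libc.abs.argtypes = [ctypes.c_int]
result = libc.abs(-42)   # C library's integer absolute value
```

As in MATLAB, declaring the argument and return types up front is what lets the runtime marshal values correctly across the language boundary.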
If you would like to check that the library was loaded, use the\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/libfunctionsview.html\n% |libfunctionsview|> function to view the functions available in the DLL.\n%\n%   libfunctionsview('libmecab')\n\n%%\n% To call a function in the DLL, use\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/calllib.html |calllib|>.\n% In the case of MeCab, you need to initialize MeCab and obtain its pointer\n% first.\n%\n%   argv = libpointer('stringPtrPtr', {'MeCab'});\n%   argc = 1;\n%   mecab = calllib('libmecab', 'mecab_new', argc, argv);\n\n%%\n% As an example, let's call one of the MeCab functions you can use to\n% analyze Japanese text - |mecab_sparse_tostr|.\n%\n%   text = 'Some Japanese text';\n%   result = calllib('libmecab', 'mecab_sparse_tostr', mecab, text);\n\n%%\n% When finished, clear the pointer and unload the DLL from memory using\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/unloadlibrary.html\n% |unloadlibrary|>.\n%\n%   clearvars mecab\n%   unloadlibrary('libmecab')\n\n%% Call for Action\n% If you happen to be single and are willing to try the experiment\n% described in the New York Times article, please report back\n% <https:\/\/blogs.mathworks.com\/loren\/?p=1134#respond here> with your\n% results. The New York Times now provides\n% <http:\/\/www.nytimes.com\/2015\/02\/13\/style\/the-36-questions-on-the-way-to-love.html\n% a free app> to generate the 36 magical questions!\n\n##### SOURCE END ##### 8fc582e553824247b94d6c74851907b0\n--><\/p>\n","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/findingLoveUpdate2_02.png\" onError=\"this.style.display ='none';\" \/><\/div>\n<p>&#8230; <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\">read more 
>><\/a><\/p>\n","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[63,33,48,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1134"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1134"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1134\/revisions"}],"predecessor-version":[{"id":2118,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1134\/revisions\/2118"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1134"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1134"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1134"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}