{"id":2422,"date":"2017-09-21T08:32:29","date_gmt":"2017-09-21T13:32:29","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=2422"},"modified":"2021-09-13T14:22:15","modified_gmt":"2021-09-13T18:22:15","slug":"math-with-words-word-embeddings-with-matlab-and-text-analytics-toolbox","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2017\/09\/21\/math-with-words-word-embeddings-with-matlab-and-text-analytics-toolbox\/","title":{"rendered":"Math with Words &#8211; Word Embeddings with MATLAB and Text Analytics Toolbox"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>Text data has become an important part of <a href=\"https:\/\/www.mathworks.com\/solutions\/data-analytics.html\">data analytics<\/a>,  thanks to advances in natural language processing that transform unstructured text into meaningful data. The new <a href=\"https:\/\/www.mathworks.com\/products\/text-analytics.html\">Text Analytics Toolbox<\/a> provides tools to process and analyze text data in MATLAB.<\/p><p>Today's guest blogger, <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521\">Toshi Takeuchi<\/a> introduces some cool features available in the new toolbox, starting with <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/wordembedding.html\">word embeddings<\/a>. 
Check out how he uses <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sentiment_analysis\">sentiment analysis<\/a> to find <a href=\"https:\/\/www.airbnb.com\/locations\/boston\">good AirBnB locations to stay in Boston<\/a>!<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/wordcloud.png\" alt=\"\"> <\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#34dd85a0-738d-40e4-8a83-13d3e4d3c536\">What is a Word Embedding?<\/a><\/li><li><a href=\"#1f2eaf3a-6ca1-4c84-9c4a-930a2e648004\">Ingredients<\/a><\/li><li><a href=\"#1547795e-790c-4fb2-a65e-19121faa5629\">Loading a Pre-Trained Word Embedding from GloVe<\/a><\/li><li><a href=\"#3462323d-fcdd-4b4b-842e-ed1bc584e551\">Vector Math Example<\/a><\/li><li><a href=\"#df19770c-6485-484c-849c-e1fa1f80954d\">Visualizing the Word Embedding<\/a><\/li><li><a href=\"#a264f337-6ea3-435b-8659-441fb3f040bd\">Using Word Embeddings for Sentiment Analysis<\/a><\/li><li><a href=\"#f90de3bd-4be1-40d4-8eae-b7d00830b39d\">Word Embeddings Meet Machine Learning<\/a><\/li><li><a href=\"#71b2ffba-d52c-45c4-bf5e-6bc01138ae87\">Prepare Data for Machine Learning<\/a><\/li><li><a href=\"#dc411517-8fc1-4377-95d4-17b3697873b9\">Training and Evaluating the Sentiment Classifier<\/a><\/li><li><a href=\"#6df9e54e-2be8-4aab-829f-5cd08115988d\">Boston Airbnb Open Data<\/a><\/li><li><a href=\"#4afa4c4a-9bf5-4ac7-8fbc-f93521092483\">Airbnb Review Ratings<\/a><\/li><li><a href=\"#45c0bbe3-5891-4696-8e08-59f899019cc8\">Computing Sentiment Scores<\/a><\/li><li><a href=\"#3aed8487-fe5b-4009-9eec-9e43fcfa643e\">Sentiment by Location<\/a><\/li><li><a href=\"#07e0d6bc-024c-4b21-b0db-51a6e3cc5ffc\">Summary<\/a><\/li><\/ul><\/div><h4>What is a Word Embedding?<a name=\"34dd85a0-738d-40e4-8a83-13d3e4d3c536\"><\/a><\/h4><p>Have you heard about <a href=\"https:\/\/code.google.com\/archive\/p\/word2vec\/\">word2vec<\/a> or <a href=\"https:\/\/nlp.stanford.edu\/projects\/glove\/\">GloVe<\/a>? 
These are part of a very powerful natural language processing technique called word embeddings, and you can now take advantage of it in MATLAB via Text Analytics Toolbox.<\/p><p>Why am I excited about it? It \"embeds\" words into a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Vector_space_model\">vector space model<\/a> based on how often a word appears close to other words. When this is done at internet scale, the vectors can capture the semantics of the words, so that similar words have similar vectors.<\/p><p>One very famous example of how word embeddings can represent such relationships is that you can do a vector computation like this:<\/p><p>$$king - man + woman \\approx queen$$<\/p><p>Yes, \"queen\" is like \"king\" except that it is a woman, rather than a man! How cool is that? This kind of magic has become possible thanks to the vast availability of raw text data on the internet, <a href=\"https:\/\/www.mathworks.com\/solutions\/big-data-matlab.html\">greater computing capability that can process it<\/a>, and advances in artificial neural networks, such as <a href=\"https:\/\/www.mathworks.com\/discovery\/deep-learning.html\">deep learning<\/a>.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/vecs.png\" alt=\"\"> <\/p><p>Even more exciting is the fact that you don't have to be a natural language processing expert to harness the power of word embeddings if you use pre-trained models!
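<\/p><p>To make the arithmetic concrete, here is a minimal sketch, in plain Python with a tiny made-up embedding (not the real GloVe vectors), of what an analogy query does under the hood: subtract and add the word vectors element-wise, then return the vocabulary word whose vector is most similar to the result by cosine similarity.<\/p>

```python
# A tiny made-up embedding (NOT the GloVe data) just to show the mechanics
# of analogy arithmetic. "throne" and "banana" act as distractor words.
import math

embedding = {
    "king":   [0.8, 0.9, 0.1],
    "queen":  [0.8, 0.1, 0.9],
    "man":    [0.3, 0.9, 0.1],
    "woman":  [0.3, 0.1, 0.9],
    "throne": [0.7, 0.5, 0.5],
    "banana": [0.9, 0.2, 0.2],
}

def cosine(a, b):
    # Cosine similarity: dot product over the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_word(vec, exclude=()):
    # Lookup step: the vocabulary word closest to vec,
    # skipping the words used in the query itself.
    candidates = {w: v for w, v in embedding.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vec, candidates[w]))

# king - man + woman, computed element-wise
query = [k - m + w for k, m, w in
         zip(embedding["king"], embedding["man"], embedding["woman"])]
print(nearest_word(query, exclude={"king", "man", "woman"}))  # -> queen
```

<p>With the real 300-dimensional GloVe vectors the lookup works the same way, just over a 400,000-word vocabulary.<\/p><p>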
Let me show you how you can use it for your own text analytics purposes, such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Document_classification\">document classification<\/a>, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Information_retrieval\">information retrieval<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sentiment_analysis\">sentiment analysis<\/a>.<\/p><h4>Ingredients<a name=\"1f2eaf3a-6ca1-4c84-9c4a-930a2e648004\"><\/a><\/h4><p>In this example, I will use a pre-trained word embedding from <a href=\"https:\/\/nlp.stanford.edu\/projects\/glove\/\">GloVe<\/a>. To follow along, please<\/p><div><ul><li>Get the source code of this post by clicking on \"Get the MATLAB code\" at the bottom of this page<\/li><li>Download a <a href=\"https:\/\/www.mathworks.com\/campaigns\/products\/trials.html\">free trial version<\/a> of Text Analytics Toolbox (MATLAB and Statistics and Machine Learning Toolbox R2017b or later are also required).<\/li><li>Download the pre-trained model <a href=\"http:\/\/nlp.stanford.edu\/data\/glove.6B.zip\">glove.6B.300d.txt<\/a> (6 billion tokens, 400K vocabulary, 300 dimensions) from <a href=\"https:\/\/nlp.stanford.edu\/projects\/glove\/\">GloVe<\/a>.<\/li><li>Download the <a href=\"http:\/\/www.cs.uic.edu\/~liub\/FBS\/opinion-lexicon-English.rar\">sentiment lexicon<\/a> from <a href=\"https:\/\/www.cs.uic.edu\/~liub\/FBS\/sentiment-analysis.html#lexicon\">University of Illinois at Chicago<\/a><\/li><li>Download the data from the <a href=\"https:\/\/www.kaggle.com\/airbnb\/boston\">Boston Airbnb Open Data page<\/a> on <a href=\"https:\/\/www.kaggle.com\">Kaggle<\/a><\/li><li>Download my custom function <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/load_lexicon.m\">load_lexicon.m<\/a> and class <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/sentiment.m\">sentiment.m<\/a> as well as <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/boston_map.mat\">the raster map of 
Boston<\/a><\/li><\/ul><\/div><p>Please extract the content from the archive files into your current folder.<\/p><h4>Loading a Pre-Trained Word Embedding from GloVe<a name=\"1547795e-790c-4fb2-a65e-19121faa5629\"><\/a><\/h4><p>You can use the function <tt><a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/readwordembedding.html\">readWordEmbedding<\/a><\/tt> in Text Analytics Toolbox to read pre-trained word embeddings. To see a word vector, use <tt><a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/wordembedding.word2vec.html\">word2vec<\/a><\/tt> to get the vector representation of a given word. Because the dimension for this embedding is 300, we get a vector of 300 elements for each word.<\/p><pre class=\"codeinput\">filename = <span class=\"string\">\"glove.6B.300d\"<\/span>;\r\n<span class=\"keyword\">if<\/span> exist(filename + <span class=\"string\">'.mat'<\/span>, <span class=\"string\">'file'<\/span>) ~= 2\r\n    emb = readWordEmbedding(filename + <span class=\"string\">'.txt'<\/span>);\r\n    save(filename + <span class=\"string\">'.mat'<\/span>, <span class=\"string\">'emb'<\/span>, <span class=\"string\">'-v7.3'<\/span>);\r\n<span class=\"keyword\">else<\/span>\r\n    load(filename + <span class=\"string\">'.mat'<\/span>)\r\n<span class=\"keyword\">end<\/span>\r\nv_king = word2vec(emb,<span class=\"string\">'king'<\/span>)';\r\nwhos <span class=\"string\">v_king<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name          Size            Bytes  Class     Attributes\r\n\r\n  v_king      300x1              1200  single              \r\n\r\n<\/pre><h4>Vector Math Example<a name=\"3462323d-fcdd-4b4b-842e-ed1bc584e551\"><\/a><\/h4><p>Let's try the vector math! Here is another famous example:<\/p><p>$$paris - france + poland \\approx warsaw$$<\/p><p>Apparently, the vector subtraction \"paris - france\" encodes the concept of \"capital\" and if you add \"poland\", you get \"warsaw\".<\/p><p>Let's try it with MATLAB. 
<tt>word2vec<\/tt> returns vectors for given words in the word embedding, and <tt><a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/wordembedding.vec2word.html\">vec2word<\/a><\/tt> finds the closest words to the vectors in the word embedding.<\/p><pre class=\"codeinput\">v_paris = word2vec(emb,<span class=\"string\">'paris'<\/span>);\r\nv_france = word2vec(emb,<span class=\"string\">'france'<\/span>);\r\nv_poland = word2vec(emb,<span class=\"string\">'poland'<\/span>);\r\nvec2word(emb, v_paris - v_france + v_poland)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    \"warsaw\"\r\n<\/pre><h4>Visualizing the Word Embedding<a name=\"df19770c-6485-484c-849c-e1fa1f80954d\"><\/a><\/h4><p>We would like to visualize this word embedding using a <tt><a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/textscatter.html\">textscatter<\/a><\/tt> plot, but the plot is hard to read if all 400,000 words from the word embedding are included. I found a list of 4,000 English nouns. Let's use those words only and reduce the dimensions from 300 to 2 using <tt><a href=\"https:\/\/www.mathworks.com\/help\/stats\/tsne.html\">tsne<\/a><\/tt> (t-Distributed Stochastic Neighbor Embedding). To make it easier to see the words, I zoomed into a specific area of the plot that contains food-related words.
You can see that related words are placed close together.<\/p><pre class=\"codeinput\"><span class=\"keyword\">if<\/span> exist(<span class=\"string\">'nouns.mat'<\/span>,<span class=\"string\">'file'<\/span>) ~= 2\r\n    url = <span class=\"string\">'http:\/\/www.desiquintans.com\/downloads\/nounlist\/nounlist.txt'<\/span>;\r\n    nouns = webread(url);\r\n    nouns = split(nouns);\r\n    save(<span class=\"string\">'nouns.mat'<\/span>,<span class=\"string\">'nouns'<\/span>);\r\n<span class=\"keyword\">else<\/span>\r\n    load(<span class=\"string\">'nouns.mat'<\/span>)\r\n<span class=\"keyword\">end<\/span>\r\nnouns(~ismember(nouns,emb.Vocabulary)) = [];\r\nvec = word2vec(emb,nouns);\r\nrng(<span class=\"string\">'default'<\/span>); <span class=\"comment\">% for reproducibility<\/span>\r\nxy = tsne(vec);\r\n\r\nfigure\r\ntextscatter(xy,nouns)\r\ntitle(<span class=\"string\">'GloVe Word Embedding (6B.300d) - Food Related Area'<\/span>)\r\naxis([-35 -10 -36 -14]);\r\nset(gca,<span class=\"string\">'clipping'<\/span>,<span class=\"string\">'off'<\/span>)\r\naxis <span class=\"string\">off<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/glove_airbnb2_sbd_01.png\" alt=\"\"> <h4>Using Word Embeddings for Sentiment Analysis<a name=\"a264f337-6ea3-435b-8659-441fb3f040bd\"><\/a><\/h4><p>For a practical application of word embeddings, let's consider sentiment analysis. We would typically take advantage of pre-existing sentiment lexicons such as <a href=\"https:\/\/www.cs.uic.edu\/~liub\/FBS\/sentiment-analysis.html#lexicon\">this one from the University of Illinois at Chicago<\/a>. It comes with 2,006 positive words and 4,783 negative words. Let's load the lexicon using the custom function <tt>load_lexicon<\/tt>.<\/p><p>If we just rely on the available words in the lexicon, we can only score sentiment for 6,789 words. 
One idea to expand on this is to use the word embedding to find words that are close to these sentiment words.<\/p><pre class=\"codeinput\">pos = load_lexicon(<span class=\"string\">'positive-words.txt'<\/span>);\r\nneg = load_lexicon(<span class=\"string\">'negative-words.txt'<\/span>);\r\n[length(pos) length(neg)]\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n        2006        4783\r\n<\/pre><h4>Word Embeddings Meet Machine Learning<a name=\"f90de3bd-4be1-40d4-8eae-b7d00830b39d\"><\/a><\/h4><p>What if we use word vectors as the training data to develop a classifier that can score all words in the 400,000-word embedding? We can take advantage of the fact that related words are close together in word embeddings to do this. Let's make a sentiment classifier that takes advantage of the vectors from the word embedding.<\/p><p>As the first step, we will get vectors from the word embedding for words in the lexicon to create a matrix of predictors with 300 columns, and then use positive or negative sentiment labels as the response variable. 
Here is the preview of the word, response variable and the first 7 predictor variables out of 300.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Drop words not in the embedding<\/span>\r\npos = pos(ismember(pos,emb.Vocabulary));\r\nneg = neg(ismember(neg,emb.Vocabulary));\r\n\r\n<span class=\"comment\">% Get corresponding word vectors<\/span>\r\nv_pos = word2vec(emb,pos);\r\nv_neg = word2vec(emb,neg);\r\n\r\n<span class=\"comment\">% Initialize the table and add the data<\/span>\r\ndata = table;\r\ndata.word = [pos;neg];\r\npred = [v_pos;v_neg];\r\ndata = [data array2table(pred)];\r\ndata.resp = zeros(height(data),1);\r\ndata.resp(1:length(pos)) = 1;\r\n\r\n<span class=\"comment\">% Preview the table<\/span>\r\nhead(data(:,[1,end,2:8 ]))\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n  8&times;9 table\r\n        word         resp      pred1        pred2        pred3       pred4        pred5         pred6        pred7   \r\n    _____________    ____    _________    _________    _________    ________    __________    _________    __________\r\n    \"abound\"         1        0.081981     -0.27295      0.32238     0.19932      0.099266      0.60253       0.18819\r\n    \"abounds\"        1       -0.037126     0.085212      0.26952     0.20927     -0.014547      0.52336       0.11287\r\n    \"abundance\"      1       -0.038408     0.076613    -0.094277    -0.10652      -0.43257      0.74405       0.41298\r\n    \"abundant\"       1        -0.29317    -0.068101     -0.44659    -0.31563      -0.13791      0.44888       0.31894\r\n    \"accessible\"     1        -0.45096     -0.46794      0.11761    -0.70256       0.19879      0.44775       0.26262\r\n    \"acclaim\"        1         0.07426     -0.11164       0.3615     -0.4499    -0.0061991      0.44146    -0.0067972\r\n    \"acclaimed\"      1         0.69129      0.04812      0.29267      0.1242      0.083869      0.25791       -0.5444\r\n    \"acclamation\"    1       -0.026593     -0.60759     -0.15785     
0.36048      -0.45289    0.0092178      0.074671\r\n<\/pre><h4>Prepare Data for Machine Learning<a name=\"71b2ffba-d52c-45c4-bf5e-6bc01138ae87\"><\/a><\/h4><p>Let's partition the data into a training set and a holdout set for performance evaluation. The holdout set contains 30% of the available data.<\/p><pre class=\"codeinput\">rng(<span class=\"string\">'default'<\/span>) <span class=\"comment\">% for reproducibility<\/span>\r\nc = cvpartition(data.resp,<span class=\"string\">'Holdout'<\/span>,0.3);\r\ntrain = data(training(c),2:end);\r\nXtest = data(test(c),2:end-1);\r\nYtest = data.resp(test(c));\r\nLtest = data(test(c),1);\r\nLtest.label = Ytest;\r\n<\/pre><h4>Training and Evaluating the Sentiment Classifier<a name=\"dc411517-8fc1-4377-95d4-17b3697873b9\"><\/a><\/h4><p>We want to build a classifier that can separate positive words and negative words in the vector space defined by the word embedding. For a quick performance evaluation, I chose a linear discriminant from among the possible machine learning algorithms because it is fast and easy to train.<\/p><p>Here is the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Confusion_matrix\">confusion matrix<\/a> of this model. The result was 91.1% classification accuracy.
Not bad.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Train model<\/span>\r\nmdl = fitcdiscr(train,<span class=\"string\">'resp'<\/span>);\r\n\r\n<span class=\"comment\">% Predict on test data<\/span>\r\nYpred = predict(mdl,Xtest);\r\ncf = confusionmat(Ytest,Ypred);\r\n\r\n<span class=\"comment\">% Display results<\/span>\r\nfigure\r\nvals = {<span class=\"string\">'Negative'<\/span>,<span class=\"string\">'Positive'<\/span>};\r\nheatmap(vals,vals,cf);\r\nxlabel(<span class=\"string\">'Predicted Label'<\/span>)\r\nylabel(<span class=\"string\">'True Label'<\/span>)\r\ntitle({<span class=\"string\">'Confusion Matrix of Linear Discriminant'<\/span>; <span class=\"keyword\">...<\/span>\r\n    sprintf(<span class=\"string\">'Classification Accuracy %.1f%%'<\/span>, <span class=\"keyword\">...<\/span>\r\n    sum(cf(logical(eye(2))))\/sum(sum(cf))*100)})\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/glove_airbnb2_sbd_02.png\" alt=\"\"> <p>Let's check the predicted sentiment score against the actual label. The custom class <tt>sentiment<\/tt> uses the linear discriminant model to score sentiment.<\/p><p>The <tt>scoreWords<\/tt> method of the class scores words. A positive score represents positive sentiment, and a negative score is negative. 
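<\/p><p>The idea behind this scoring is not specific to any toolbox: any classifier that returns class probabilities can produce a signed sentiment score as P(positive) - P(negative), which always lands in [-1, 1]. Here is a minimal Python sketch of that idea using hypothetical toy vectors and a simple nearest-centroid pseudo-classifier (not the linear discriminant model trained above):<\/p>

```python
# Sketch of probability-difference scoring with made-up 2-D "word vectors"
# (labels: 1 = positive, 0 = negative). The classifier here is a toy
# nearest-centroid model, standing in for any probabilistic classifier.
import math

train = {"good": ([1.0, 0.9], 1), "great": ([0.9, 1.0], 1),
         "bad": ([-1.0, -0.8], 0), "awful": ([-0.9, -1.1], 0)}

def centroid(label):
    # Mean vector of the training words carrying this label.
    vecs = [v for v, y in train.values() if y == label]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

POS, NEG = centroid(1), centroid(0)

def score(vec):
    # Softmax over negative squared distances -> pseudo-probabilities,
    # then the signed score P(positive) - P(negative).
    d_pos = sum((a - b) ** 2 for a, b in zip(vec, POS))
    d_neg = sum((a - b) ** 2 for a, b in zip(vec, NEG))
    p_pos = math.exp(-d_pos) / (math.exp(-d_pos) + math.exp(-d_neg))
    return p_pos - (1 - p_pos)

print(round(score([0.8, 0.8]), 3))    # near the positive centroid -> score > 0
print(round(score([-0.7, -0.9]), 3))  # near the negative centroid -> score < 0
```

<p>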
Now we can use 400,000 words to score sentiment.<\/p><pre class=\"codeinput\">dbtype <span class=\"string\">sentiment.m<\/span> <span class=\"string\">18:26<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n18            function scores = scoreWords(obj,words)\r\n19                %SCOREWORDS scores sentiment of words\r\n20                vec = word2vec(obj.emb,words);          % word vectors\r\n21                if size(vec,2) ~= obj.emb.Dimension     % check num cols\r\n22                    vec =  vec';                        % transpose as needed\r\n23                end\r\n24                [~,scores,~] = predict(obj.mdl,vec);    % get class probabilities\r\n25                scores = scores(:,2) - scores(:,1);     % positive scores - negative scores\r\n26            end\r\n<\/pre><p>Let's test this custom class. If the label is 0 and score is negative or the label is 1 and score is positive, then the model classified the word correctly. Otherwise, the word was misclassified.<\/p><p>Here is the table that shows 10 examples from the test set:<\/p><div><ul><li>the word<\/li><li>its sentiment label (0 = negative, 1 = positive)<\/li><li>its sentiment score (negative = negative, positive = positive)<\/li><li>evaluation (true = correct, false = incorrect)<\/li><\/ul><\/div><pre class=\"codeinput\">sent = sentiment(emb,mdl);\r\nLtest.score = sent.scoreWords(Ltest.word);\r\nLtest.eval = Ltest.score &gt; 0 == Ltest.label;\r\ndisp(Ltest(randsample(height(Ltest),10),:))\r\n<\/pre><pre class=\"codeoutput\">        word         label     score      eval \r\n    _____________    _____    ________    _____\r\n    \"fugitive\"       0        -0.90731    true \r\n    \"misfortune\"     0        -0.98667    true \r\n    \"outstanding\"    1         0.99999    true \r\n    \"reluctant\"      0        -0.99694    true \r\n    \"botch\"          0        -0.99957    true \r\n    \"carefree\"       1         0.97568    true \r\n    \"mesmerize\"      1          0.4801    true \r\n    
\"slug\"           0        -0.88944    true \r\n    \"angel\"          1         0.43419    true \r\n    \"wheedle\"        0        -0.98412    true \r\n<\/pre><p>Now we need a way to score the sentiment of human-language text, rather than a single word. The <tt>scoreText<\/tt> method of the sentiment class averages the sentiment scores of each word in the text. This may not be the best way to do it, but it's a simple place to start.<\/p><pre class=\"codeinput\">dbtype <span class=\"string\">sentiment.m<\/span> <span class=\"string\">28:33<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n28            function score = scoreText(obj,text)\r\n29                %SCORETEXT scores sentiment of text\r\n30                tokens = split(lower(text));            % split text into tokens\r\n31                scores = obj.scoreWords(tokens);        % get score for each token\r\n32                score = mean(scores,'omitnan');         % average scores\r\n33            end\r\n<\/pre><p>Here are the sentiment scores on sentences given by the <tt>scoreText<\/tt> method - very positive, somewhat positive, and negative.<\/p><pre class=\"codeinput\">[sent.scoreText(<span class=\"string\">'this is fantastic'<\/span>) <span class=\"keyword\">...<\/span>\r\nsent.scoreText(<span class=\"string\">'this is okay'<\/span>) <span class=\"keyword\">...<\/span>\r\nsent.scoreText(<span class=\"string\">'this sucks'<\/span>)]\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n      0.91458      0.80663    -0.073585\r\n<\/pre><h4>Boston Airbnb Open Data<a name=\"6df9e54e-2be8-4aab-829f-5cd08115988d\"><\/a><\/h4><p>Let's try this on review data from the Boston Airbnb Open Data page on Kaggle. First, we would like to see what people say in their reviews as a <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/ldamodel.wordcloud.html\">word cloud<\/a>. 
Text Analytics Toolbox provides functionality to simplify text preprocessing workflows, such as <tt><a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/tokenizeddocument.html\">tokenizedDocument<\/a><\/tt> which parses documents into an array of tokens, and <tt><a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/bagofwords.html\">bagOfWords<\/a><\/tt> that generates the term frequency count model (this can be used to build a machine learning model).<\/p><p>The commented-out code will generate the word cloud shown at the top of this post. However, you can also generate word clouds using two-word phrases known as bigrams. You can generate bigrams with <tt><a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/tokenizeddocument.docfun.html\">docfun<\/a><\/tt>, which operates on the array of tokens. You can also see that it is possible to generate trigrams and other <a href=\"https:\/\/en.wikipedia.org\/wiki\/N-gram\">n-grams<\/a> by modifying the function handle.<\/p><p>It seems a lot of comments were about locations!<\/p><pre class=\"codeinput\">opts = detectImportOptions(<span class=\"string\">'listings.csv'<\/span>);\r\nl = readtable(<span class=\"string\">'listings.csv'<\/span>,opts);\r\nreviews = readtable(<span class=\"string\">'reviews.csv'<\/span>);\r\ncomments = tokenizedDocument(reviews.comments);\r\ncomments = lower(comments);\r\ncomments = removeWords(comments,stopWords);\r\ncomments = removeShortWords(comments,2);\r\ncomments = erasePunctuation(comments);\r\n\r\n<span class=\"comment\">% == uncomment to generate a word cloud ==<\/span>\r\n<span class=\"comment\">% bag = bagOfWords(comments);<\/span>\r\n<span class=\"comment\">% figure<\/span>\r\n<span class=\"comment\">% wordcloud(bag);<\/span>\r\n<span class=\"comment\">% title('AirBnB Review Word Cloud')<\/span>\r\n\r\n<span class=\"comment\">% Generate a Bigram Word Cloud<\/span>\r\nf = @(s)s(1:end-1) + <span class=\"string\">\" \"<\/span> + s(2:end);\r\nbigrams = 
docfun(f,comments);\r\nbag2 = bagOfWords(bigrams);\r\nfigure\r\nwordcloud(bag2);\r\ntitle(<span class=\"string\">'AirBnB Review Bigram Cloud'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/glove_airbnb2_sbd_03.png\" alt=\"\"> <h4>Airbnb Review Ratings<a name=\"4afa4c4a-9bf5-4ac7-8fbc-f93521092483\"><\/a><\/h4><p>Review ratings are also available, but the ratings are heavily skewed towards 100, meaning the vast majority of listings are just perfectly wonderful (really?). As <a href=\"https:\/\/xkcd.com\/1098\/\">this XKCD comic<\/a> shows, these review ratings suffer from <a href=\"http:\/\/sloanreview.mit.edu\/article\/the-problem-with-online-ratings-2\/\">the problem with online ratings<\/a>. This is not very useful.<\/p><pre class=\"codeinput\">figure\r\nhistogram(l.review_scores_rating)\r\ntitle(<span class=\"string\">'Distribution of AirBnB Review Ratings'<\/span>)\r\nxlabel(<span class=\"string\">'Review Ratings'<\/span>)\r\nylabel(<span class=\"string\">'# Listings'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/glove_airbnb2_sbd_04.png\" alt=\"\"> <h4>Computing Sentiment Scores<a name=\"45c0bbe3-5891-4696-8e08-59f899019cc8\"><\/a><\/h4><p>Now let's score the sentiment of the Airbnb listing reviews instead. Since a listing can have multiple reviews, I use the median sentiment score per listing. The median sentiment scores in Boston are generally in the positive range, and they follow a roughly normal distribution.
This looks more realistic.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Score the reviews<\/span>\r\nf = @(str) sent.scoreText(str);\r\nreviews.sentiment = cellfun(f,reviews.comments);\r\n\r\n<span class=\"comment\">% Calculate the median review score by listing<\/span>\r\n[G,listings] = findgroups(reviews(:,<span class=\"string\">'listing_id'<\/span>));\r\nlistings.sentiment = splitapply(@median, <span class=\"keyword\">...<\/span>\r\n    reviews.sentiment,G);\r\n\r\n<span class=\"comment\">% Visualize the results<\/span>\r\nfigure\r\nhistogram(listings.sentiment)\r\ntitle(<span class=\"string\">'Sentiment by Boston AirBnB Listing'<\/span>)\r\nxlabel(<span class=\"string\">'Median Sentiment Score'<\/span>)\r\nylabel(<span class=\"string\">'Number of Listings'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/glove_airbnb2_sbd_05.png\" alt=\"\"> <h4>Sentiment by Location<a name=\"3aed8487-fe5b-4009-9eec-9e43fcfa643e\"><\/a><\/h4><p>The bigram cloud showed reviewers often commented on location and distance. You can use latitude and longitude of the listings to see where listings with very high or low sentiment scores are located. 
If you see clusters of high scores, perhaps they may indicate good locations to stay in.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Join sentiment scores and listing info<\/span>\r\njoined = innerjoin( <span class=\"keyword\">...<\/span>\r\n    listings,l(:,{<span class=\"string\">'id'<\/span>,<span class=\"string\">'latitude'<\/span>,<span class=\"string\">'longitude'<\/span>, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'neighbourhood_cleansed'<\/span>}), <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'LeftKeys'<\/span>,<span class=\"string\">'listing_id'<\/span>,<span class=\"string\">'RightKeys'<\/span>,<span class=\"string\">'id'<\/span>);\r\njoined.Properties.VariableNames{end} = <span class=\"string\">'ngh'<\/span>;\r\n\r\n<span class=\"comment\">% Discard listings with a NaN sentiment score<\/span>\r\njoined(isnan(joined.sentiment),:) = [];\r\n\r\n<span class=\"comment\">% Discretize the sentiment scores into buckets<\/span>\r\njoined.cat = discretize(joined.sentiment,0:0.25:1, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'categorical'<\/span>,{<span class=\"string\">'&lt; 0.25'<\/span>,<span class=\"string\">'&lt; 0.50'<\/span>,<span class=\"string\">'&lt; 0.75'<\/span>,<span class=\"string\">'&lt;=1.00'<\/span>});\r\n\r\n<span class=\"comment\">% Remove undefined categories<\/span>\r\ncats = categories(joined.cat);\r\njoined(isundefined(joined.cat),:) = [];\r\n\r\n<span class=\"comment\">% Variable for color<\/span>\r\ncolorlist = winter(length(cats));\r\n\r\n<span class=\"comment\">% Generate the plot<\/span>\r\nlatlim = [42.300 42.386];\r\nlonlim = [-71.1270 -71.0174];\r\nload <span class=\"string\">boston_map.mat<\/span>\r\nfigure\r\nimagesc(lonlim,latlim, map)\r\nhold <span class=\"string\">on<\/span>\r\ngscatter(joined.longitude,joined.latitude,joined.cat,colorlist,<span class=\"string\">'o'<\/span>)\r\nhold <span class=\"string\">off<\/span>\r\ndar = [1, cosd(mean(latlim)), 
1];\r\ndaspect(dar)\r\nset(gca,<span class=\"string\">'ydir'<\/span>,<span class=\"string\">'normal'<\/span>);\r\naxis([lonlim,latlim])\r\ntitle(<span class=\"string\">'Sentiment Scores by Boston Airbnb Listing'<\/span>)\r\n[g,ngh] = findgroups(joined(:,<span class=\"string\">'ngh'<\/span>));\r\nngh.Properties.VariableNames{end} = <span class=\"string\">'name'<\/span>;\r\nngh.lat = splitapply(@mean,joined.latitude,g);\r\nngh.lon = splitapply(@mean,joined.longitude,g);\r\n\r\n<span class=\"comment\">% Annotations<\/span>\r\ntext(ngh.lon(2),ngh.lat(2),ngh.name(2),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'w'<\/span>)\r\ntext(ngh.lon(4),ngh.lat(4),ngh.name(4),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'w'<\/span>)\r\ntext(ngh.lon(6),ngh.lat(6),ngh.name(6),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'w'<\/span>)\r\ntext(ngh.lon(11),ngh.lat(11),ngh.name(11),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'w'<\/span>)\r\ntext(ngh.lon(13),ngh.lat(13),ngh.name(13),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'w'<\/span>)\r\ntext(ngh.lon(17),ngh.lat(17),ngh.name(17),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'w'<\/span>)\r\ntext(ngh.lon(18),ngh.lat(18),ngh.name(18),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'w'<\/span>)\r\ntext(ngh.lon(22),ngh.lat(22),ngh.name(22),<span class=\"string\">'Color'<\/span>,<span class=\"string\">'w'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/glove_airbnb2_sbd_06.png\" alt=\"\"> <h4>Summary<a name=\"07e0d6bc-024c-4b21-b0db-51a6e3cc5ffc\"><\/a><\/h4><p>In this post, I focused on word embeddings and sentiment analysis as an example of new features available in Text Analytics Toolbox. Hopefully you saw that the toolbox makes advanced text processing techniques very accessible. 
You can do more with word embeddings besides sentiment analysis, and the toolbox offers many more features besides word embeddings, such as <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/lsamodel.html\">Latent Semantic Analysis<\/a> or <a href=\"https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/ldamodel.html\">Latent Dirichlet Allocation<\/a>.<\/p><p>Hopefully I have more opportunities to discuss those other interesting features in Text Analytics Toolbox in the future.<\/p><p>Get a free trial version to play with it and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=2422#respond\">here<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_755624bb94664840921f2c1b9f97b830() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='755624bb94664840921f2c1b9f97b830 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 755624bb94664840921f2c1b9f97b830';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_755624bb94664840921f2c1b9f97b830()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017b<br><\/p><\/div><!--\r\n755624bb94664840921f2c1b9f97b830 ##### SOURCE BEGIN #####\r\n%% Math with Words - Word Embeddings with MATLAB and Text Analytics Toolbox\r\n% Text data has become an important part of\r\n% <https:\/\/www.mathworks.com\/solutions\/data-analytics.html data analytics>,\r\n%  thanks to advances in natural language processing that transform\r\n% unstructured text into meaningful data. The new\r\n% <https:\/\/www.mathworks.com\/products\/text-analytics.html Text Analytics\r\n% Toolbox> provides tools to process and analyze text data in\r\n% MATLAB.\r\n% \r\n% Today's guest blogger,\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521 Toshi\r\n% Takeuchi> introduces some cool features available in the new toolbox,\r\n% starting with\r\n% <https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/wordembedding.html word\r\n% embeddings>. 
Check out how he uses\r\n% <https:\/\/en.wikipedia.org\/wiki\/Sentiment_analysis sentiment analysis> to\r\n% find <https:\/\/www.airbnb.com\/locations\/boston good AirBnB locations to\r\n% stay in Boston>!\r\n% \r\n% <<wordcloud.png>>\r\n% \r\n%% What is a Word Embedding?\r\n% Have you heard about <https:\/\/code.google.com\/archive\/p\/word2vec\/\r\n% word2vec> or <https:\/\/nlp.stanford.edu\/projects\/glove\/ GloVe>? These are\r\n% part of a very powerful natural language processing technique called word\r\n% embeddings, and you can now take advantage of them in MATLAB via Text\r\n% Analytics Toolbox. \r\n%\r\n% Why am I excited about it? It \"embeds\" words into a\r\n% <https:\/\/en.wikipedia.org\/wiki\/Vector_space_model vector space model>\r\n% based on how often a word appears close to other words. Done at an\r\n% internet scale, you can attempt to capture the semantics of the words in\r\n% the vectors, so that similar words have similar vectors.\r\n% \r\n% One very famous example of how word embeddings can represent such a\r\n% relationship is that you can do a vector computation like this:\r\n% \r\n% $$king - man + woman \\approx queen$$\r\n% \r\n% Yes, \"queen\" is like \"king\" except that it is a woman, rather than a man!\r\n% How cool is that? This kind of magic has become possible thanks to the\r\n% vast availability of raw text data on the internet,\r\n% <https:\/\/www.mathworks.com\/solutions\/big-data-matlab.html greater\r\n% computing capability that can process it>, and advances in artificial\r\n% neural networks, such as\r\n% <https:\/\/www.mathworks.com\/discovery\/deep-learning.html deep learning>.\r\n% \r\n% <<vecs.png>>\r\n% \r\n% Even more exciting is the fact that you don't have to be a natural\r\n% language processing expert to harness the power of word embeddings if you\r\n% use pre-trained models! 
Let me show you how you can use it for your own\r\n% text analytics purposes, such as\r\n% <https:\/\/en.wikipedia.org\/wiki\/Document_classification document\r\n% classification>, <https:\/\/en.wikipedia.org\/wiki\/Information_retrieval\r\n% information retrieval> and\r\n% <https:\/\/en.wikipedia.org\/wiki\/Sentiment_analysis sentiment analysis>.\r\n% \r\n%% Ingredients\r\n% In this example, I will use a pre-trained word embedding from <https:\/\/nlp.stanford.edu\/projects\/glove\/\r\n% GloVe>. To follow along, please\r\n% \r\n% * Get the source code of this post by clicking on \"Get the MATLAB code\" at the\r\n% bottom of this page\r\n% * Download a\r\n% <https:\/\/www.mathworks.com\/programs\/trials\/trial_request.html?prodcode=TA\r\n% free trial version> of Text Analytics Toolbox (MATLAB and Statistics\r\n% and Machine Learning Toolbox R2017b or later are also required).\r\n% * Download the pre-trained model <http:\/\/nlp.stanford.edu\/data\/glove.6B.zip\r\n% glove.6B.300d.txt> (6 billion tokens, 400K vocabulary, 300 dimensions)\r\n% from <https:\/\/nlp.stanford.edu\/projects\/glove\/ GloVe>.\r\n% * Download the\r\n% <http:\/\/www.cs.uic.edu\/~liub\/FBS\/opinion-lexicon-English.rar sentiment\r\n% lexicon> from\r\n% <https:\/\/www.cs.uic.edu\/~liub\/FBS\/sentiment-analysis.html#lexicon\r\n% University of Illinois at Chicago>\r\n% * Download the data from the <https:\/\/www.kaggle.com\/airbnb\/boston\r\n% Boston Airbnb Open Data page> on <https:\/\/www.kaggle.com Kaggle>\r\n% * Download my custom function\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2017\/load_lexicon.m\r\n% load_lexicon.m> and class\r\n% <https:\/\/blogs.mathworks.com\/images\/loren\/2017\/sentiment.m sentiment.m> as\r\n% well as <https:\/\/blogs.mathworks.com\/images\/loren\/2017\/boston_map.mat the\r\n% raster map of Boston>\r\n% \r\n% Please extract the content from the archive files into your current folder. 
\r\n%\r\n%% Loading a Pre-Trained Word Embedding from GloVe\r\n% You can use the function \r\n% |<https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/readwordembedding.html\r\n% readWordEmbedding>| in Text Analytics Toolbox to read pre-trained word\r\n% embeddings. To see a word vector, use\r\n% |<https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/wordembedding.word2vec.html\r\n% word2vec>| to get the vector representation of a given word. Because the \r\n% dimension for this embedding is 300, we get a vector of 300 elements for \r\n% each word.\r\n\r\nfilename = \"glove.6B.300d\";              \r\nif exist(filename + '.mat', 'file') ~= 2          \r\n    emb = readWordEmbedding(filename + '.txt');  \r\n    save(filename + '.mat', 'emb', '-v7.3');       \r\nelse                                         \r\n    load(filename + '.mat')                     \r\nend\r\nv_king = word2vec(emb,'king')';             \r\nwhos v_king\r\n\r\n%% Vector Math Example\r\n% Let's try the vector math! Here is another famous example:\r\n%\r\n% $$paris - france + poland \\approx warsaw$$\r\n%\r\n% Apparently, the vector subtraction \"paris - france\" encodes the concept\r\n% of \"capital\" and if you add \"poland\", you get \"warsaw\".\r\n% \r\n% Let's try it with MATLAB. 
|word2vec| returns vectors for given words in\r\n% the word embedding, and\r\n% |<https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/wordembedding.vec2word.html\r\n% vec2word>| finds the closest words to the vectors in the word embedding.\r\n\r\nv_paris = word2vec(emb,'paris');                  \r\nv_france = word2vec(emb,'france');              \r\nv_poland = word2vec(emb,'poland');            \r\nvec2word(emb, v_paris - v_france + v_poland)  \r\n\r\n%% Visualizing the Word Embedding\r\n% We would like to visualize this word embedding with a\r\n% |<https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/textscatter.html\r\n% textscatter>| plot, but it is hard to read if all 400,000 words\r\n% from the word embedding are included. I found a list of 4,000 English\r\n% nouns. Let's use only those words and reduce the dimensions from 300 to 2\r\n% using |<https:\/\/www.mathworks.com\/help\/stats\/tsne.html tsne>|\r\n% (t-Distributed Stochastic Neighbor Embedding) for dimensionality\r\n% reduction. To make it easier to see words, I zoomed into a specific area\r\n% of the plot that contains food-related words. 
You can see that related\r\n% words are placed close together.\r\n\r\nif exist('nouns.mat','file') ~= 2              \r\n    url = 'http:\/\/www.desiquintans.com\/downloads\/nounlist\/nounlist.txt';\r\n    nouns = webread(url);                      \r\n    nouns = split(nouns);                     \r\n    save('nouns.mat','nouns');     \r\nelse                              \r\n    load('nouns.mat')                      \r\nend\r\nnouns(~ismember(nouns,emb.Vocabulary)) = [];  \r\nvec = word2vec(emb,nouns);              \r\nrng('default'); % for reproducibility                   \r\nxy = tsne(vec);                      \r\n\r\nfigure                              \r\ntextscatter(xy,nouns)                         \r\ntitle('GloVe Word Embedding (6B.300d) - Food Related Area')\r\naxis([-35 -10 -36 -14]);                        \r\nset(gca,'clipping','off')                       \r\naxis off                                    \r\n\r\n%% Using Word Embeddings for Sentiment Analysis\r\n% For a practical application of word embeddings, let's consider\r\n% sentiment analysis. We would typically take advantage of pre-existing\r\n% sentiment lexicons such as\r\n% <https:\/\/www.cs.uic.edu\/~liub\/FBS\/sentiment-analysis.html#lexicon this\r\n% one from the University of Illinois at Chicago>. It comes with 2,006 positive\r\n% words and 4,783 negative words. Let's load the lexicon using the custom\r\n% function |load_lexicon|.\r\n% \r\n% If we just rely on the available words in the lexicon, we can only score\r\n% sentiment for 6,789 words. 
One idea to expand on this is to use the word\r\n% embedding to find words that are close to these sentiment words.\r\n\r\npos = load_lexicon('positive-words.txt');           \r\nneg = load_lexicon('negative-words.txt');     \r\n[length(pos) length(neg)]                      \r\n\r\n%% Word Embeddings Meet Machine Learning\r\n% What if we use word vectors as the training data to develop a classifier\r\n% that can score all words in the 400,000-word embedding? We can take\r\n% advantage of the fact that related words are close together in word\r\n% embeddings to do this. Let's make a sentiment classifier that takes\r\n% advantage of the vectors from the word embedding.\r\n%\r\n% As the first step, we will get vectors from the word embedding for words\r\n% in the lexicon to create a matrix of predictors with 300 columns, and\r\n% then use positive or negative sentiment labels as the response variable.\r\n% Here is the preview of the word, response variable and the first 7\r\n% predictor variables out of 300.\r\n\r\n% Drop words not in the embedding\r\npos = pos(ismember(pos,emb.Vocabulary));  \r\nneg = neg(ismember(neg,emb.Vocabulary));\r\n\r\n% Get corresponding word vectors\r\nv_pos = word2vec(emb,pos);      \r\nv_neg = word2vec(emb,neg);      \r\n\r\n% Initialize the table and add the data\r\ndata = table;                          \r\ndata.word = [pos;neg];                 \r\npred = [v_pos;v_neg];                  \r\ndata = [data array2table(pred)];       \r\ndata.resp = zeros(height(data),1);     \r\ndata.resp(1:length(pos)) = 1;          \r\n\r\n% Preview the table\r\nhead(data(:,[1,end,2:8 ]))                \r\n\r\n%% Prepare Data for Machine Learning\r\n% Let's partition the data into a training set and holdout set for\r\n% performance evaluation. 
The holdout set contains 30% of the available\r\n% data.\r\n\r\nrng('default') % for reproducibility\r\nc = cvpartition(data.resp,'Holdout',0.3);\r\ntrain = data(training(c),2:end);         \r\nXtest = data(test(c),2:end-1);           \r\nYtest = data.resp(test(c));              \r\nLtest = data(test(c),1);                 \r\nLtest.label = Ytest;                     \r\n\r\n%% Training and Evaluating the Sentiment Classifier\r\n% We want to build a classifier that can separate positive words and\r\n% negative words in the vector space defined by the word embedding. For a\r\n% quick performance evaluation, I chose the fast and easy linear\r\n% discriminant among possible machine learning algorithms.\r\n% \r\n% Here is the <https:\/\/en.wikipedia.org\/wiki\/Confusion_matrix confusion\r\n% matrix> of this model. The result was 91.1% classification accuracy. Not\r\n% bad.\r\n\r\n% Train model\r\nmdl = fitcdiscr(train,'resp');   \r\n\r\n% Predict on test data\r\nYpred = predict(mdl,Xtest);                  \r\ncf = confusionmat(Ytest,Ypred);  \r\n\r\n% Display results\r\nfigure                                         \r\nvals = {'Negative','Positive'};           \r\nheatmap(vals,vals,cf);                      \r\nxlabel('Predicted Label')           \r\nylabel('True Label')                  \r\ntitle({'Confusion Matrix of Linear Discriminant'; ...\r\n    sprintf('Classification Accuracy %.1f%%', ...\r\n    sum(cf(logical(eye(2))))\/sum(sum(cf))*100)})   \r\n\r\n%%\r\n% Let's check the predicted sentiment score against the actual label. The\r\n% custom class |sentiment| uses the linear discriminant model to score\r\n% sentiment.\r\n%\r\n% The |scoreWords| method of the class scores words. A positive score represents\r\n% positive sentiment, and a negative score is negative. Now we can use\r\n% 400,000 words to score sentiment.\r\n\r\ndbtype sentiment.m 18:26\r\n\r\n%%\r\n% Let's test this custom class. 
If the label is 0 and score is negative or\r\n% the label is 1 and score is positive, then the model classified the word\r\n% correctly. Otherwise, the word was misclassified. \r\n% \r\n% Here is the table that shows 10 examples from the test set:\r\n% \r\n% * the word\r\n% * its sentiment label (0 = negative, 1 = positive)\r\n% * its sentiment score (negative = negative, positive = positive)\r\n% * evaluation (true = correct, false = incorrect)\r\n%\r\n\r\nsent = sentiment(emb,mdl);                       \r\nLtest.score = sent.scoreWords(Ltest.word);  \r\nLtest.eval = Ltest.score > 0 == Ltest.label;  \r\ndisp(Ltest(randsample(height(Ltest),10),:))   \r\n\r\n%%\r\n% Now we need a way to score the sentiment of human-language text, rather\r\n% than a single word. The |scoreText| method of the sentiment class\r\n% averages the sentiment scores of each word in the text. This may not be\r\n% the best way to do it, but it's a simple place to start.\r\n\r\ndbtype sentiment.m 28:33\r\n\r\n%%\r\n% Here are the sentiment scores on sentences given by the |scoreText| method -\r\n% very positive, somewhat positive, and negative.\r\n\r\n[sent.scoreText('this is fantastic') ...         \r\nsent.scoreText('this is okay') ...     \r\nsent.scoreText('this sucks')]           \r\n\r\n%% Boston Airbnb Open Data \r\n% Let's try this on review data from the Boston Airbnb Open Data page on\r\n% Kaggle. First, we would like to see what people say in their reviews as a\r\n% <https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/ldamodel.wordcloud.html\r\n% word cloud>. 
Text Analytics Toolbox provides functionality to simplify\r\n% text preprocessing workflows, such as\r\n% |<https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/tokenizeddocument.html\r\n% tokenizedDocument>| which parses documents into an array of tokens, and\r\n% |<https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/bagofwords.html\r\n% bagOfWords>| that generates the term frequency count model (this can be\r\n% used to build a machine learning model).\r\n%\r\n% The commented-out code will generate the word cloud shown at the top of\r\n% this post. However, you can also generate word clouds using two-word\r\n% phrases known as bigrams. You can generate bigrams with\r\n% |<https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/tokenizeddocument.docfun.html\r\n% docfun>|, which operates on the array of tokens. You can also see that it\r\n% is possible to generate trigrams and other\r\n% <https:\/\/en.wikipedia.org\/wiki\/N-gram n-grams> by modifying the function\r\n% handle.\r\n%\r\n% It seems a lot of comments were about locations!\r\n\r\nopts = detectImportOptions('listings.csv');       \r\nl = readtable('listings.csv',opts);               \r\nreviews = readtable('reviews.csv');              \r\ncomments = tokenizedDocument(reviews.comments);   \r\ncomments = lower(comments);                       \r\ncomments = removeWords(comments,stopWords);        \r\ncomments = removeShortWords(comments,2);   \r\ncomments = erasePunctuation(comments);  \r\n\r\n% == uncomment to generate a word cloud ==\r\n% bag = bagOfWords(comments);                       \r\n% figure                                          \r\n% wordcloud(bag);                                    \r\n% title('AirBnB Review Word Cloud')          \r\n\r\n% Generate a Bigram Word Cloud\r\nf = @(s)s(1:end-1) + \" \" + s(2:end);           \r\nbigrams = docfun(f,comments);                 \r\nbag2 = bagOfWords(bigrams);                 \r\nfigure                                     \r\nwordcloud(bag2);      
                           \r\ntitle('AirBnB Review Bigram Cloud')         \r\n\r\n%% Airbnb Review Ratings\r\n% Review ratings are also available, but they are heavily skewed towards\r\n% 100, meaning the vast majority of listings are just perfectly wonderful\r\n% (really?). As <https:\/\/xkcd.com\/1098\/ this XKCD comic> shows, review\r\n% ratings suffer from\r\n% <http:\/\/sloanreview.mit.edu\/article\/the-problem-with-online-ratings-2\/\r\n% the problem with online ratings>, so they are not very useful here.\r\n\r\nfigure                                           \r\nhistogram(l.review_scores_rating)               \r\ntitle('Distribution of AirBnB Review Ratings')  \r\nxlabel('Review Ratings')                   \r\nylabel('# Listings')                            \r\n\r\n%% Computing Sentiment Scores\r\n% Now let's score the sentiment of Airbnb listing reviews instead. Since a\r\n% listing can have any number of reviews, I use the median sentiment\r\n% score per listing. The median sentiment scores in Boston are generally in\r\n% the positive range, and they follow a roughly normal distribution. This\r\n% looks more realistic.\r\n\r\n% Score the reviews\r\nf = @(str) sent.scoreText(str);                    \r\nreviews.sentiment = cellfun(f,reviews.comments);  \r\n\r\n% Calculate the median review score by listing\r\n[G,listings] = findgroups(reviews(:,'listing_id'));\r\nlistings.sentiment = splitapply(@median, ...       \r\n    reviews.sentiment,G);\r\n\r\n% Visualize the results\r\nfigure                 \r\nhistogram(listings.sentiment)                       \r\ntitle('Sentiment by Boston AirBnB Listing')         \r\nxlabel('Median Sentiment Score')                    \r\nylabel('Number of Listings')                        \r\n\r\n%% Sentiment by Location\r\n% The bigram cloud showed reviewers often commented on location and\r\n% distance. 
You can use latitude and longitude of the listings to see where\r\n% listings with very high or low sentiment scores are located. If you see\r\n% clusters of high scores, perhaps they may indicate good locations to\r\n% stay in.\r\n\r\n% Join sentiment scores and listing info\r\njoined = innerjoin( ...                            \r\n    listings,l(:,{'id','latitude','longitude', ...  \r\n    'neighbourhood_cleansed'}), ...\r\n    'LeftKeys','listing_id','RightKeys','id');\r\njoined.Properties.VariableNames{end} = 'ngh';       \r\n\r\n% Discard listings with a NaN sentiment score\r\njoined(isnan(joined.sentiment),:) = [];      \r\n\r\n% Discretize the sentiment scores into buckets\r\njoined.cat = discretize(joined.sentiment,0:0.25:1, ...\r\n    'categorical',{'< 0.25','< 0.50','< 0.75','<=1.00'});\r\n\r\n% Remove undefined categories\r\ncats = categories(joined.cat);\r\njoined(isundefined(joined.cat),:) = [];\r\n\r\n% Variable for color\r\ncolorlist = winter(length(cats));\r\n\r\n% Generate the plot\r\nlatlim = [42.300 42.386];        \r\nlonlim = [-71.1270 -71.0174];    \r\nload boston_map.mat\r\nfigure             \r\nimagesc(lonlim,latlim, map) \r\nhold on                     \r\ngscatter(joined.longitude,joined.latitude,joined.cat,colorlist,'o')\r\nhold off                    \r\ndar = [1, cosd(mean(latlim)), 1];\r\ndaspect(dar)                     \r\nset(gca,'ydir','normal');        \r\naxis([lonlim,latlim])                               \r\ntitle('Sentiment Scores by Boston Airbnb Listing')\r\n[g,ngh] = findgroups(joined(:,'ngh'));            \r\nngh.Properties.VariableNames{end} = 'name';       \r\nngh.lat = splitapply(@mean,joined.latitude,g);    \r\nngh.lon = splitapply(@mean,joined.longitude,g);   \r\n\r\n% 
Annotations\r\ntext(ngh.lon(2),ngh.lat(2),ngh.name(2),'Color','w')\r\ntext(ngh.lon(4),ngh.lat(4),ngh.name(4),'Color','w')\r\ntext(ngh.lon(6),ngh.lat(6),ngh.name(6),'Color','w')\r\ntext(ngh.lon(11),ngh.lat(11),ngh.name(11),'Color','w')\r\ntext(ngh.lon(13),ngh.lat(13),ngh.name(13),'Color','w')\r\ntext(ngh.lon(17),ngh.lat(17),ngh.name(17),'Color','w')\r\ntext(ngh.lon(18),ngh.lat(18),ngh.name(18),'Color','w')\r\ntext(ngh.lon(22),ngh.lat(22),ngh.name(22),'Color','w')\r\n\r\n%% Summary\r\n% In this post, I focused on word embeddings and sentiment analysis as an\r\n% example of new features available in Text Analytics Toolbox. Hopefully\r\n% you saw that the toolbox makes advanced text processing techniques very\r\n% accessible. You can do more with word embeddings besides sentiment\r\n% analysis, and the toolbox offers many more features besides word\r\n% embeddings, such as\r\n% <https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/lsamodel.html Latent\r\n% Semantic Analysis> or\r\n% <https:\/\/www.mathworks.com\/help\/textanalytics\/ref\/ldamodel.html Latent\r\n% Dirichlet Allocation>.\r\n%\r\n% Hopefully I have more opportunities to discuss those other interesting\r\n% features in Text Analytics Toolbox in the future.\r\n%\r\n% Get a free trial version to play with it and let us know what you think\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=2422#respond here>!\r\n##### SOURCE END ##### 755624bb94664840921f2c1b9f97b830\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/glove_airbnb2_sbd_06.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>Text data has become an important part of <a href=\"https:\/\/www.mathworks.com\/solutions\/data-analytics.html\">data analytics<\/a>,  thanks to advances in natural language processing that transform unstructured text into meaningful data. 
The new <a href=\"https:\/\/www.mathworks.com\/products\/text-analytics.html\">Text Analytics Toolbox<\/a> provides tools to process and analyze text data in MATLAB.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2017\/09\/21\/math-with-words-word-embeddings-with-matlab-and-text-analytics-toolbox\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[43,48,72],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2422"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=2422"}],"version-history":[{"count":6,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2422\/revisions"}],"predecessor-version":[{"id":4698,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2422\/revisions\/4698"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=2422"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=2422"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=2422"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}