{"id":1089,"date":"2015-01-29T08:20:54","date_gmt":"2015-01-29T13:20:54","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1089"},"modified":"2015-01-16T16:21:10","modified_gmt":"2015-01-16T21:21:10","slug":"introduction-to-market-basket-analysis","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2015\/01\/29\/introduction-to-market-basket-analysis\/","title":{"rendered":"Introduction to Market Basket Analysis"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>You probably heard about the \"beer and diapers\" story as the often quoted example of what data mining can achieve. It goes like this: some supermarket placed beer next to diapers and got more business because they mined their sales data and found that men often bought those two items together.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/shopping.jpg\" alt=\"\"> <\/p><p>Today's guest blogger, Toshi Takeuchi, who works in the web marketing team here at MathWorks, gives you a quick introduction to how such analysis is actually done, and will follow up with how you can scale it for larger dataset with <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/mapreduce.html\"><tt>MapReduce<\/tt><\/a> (new feature in R2014b) in a future post.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#5a43eca4-e5f3-433b-a70f-1b8b98a64a26\">Motivation: \"Introduction to Data Mining\" PDF<\/a><\/li><li><a href=\"#ab5b88d4-d4d6-48d2-be64-20a39b920a07\">Itemset and Support<\/a><\/li><li><a href=\"#2c86034d-fbb0-49f2-bc97-accab62a691f\">Association Rules, Confidence and Lift<\/a><\/li><li><a href=\"#dd75d035-2129-4417-b848-b2725411df56\">Apriori Algorithm<\/a><\/li><li><a href=\"#7a226213-f05b-4be7-9bcb-2bb1bc0195a0\">Rule Generation Algorithm<\/a><\/li><li><a href=\"#b19b52db-d70c-4b33-8503-87085fa3c2d3\">Test Example: Congressional Voting Records<\/a><\/li><li><a 
href=\"#ba3ed37d-7359-41d2-9875-1e3ffdf3f829\">Visualization of Metrics<\/a><\/li><li><a href=\"#a67720f0-2712-4a6d-a4a3-2d092049b7d3\">Closing<\/a><\/li><li><a href=\"#b703f9d8-ac27-4065-aaca-bda29e260495\">Appendix: Code used for this post<\/a><\/li><\/ul><\/div><h4>Motivation: \"Introduction to Data Mining\" PDF<a name=\"5a43eca4-e5f3-433b-a70f-1b8b98a64a26\"><\/a><\/h4><p>I have been interested in Market Basket Analysis not because I work at a supermarket but because it can be used for <b>web usage pattern mining<\/b> among many applications. Fortunately, I came across a good introduction in <a href=\"http:\/\/www-users.cs.umn.edu\/~kumar\/dmbook\/ch6.pdf\">Chapter 6<\/a> (sample chapter available for free download) of <a href=\"http:\/\/www-users.cs.umn.edu\/~kumar\/dmbook\/index.php\">Introduction to Data Mining<\/a>. I obviously had to give it a try with MATLAB!<\/p><p>Let's start by loading the example dataset used in the textbook. Raw transaction data is turned into a binary matrix representation with 1 meaning the given item was in the given transaction, but otherwise 0.<\/p><pre class=\"codeinput\">transactions = {{<span class=\"string\">'Bread'<\/span>,<span class=\"string\">'Milk'<\/span>};<span class=\"keyword\">...<\/span>\r\n    {<span class=\"string\">'Bread'<\/span>,<span class=\"string\">'Diapers'<\/span>,<span class=\"string\">'Beer'<\/span>,<span class=\"string\">'Eggs'<\/span>};<span class=\"keyword\">...<\/span>\r\n    {<span class=\"string\">'Milk'<\/span>,<span class=\"string\">'Diapers'<\/span>,<span class=\"string\">'Beer'<\/span>,<span class=\"string\">'Cola'<\/span>};<span class=\"keyword\">...<\/span>\r\n    {<span class=\"string\">'Bread'<\/span>,<span class=\"string\">'Milk'<\/span>,<span class=\"string\">'Diapers'<\/span>,<span class=\"string\">'Beer'<\/span>};<span class=\"keyword\">...<\/span>\r\n    {<span class=\"string\">'Bread'<\/span>,<span class=\"string\">'Milk'<\/span>,<span class=\"string\">'Diapers'<\/span>,<span 
class=\"string\">'Cola'<\/span>}};\r\n\r\nitems = unique([transactions{:}]);\r\nT = zeros(size(transactions,1), length(items));\r\n<span class=\"keyword\">for<\/span> i = 1:size(transactions,1)\r\n    T(i,ismember(items,transactions{i,:})) = 1;\r\n<span class=\"keyword\">end<\/span>\r\ndisp(array2table(T,<span class=\"string\">'VariableNames'<\/span>,items,<span class=\"string\">'RowNames'<\/span>,{<span class=\"string\">'T1'<\/span>,<span class=\"string\">'T2'<\/span>,<span class=\"string\">'T3'<\/span>,<span class=\"string\">'T4'<\/span>,<span class=\"string\">'T5'<\/span>}))\r\n<\/pre><pre class=\"codeoutput\">          Beer    Bread    Cola    Diapers    Eggs    Milk\r\n          ____    _____    ____    _______    ____    ____\r\n    T1    0       1        0       0          0       1   \r\n    T2    1       1        0       1          1       0   \r\n    T3    1       0        1       1          0       1   \r\n    T4    1       1        0       1          0       1   \r\n    T5    0       1        1       1          0       1   \r\n<\/pre><h4>Itemset and Support<a name=\"ab5b88d4-d4d6-48d2-be64-20a39b920a07\"><\/a><\/h4><p>Market Basket Analysis is also known as Association Analysis or Frequent Itemset Mining. Let&#8217;s go over some basic concepts: itemsets, support, confidence, and lift.<\/p><p>The column headers of the table show all the items in this tiny dataset. A subset of those items in any combination is an <b>itemset<\/b>. An itemset can contain anywhere from zero items to all the items in the dataset.<\/p><pre>0 items  --&gt; null itemset (usually ignored)\r\n1 item   --&gt; 1-itemset\r\n2 items  --&gt; 2-itemset\r\n      :\r\nk items  --&gt; k-itemset<\/pre><p>This is, for example, a 3-itemset:<\/p><pre>{Beer, Diapers, Milk}<\/pre><p>A transaction can contain multiple itemsets as its subsets. 
For example, an itemset <tt>{Bread, Diapers}<\/tt> is contained in T2, but not <tt>{Bread, Milk}<\/tt>.<\/p><p><b>Support count<\/b> is the number of transactions in which a given itemset appears. The frequency of its appearance is given by a metric called <b>support<\/b>.<\/p><pre>Support = itemset support count \/ number of transactions<\/pre><pre class=\"codeinput\">itemset = {<span class=\"string\">'Beer'<\/span>,<span class=\"string\">'Diapers'<\/span>,<span class=\"string\">'Milk'<\/span>}\r\n<span class=\"comment\">% get col indices of items in itemset<\/span>\r\ncols = ismember(items,itemset);\r\nN = size(T,1); fprintf(<span class=\"string\">'Number of transactions = %d\\n'<\/span>,N)\r\n<span class=\"comment\">% count rows that include all items<\/span>\r\nsupportCount = sum(all(T(:,cols),2));\r\nfprintf(<span class=\"string\">'Support count for this itemset = %d\\n'<\/span>,supportCount)\r\nitemSetSupport = supportCount\/size(T,1);\r\nfprintf(<span class=\"string\">'Support = %.2f\\n'<\/span>,itemSetSupport)\r\n<\/pre><pre class=\"codeoutput\">itemset = \r\n    'Beer'    'Diapers'    'Milk'\r\nNumber of transactions = 5\r\nSupport count for this itemset = 2\r\nSupport = 0.40\r\n<\/pre><h4>Association Rules, Confidence and Lift<a name=\"2c86034d-fbb0-49f2-bc97-accab62a691f\"><\/a><\/h4><p>Association rules are made up of antecedents (ante) and consequents (conseq) and take the following form:<\/p><pre>{ante} =&gt; {conseq}<\/pre><p>A k-itemset where k &gt; 1 can be split into ante and conseq to form such a rule. 
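Before picking a particular split, it may help to see how many rules a single k-itemset can generate. Here is a quick sketch in Python (illustration only; the post itself works in MATLAB, and the names below are my own):

```python
from itertools import combinations

itemset = {'Beer', 'Diapers', 'Milk'}

# Every nonempty proper subset can serve as the ante;
# the remaining items form the conseq.
rules = []
for r in range(1, len(itemset)):
    for ante in combinations(sorted(itemset), r):
        conseq = tuple(sorted(itemset - set(ante)))
        rules.append((ante, conseq))

for ante, conseq in rules:
    print(set(ante), '=>', set(conseq))
```

A k-itemset yields 2^k - 2 candidate rules (every split except the two with an empty side), so the 3-itemset above yields 6.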
Here is an example:<\/p><pre class=\"codeinput\">ante = {<span class=\"string\">'Diapers'<\/span>,<span class=\"string\">'Milk'<\/span>}; <span class=\"comment\">% pick some subset itemset as |ante|<\/span>\r\n<span class=\"comment\">% get items not in |ante|<\/span>\r\nconseq = setdiff(itemset,ante);\r\nfprintf(<span class=\"string\">'Itemset: {%s, %s, %s}\\n'<\/span>, itemset{:})\r\nfprintf(<span class=\"string\">'Ante   : {%s, %s}\\n'<\/span>,ante{:})\r\nfprintf(<span class=\"string\">'Conseq : {%s}\\n'<\/span>,conseq{:})\r\nfprintf(<span class=\"string\">'Rule   : {%s, %s} =&gt; {%s}\\n'<\/span>, ante{:},conseq{:})\r\n<\/pre><pre class=\"codeoutput\">Itemset: {Beer, Diapers, Milk}\r\nAnte   : {Diapers, Milk}\r\nConseq : {Beer}\r\nRule   : {Diapers, Milk} =&gt; {Beer}\r\n<\/pre><p>You can think of this rule as \"when diapers and milk appear in the same transaction, you see beer in the same transaction as well at some frequency\". How do you determine which randomly generated association rules are actually meaningful?<\/p><p>The most basic measure of rules is <b>confidence<\/b>, which tells us how often a given rule applies within the transactions that contain the ante.<\/p><p>A given rule applies when all items from both antes and conseqs are present in a transaction, so it is the same thing as an itemset that contains the same items. 
We can use the support metric for the itemset to compute confidence of the rule.<\/p><pre>Confidence = itemset support \/ ante support<\/pre><pre class=\"codeinput\">cols = ismember(items,ante);\r\nanteCount = sum(all(T(:,cols),2));\r\nfprintf(<span class=\"string\">'Support Count for Ante = %d\\n'<\/span>,anteCount)\r\nanteSupport = anteCount\/N;\r\nfprintf(<span class=\"string\">'Support for Ante = %.2f\\n'<\/span>,anteSupport)\r\nconfidence = itemSetSupport\/anteSupport;\r\nfprintf(<span class=\"string\">'Confidence = %.2f (= itemset support \/ ante support)\\n'<\/span>,confidence)\r\n<\/pre><pre class=\"codeoutput\">Support Count for Ante = 3\r\nSupport for Ante = 0.60\r\nConfidence = 0.67 (= itemset support \/ ante support)\r\n<\/pre><p>Another measure of rules is <b>lift<\/b>. It compares the observed frequency of ante and conseq appearing together against the frequency expected if they occurred independently, as a measure of interestingness. If lift is 1, then ante and conseq occur together just as often as expected by chance and there is no special relationship. If it is larger than 1, then lift tells us how strongly ante and conseq are dependent on each other. 
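The confidence arithmetic above is easy to cross-check outside MATLAB. Here is a minimal Python sketch using the same five transactions (illustration only, not code from the post):

```python
# The five transactions from the example dataset.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / N

itemset = {'Beer', 'Diapers', 'Milk'}
ante = {'Diapers', 'Milk'}
confidence = support(itemset) / support(ante)
print(f'Confidence = {confidence:.2f}')  # prints Confidence = 0.67
```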
We can use respective support metrics to make this comparison.<\/p><pre>Lift = itemset support \/ (ante support x conseq support)<\/pre><pre class=\"codeinput\">cols = ismember(items,conseq);\r\nconseqCount = sum(all(T(:,cols),2));\r\nfprintf(<span class=\"string\">'Support Count for Conseq = %d\\n'<\/span>,conseqCount)\r\nconseqSupport = conseqCount\/N;\r\nfprintf(<span class=\"string\">'Support for Conseq = %.2f\\n'<\/span>,conseqSupport)\r\nlift = itemSetSupport\/(anteSupport*conseqSupport);\r\nfprintf(<span class=\"string\">'Lift = %.2f (= itemset support \/ (ante support x conseq support))\\n'<\/span>,lift)\r\n<\/pre><pre class=\"codeoutput\">Support Count for Conseq = 3\r\nSupport for Conseq = 0.60\r\nLift = 1.11 (= itemset support \/ (ante support x conseq support))\r\n<\/pre><h4>Apriori Algorithm<a name=\"dd75d035-2129-4417-b848-b2725411df56\"><\/a><\/h4><p>Now that we know the basic concepts, we can define the goal of our analysis: finding association rules with a sufficient level of support (happens often enough) and confidence (association is strong). Lift can be a secondary measure to score the interestingness of the rules found.<\/p><div><ol><li>Generate frequent itemsets that clear the minimum support threshold recursively from 1-itemsets to higher level itemsets, pruning candidates along the way<\/li><li>Generate rules that clear the minimum confidence threshold in a similar way<\/li><\/ol><\/div><p>In a brute force method, you would calculate the support and confidence of all possible itemset combinations, but that would be computationally expensive, because the number of candidates grows exponentially.<\/p><p>The apriori algorithm addresses this issue by generating candidates selectively. To get an intuition, think about the frequency of an itemset that contains some infrequent items. That itemset will never be more frequent than the least frequent item it contains. 
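This anti-monotone property of support (a superset can never be more frequent than any of its subsets) can be verified by brute force on the toy dataset. A Python sketch, for illustration only:

```python
from itertools import combinations

# The five transactions from the example dataset.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
# Look for a 3-itemset that is MORE frequent than one of its 2-subsets.
violations = [(a, b)
              for a in combinations(items, 2)
              for b in combinations(items, 3)
              if set(a) <= set(b) and support(set(b)) > support(set(a))]
print(len(violations))  # prints 0
```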
So if you construct your candidates by combining the frequent itemsets only, starting from 1-itemset and continue to higher levels, then you avoid creating useless candidates.<\/p><p>Let's start with generating frequent itemsets and get their support measures. I wrote a couple of MATLAB functions to implement this algorithm.<\/p><pre class=\"codeinput\">minSup = 0.6; <span class=\"comment\">% minimum support threshold 0.6<\/span>\r\n[F,S] = findFreqItemsets(transactions,minSup);\r\nfprintf(<span class=\"string\">'Minimum Support        : %.2f\\n'<\/span>, minSup)\r\nfprintf(<span class=\"string\">'Frequent Itemsets Found: %d\\n'<\/span>, sum(arrayfun(@(x) length(x.freqSets), F)))\r\nfprintf(<span class=\"string\">'Max Level Reached      : %d-itemsets\\n'<\/span>, length(F))\r\nfprintf(<span class=\"string\">'Number of Support Data : %d\\n'<\/span>, length(S))\r\n<\/pre><pre class=\"codeoutput\">Minimum Support        : 0.60\r\nFrequent Itemsets Found: 8\r\nMax Level Reached      : 2-itemsets\r\nNumber of Support Data : 13\r\n<\/pre><p>When we computed support for each itemset we evaluated, we stored the result in a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/overview-of-the-map-data-structure.html\">Map object<\/a> <tt>S<\/tt>. This is used for rule generation in order to avoid recalculating support as part of the confidence computation. You can now retrieve support for a given itemset with this object, by supplying the string representation of the itemset as the key. 
Let's try [2,4,6]:<\/p><pre class=\"codeinput\">itemset = [2,4,6];\r\nfprintf(<span class=\"string\">'Support for the itemset {%s %s %s}: %.2f\\n'<\/span>,items{itemset(:)},S(num2str(itemset)))\r\n<\/pre><pre class=\"codeoutput\">Support for the itemset {Bread Diapers Milk}: 0.40\r\n<\/pre><p>This itemset clearly didn't meet the minimum support criterion of 0.60.<\/p><h4>Rule Generation Algorithm<a name=\"7a226213-f05b-4be7-9bcb-2bb1bc0195a0\"><\/a><\/h4><p>We saw earlier that you can generate rule candidates from frequent itemsets by splitting their contents into antes and conseqs, and computing their confidence.<\/p><p>Generating every possible candidate by brute force would be very time consuming. The Apriori principle can also be used to generate rules selectively. Suppose this rule has low confidence:<\/p><pre>{Beer, Diapers} =&gt; {Milk}<\/pre><p>Then any other rule generated from this itemset that contains {Milk} in its consequent will also have low confidence.<\/p><pre>{Beer} =&gt; {Diapers, Milk}\r\n{Diapers} =&gt; {Beer, Milk}<\/pre><p>Why? 
Because support for those antes will always be at least as great as that of the original ante {Beer, Diapers}, while the support for the itemset (and hence for the rule) remains the same, and confidence is the ratio of the rule's support to the ante's support.<\/p><p>We can take advantage of this intuition by first generating rules with only one item in the conseq and dropping those that don't meet the minimum confidence threshold, and then merging the surviving conseqs to generate rules with two items in the conseq, and so forth.<\/p><p>Now we can generate association rules from the frequent itemsets we found in the previous step.<\/p><pre class=\"codeinput\">minConf = 0.8; <span class=\"comment\">% minimum confidence threshold 0.8<\/span>\r\nrules = generateRules(F,S,minConf);\r\nfprintf(<span class=\"string\">'Minimum Confidence     : %.2f\\n'<\/span>, minConf)\r\nfprintf(<span class=\"string\">'Rules Found            : %d\\n\\n'<\/span>, length(rules))\r\n\r\n<span class=\"keyword\">for<\/span> i = 1:length(rules)\r\n    disp([sprintf(<span class=\"string\">'{%s}'<\/span>,items{rules(i).Ante}),<span class=\"string\">' =&gt; '<\/span>,<span class=\"keyword\">...<\/span>\r\n        sprintf(<span class=\"string\">'{%s}'<\/span>, items{rules(i).Conseq}),<span class=\"keyword\">...<\/span>\r\n        sprintf(<span class=\"string\">'     Conf: %.2f '<\/span>,rules(i).Conf),<span class=\"keyword\">...<\/span>\r\n        sprintf(<span class=\"string\">'Lift: %.2f '<\/span>,rules(i).Lift),<span class=\"keyword\">...<\/span>\r\n        sprintf(<span class=\"string\">'Sup: %.2f'<\/span>,rules(i).Sup)])\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">Minimum Confidence     : 0.80\r\nRules Found            : 1\r\n\r\n{Beer} =&gt; {Diapers}     Conf: 1.00 Lift: 1.25 Sup: 0.60\r\n<\/pre><p>With a minimum support of 0.6 and a minimum confidence of 0.8, we found only one rule that clears those thresholds: {Beer} =&gt; {Diapers}. 
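That single surviving rule can be verified by hand, or with a short Python cross-check against the five example transactions (illustration only, not part of the post's MATLAB code):

```python
# The five transactions from the example dataset.
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Diapers', 'Beer', 'Eggs'},
    {'Milk', 'Diapers', 'Beer', 'Cola'},
    {'Bread', 'Milk', 'Diapers', 'Beer'},
    {'Bread', 'Milk', 'Diapers', 'Cola'},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

ante, conseq = {'Beer'}, {'Diapers'}
sup = support(ante | conseq)                    # 0.60
conf = sup / support(ante)                      # 1.00
lift = sup / (support(ante) * support(conseq))  # 1.25
print(f'Conf: {conf:.2f} Lift: {lift:.2f} Sup: {sup:.2f}')
```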
The confidence is 1.00, which means we see diapers in all transactions that include beer, and the lift is 1.25, so this rule is fairly interesting.<\/p><h4>Test Example: Congressional Voting Records<a name=\"b19b52db-d70c-4b33-8503-87085fa3c2d3\"><\/a><\/h4><p>As a non-supermarket use case of Market Basket Analysis, the textbook uses the <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Congressional+Voting+Records\">1984 Congressional Voting Records dataset<\/a> from the UCI Machine Learning Repository. We can test our code against the result in the textbook.<\/p><p>For the purpose of Market Basket Analysis, you can think of individual members of congress as transactions and bills they voted for and their party affiliation as items.<\/p><pre class=\"codeinput\">getData\r\n<\/pre><pre class=\"codeoutput\">       party        handicapped_infants    water_project_cost_sharing    adoption_of_the_budget_resolution\r\n    ____________    ___________________    __________________________    _________________________________\r\n    'republican'    'n'                    'y'                           'n'                              \r\n    'republican'    'n'                    'y'                           'n'                              \r\n    'democrat'      '?'                    
'y'                           'y'                              \r\n    'democrat'      'n'                    'y'                           'y'                              \r\n    'democrat'      'y'                    'y'                           'y'                              \r\n<\/pre><p>Now we generate frequent itemsets with a minimum support threshold of 0.3, but we need to use a slightly lower value (0.2988) because support values are computed in double precision, and itemsets sitting right at the 0.3 boundary could otherwise be lost to rounding.<\/p><pre class=\"codeinput\">minSup = 0.2988;\r\n[F,S] = findFreqItemsets(votes,minSup);\r\nfprintf(<span class=\"string\">'Minimum Support        : %.2f\\n'<\/span>, minSup)\r\nfprintf(<span class=\"string\">'Frequent Itemsets Found: %d\\n'<\/span>, sum(arrayfun(@(x) length(x.freqSets), F)))\r\nfprintf(<span class=\"string\">'Max Level Reached      : %d-itemsets\\n'<\/span>, length(F))\r\nfprintf(<span class=\"string\">'Number of Support Data : %d\\n'<\/span>, length(S))\r\n<\/pre><pre class=\"codeoutput\">Minimum Support        : 0.30\r\nFrequent Itemsets Found: 1026\r\nMax Level Reached      : 7-itemsets\r\nNumber of Support Data : 2530\r\n<\/pre><p>Then we generate some rules.<\/p><pre class=\"codeinput\">minConf = 0.9; <span class=\"comment\">% minimum confidence threshold 0.9<\/span>\r\nrules = generateRules(F,S,minConf);\r\nfprintf(<span class=\"string\">'Minimum Confidence     : %.2f\\n'<\/span>, minConf)\r\nfprintf(<span class=\"string\">'Rules Found            : %d\\n\\n'<\/span>, length(rules))\r\n<\/pre><pre class=\"codeoutput\">Minimum Confidence     : 0.90\r\nRules Found            : 2942\r\n\r\n<\/pre><p>Finally, let's compare our results with the results from the textbook.<\/p><pre class=\"codeinput\">testAntes = [7,21,27;5,11,23;6,15,16;22,31,32];\r\ntestConseqs = [2,1,2,1];\r\ntestConf = [0.91,0.975,0.935,1];\r\n\r\n<span class=\"comment\">% Get the rules with 1-item conseq with party classification<\/span>\r\ndem = rules(arrayfun(@(x) isequal(x.Conseq,1),rules));\r\nrep = rules(arrayfun(@(x) 
isequal(x.Conseq,2),rules));\r\n\r\n<span class=\"comment\">% Compare the results<\/span>\r\n<span class=\"keyword\">for<\/span> i = 1:size(testAntes,1)\r\n    rec = dem(arrayfun(@(x) isequal(x.Ante,testAntes(i,:)),dem));\r\n    rec = [rec,rep(arrayfun(@(x) isequal(x.Ante,testAntes(i,:)),rep))];\r\n    disp([<span class=\"string\">'{'<\/span>, sprintf(<span class=\"string\">'%s, %s, %s'<\/span>,vars{rec.Ante}),<span class=\"string\">'} =&gt; '<\/span>,<span class=\"keyword\">...<\/span>\r\n        sprintf(<span class=\"string\">'{%s}'<\/span>, vars{rec.Conseq})])\r\n    fprintf(<span class=\"string\">'   Conf: %.2f Lift: %.2f Sup: %.2f\\n'<\/span>,rec.Conf,rec.Lift,rec.Sup)\r\n    <span class=\"keyword\">if<\/span> isequal(rec.Conseq,testConseqs(i));\r\n        fprintf(<span class=\"string\">'   Correct!  Expected Conf %.2f\\n\\n'<\/span>,testConf(i))\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">{El Salvador = Yes, Budget Resolution = No, Mx Missile = No} =&gt; {Republican}\r\n   Conf: 0.91 Lift: 2.36 Sup: 0.30\r\n   Correct!  Expected Conf 0.91\r\n\r\n{Budget Resolution = Yes, Mx Missile = Yes, El Salvador = No} =&gt; {Democrat}\r\n   Conf: 0.97 Lift: 1.59 Sup: 0.36\r\n   Correct!  Expected Conf 0.97\r\n\r\n{Physician Fee Freeze = Yes, Right To Sue = Yes, Crime = Yes} =&gt; {Republican}\r\n   Conf: 0.94 Lift: 2.42 Sup: 0.30\r\n   Correct!  Expected Conf 0.94\r\n\r\n{Physician Fee Freeze = No, Right To Sue = No, Crime = No} =&gt; {Democrat}\r\n   Conf: 1.00 Lift: 1.63 Sup: 0.31\r\n   Correct!  Expected Conf 1.00\r\n\r\n<\/pre><h4>Visualization of Metrics<a name=\"ba3ed37d-7359-41d2-9875-1e3ffdf3f829\"><\/a><\/h4><p>MATLAB has strong data visualization capabilities. 
Let's visualize <tt>conf<\/tt>, <tt>lift<\/tt> and <tt>support<\/tt>.<\/p><pre class=\"codeinput\">conf = arrayfun(@(x) x.Conf, rules); <span class=\"comment\">% get conf as a vector<\/span>\r\nlift = arrayfun(@(x) x.Lift, rules); <span class=\"comment\">% get lift as a vector<\/span>\r\nsup = arrayfun(@(x) x.Sup, rules); <span class=\"comment\">% get support as a vector<\/span>\r\ncolormap <span class=\"string\">cool<\/span>\r\nscatter(sup,conf,lift*30, lift, <span class=\"string\">'filled'<\/span>)\r\nxlabel(<span class=\"string\">'Support'<\/span>); ylabel(<span class=\"string\">'Confidence'<\/span>)\r\nt = colorbar(<span class=\"string\">'peer'<\/span>,gca);\r\nset(get(t,<span class=\"string\">'ylabel'<\/span>),<span class=\"string\">'String'<\/span>, <span class=\"string\">'Lift'<\/span>);\r\ntitle(<span class=\"string\">'1984 Congressional Voting Records'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/MarketBasketAnalysis1_01.png\" alt=\"\"> <p>You can see that the rules we generated match those in the textbook. The plot of support, confidence and lift is useful for identifying rules with high support and high confidence (upper right region of the plot) and high lift (more magenta). If you type <tt>gname<\/tt> at the MATLAB prompt, you can interactively identify the indices of those points, and hit Enter to end the interactive session.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/gname.png\" alt=\"\"> <\/p><p>What was Rule 51? 
Let's find out.<\/p><pre class=\"codeinput\">disp(<span class=\"string\">'Rule ID = 51'<\/span>)\r\nfprintf(<span class=\"string\">'{%s, %s} =&gt; {%s}, Conf: %.2f\\n'<\/span>,<span class=\"keyword\">...<\/span>\r\n    vars{rules(51).Ante(1)},vars{rules(51).Ante(2)},<span class=\"keyword\">...<\/span>\r\n    vars{rules(51).Conseq},rules(51).Conf)\r\n<\/pre><pre class=\"codeoutput\">Rule ID = 51\r\n{Budget Resolution = Yes, Physician Fee Freeze = No} =&gt; {Democrat}, Conf: 1.00\r\n<\/pre><h4>Closing<a name=\"a67720f0-2712-4a6d-a4a3-2d092049b7d3\"><\/a><\/h4><p>We used a small sample dataset in order to learn the basics of Market Basket Analysis. However, for real applications such as web usage pattern mining, we will typically be dealing with quite a large dataset. In a later post, I would like to figure out how to extend what we learned here with the help of MapReduce, introduced in R2014b.  Any thoughts about your creative use of \"Market Basket Analysis\" with your data? Let us know <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1089#respond\">here<\/a>.<\/p><h4>Appendix: Code used for this post<a name=\"b703f9d8-ac27-4065-aaca-bda29e260495\"><\/a><\/h4><p>Here is the code used for this post.<\/p><p>findFreqItemsets() generates frequent itemsets using the Apriori method.<\/p><pre class=\"codeinput\">type <span class=\"string\">findFreqItemsets.m<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction [F,S,items] = findFreqItemsets(transactions,minSup,oneItemsets)\r\n%FINDFREQITEMSETS generates frequent itemsets using Apriori method\r\n%   |transactions| is a nested cell array of transaction data or a table\r\n%   with |Value| column that contains such cell array. Each row contains a\r\n%   nested cell array of items in a single transaction. 
If supplied as a\r\n%   table, it also needs |Key| column with a cell array of transaction ids.\r\n%\r\n%   |minSup| is a scalar that represents the minimum support threshold.\r\n%   Itemsets that meet this criterion are 'frequent' itemsets.\r\n%\r\n%   |oneItemSets| is an optional table that lists all single items that\r\n%   appear in |transactions| along with their frequencies. Items are listed\r\n%   in |Key| column and corresponding counts in |Value| column.\r\n%\r\n%   |items| is a cell array of unique items. \r\n% \r\n%   |F| is a structure array of frequent itemsets that meet this\r\n%   criterion. Items are represented as indices of cell array |items|.\r\n%\r\n%   |S| is a Map object that maps itemsets to their support values. Items\r\n%   are represented as indices of cell array |items|.\r\n%\r\n%   To learn more about the underlying algorithm itself, please consult\r\n%   Ch6 of http:\/\/www-users.cs.umn.edu\/~kumar\/dmbook\/index.php \r\n\r\n    % check the number of input arguments\r\n    narginchk(2, 3)\r\n    if iscell(transactions)\r\n        transactions = table(num2cell(1:length(transactions))',transactions,'VariableNames',{'Key','Value'});\r\n    end\r\n    \r\n    if nargin == 2\r\n        items = transactions.Value;\r\n        [uniqItems,~,idx] = unique([items{:}]');\r\n        oneItemsets = table(uniqItems,num2cell(accumarray(idx,1)),'VariableNames',{'Key','Value'});\r\n    end\r\n    \r\n    % get the total number of transactions\r\n    N = height(transactions);\r\n    % sort the tables\r\n    transactions = sortrows(transactions,'Key');\r\n    oneItemsets = sortrows(oneItemsets,'Key');\r\n    % get all frequent 1-itemsets\r\n    [F,S,items] = getFreqOneItemsets(oneItemsets,N,minSup);\r\n    if isempty(F.freqSets)\r\n        fprintf('No frequent itemset found at minSup = %.2f\\n',minSup)\r\n        return\r\n    end\r\n    \r\n    % get all frequent k-itemsets where k &gt;= 2\r\n    k = 2;\r\n    while true\r\n        % generate 
candidate itemsets\r\n        Ck = aprioriGen(F(k-1).freqSets, k);\r\n        % prune candidates below minimum support threshold\r\n        [Fk, support] = pruneCandidates(transactions,Ck,N,items,minSup);\r\n\r\n        % update support data; if empty, exit the loop\r\n        if ~isempty(support)\r\n            % create a map object to store support data\r\n            mapS = containers.Map();\r\n            % convert vectors to chars for use as keys\r\n            for i = 1:length(support)\r\n                mapS(num2str(Ck(i,:))) = support(i);\r\n            end\r\n            % update S\r\n            S = [S; mapS];\r\n        else\r\n            break;\r\n        end\r\n        % store the frequent itemsets above minSup\r\n        % if empty, exit the loop\r\n        if ~isempty(Fk)\r\n            F(k).freqSets = Fk;\r\n            k = k + 1;\r\n        else\r\n            break;\r\n        end\r\n    end\r\n    \r\n\r\n    function [F1,S,C1]= getFreqOneItemsets(T,N,minSup)\r\n    % GETFREQONEITEMSETS generates all frequent 1-itemsets\r\n    %   1-items are generated from 1-itemset table |T| and pruned with the\r\n    %   minimum support threshold |minSup|.\r\n    \r\n        % get 1-itemset candidates\r\n        C1 = T.Key;\r\n        % get their count\r\n        count = cell2mat(T.Value);\r\n        % calculate support for all candidates\r\n        sup = count .\/N;\r\n        % create a map object and store support data\r\n        S = containers.Map();\r\n        for j = 1:length(C1)\r\n            S(num2str(j)) = sup(j);\r\n        end\r\n        % prune candidates by minSup\r\n        freqSet = find(sup &gt;= minSup);\r\n        % store result in a structure array\r\n        F1 = struct('freqSets',freqSet);\r\n    end\r\n\r\n    function [Fk,support] = pruneCandidates(T,Ck,N,items,minSup)\r\n    %PRUNECANDIDATES returns frequent k-itemsets \r\n    %   Compute support for each candidate in |Ck| by scanning\r\n    %   transactions table |T| to identify 
itemsets that clear the |minSup|\r\n    %   threshold\r\n\r\n        % calculate support count for all candidates\r\n        support = zeros(size(Ck,1),1);\r\n        % for each transaction\r\n        for l = 1:N\r\n            % get the item indices\r\n            t = find(ismember(items, T.Value{l}));\r\n            % increment the support count\r\n            support(all(ismember(Ck,t),2)) = support(all(ismember(Ck,t),2)) + 1;\r\n        end\r\n        % calculate support\r\n        support = support.\/N;\r\n        \r\n        % return the candidates that meet the criteria\r\n        Fk = Ck(support &gt;= minSup,:);\r\n    end\r\n\r\nend\r\n\r\n<\/pre><p>aprioriGen() is a helper function to findFreqItemsets() and generates candidate k-itemsets using the Apriori algorithm.<\/p><pre class=\"codeinput\">type <span class=\"string\">aprioriGen.m<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction Ck = aprioriGen(freqSets, k)\r\n% APRIORIGEN generates candidate k-itemsets using the Apriori algorithm\r\n%   This function implements the F_k-1 x F_k-1 method, which merges pairs of\r\n%   (k-1)-itemsets to generate new k-itemsets if the first (k-2) items are\r\n%   identical between the pair.\r\n%\r\n%   To learn more about the underlying algorithm itself, please consult\r\n%   Ch6 of http:\/\/www-users.cs.umn.edu\/~kumar\/dmbook\/index.php \r\n\r\n    % generate candidate 2-itemsets\r\n    if k == 2\r\n        Ck = combnk(freqSets,2);\r\n    else\r\n        % generate candidate k-itemsets (k &gt; 2)     \r\n        Ck = [];\r\n        numSets = size(freqSets,1);\r\n        % pick pairs of frequent itemsets to merge\r\n        for i = 1:numSets\r\n            for j = i+1:numSets\r\n                % compare the first k-2 items\r\n                pair1 = sort(freqSets(i,1:k-2));\r\n                pair2 = sort(freqSets(j,1:k-2));\r\n                % if they are the same, merge\r\n                if isequal(pair1,pair2)\r\n                    Ck = [Ck; 
union(freqSets(i,:),freqSets(j,:))];\r\n                end\r\n            end\r\n        end\r\n    end\r\nend\r\n\r\n<\/pre><p>generateRules() returns the association rules that meet the minimum confidence criteria. It also uses aprioriGen() to generate rule candidates.<\/p><pre class=\"codeinput\">type <span class=\"string\">generateRules.m<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction rules = generateRules(F,S,minConf)\r\n%GENERATERULES returns the association rules found from the frequent\r\n%itemsets based on the minimum confidence threshold |minConf|. \r\n%Association rules are expressed as {ante} =&gt; {conseq}.\r\n%   |F| is a structure array of frequent itemsets\r\n%   |S| is a Map object that maps itemsets to their support values\r\n%   |rules| is a structure array of association rules that meet |minConf|\r\n%   criteria, along with confidence, lift and support metrics. \r\n%\r\n%   To learn more about the underlying algorithm itself, please consult\r\n%   Ch6 of http:\/\/www-users.cs.umn.edu\/~kumar\/dmbook\/index.php \r\n\r\n    rules = struct('Ante',{},'Conseq',{},'Conf',{},'Lift',{},'Sup',{});\r\n    % iterate over itemset levels |k| where k &gt;= 2\r\n    for k = 2:length(F)\r\n        % iterate over k-itemsets \r\n        for n = 2:size(F(k).freqSets,1)\r\n            % get one k-itemset\r\n            freqSet = F(k).freqSets(n,:);\r\n            % get 1-item consequents from the k-itemset\r\n            H1 = freqSet';\r\n            % if the itemset contains more than 2 items\r\n            if k &gt; 2\r\n                % go to ap_genrules()\r\n                rules = ap_genrules(freqSet,H1,S,rules,minConf);\r\n            else\r\n                % go to pruneRules()\r\n                [~,rules] = pruneRules(freqSet,H1,S,rules,minConf);\r\n            end\r\n        end\r\n    end\r\n    \r\n    function rules = ap_genrules(Fk,H,S,rules,minConf)\r\n    %AP_GENRULES generates candidate rules from rule consequents\r\n    %   |Fk| 
is a row vector representing a frequent itemset\r\n    %   |H| is a matrix that contains rule consequents, one per row\r\n    %   |S| is a Map object that stores support values\r\n    %   |rules| is a structure array that stores the rules\r\n    %   |minConf| is the threshold to prune the rules\r\n    \r\n        m = size(H,2); % size of rule consequent\r\n        % if frequent itemset is longer than consequent by more than 1\r\n        if length(Fk) &gt; m+1\r\n            % prune 1-item consequents by |minConf|\r\n            if m == 1\r\n                [~,rules] = pruneRules(Fk,H,S,rules,minConf);\r\n            end\r\n            % use aprioriGen to generate longer consequents\r\n            Hm1 = aprioriGen(H,m+1);\r\n            % prune consequents by |minConf|\r\n            [Hm1,rules] = pruneRules(Fk,Hm1,S,rules,minConf);\r\n            % if we have consequents that meet the criteria, recurse until\r\n            % we run out of such candidates\r\n            if ~isempty(Hm1)\r\n                rules = ap_genrules(Fk,Hm1,S,rules,minConf);\r\n            end\r\n        end\r\n    end\r\n\r\n    function [prunedH,rules] = pruneRules(Fk,H,S,rules,minConf)\r\n    %PRUNERULES calculates confidence of given rules and drops rule\r\n    %candidates that don't meet the |minConf| threshold\r\n    %   |Fk| is a row vector representing a frequent itemset\r\n    %   |H| is a matrix that contains rule consequents, one per row\r\n    %   |S| is a Map object that stores support values\r\n    %   |rules| is a structure array that stores the rules\r\n    %   |minConf| is the threshold to prune the rules\r\n    %   |prunedH| is a matrix of consequents that met |minConf|\r\n        \r\n        % initialize a return variable\r\n        prunedH = [];\r\n        % iterate over the rows of H\r\n        for i = 1:size(H,1)\r\n            % a row in H is a conseq\r\n            conseq = H(i,:);\r\n            % ante = Fk - conseq \r\n            ante = setdiff(Fk, conseq);\r\n            
% retrieve support for Fk\r\n            supFk =S(num2str(Fk));\r\n            % retrieve support for ante\r\n            supAnte =S(num2str(ante));\r\n            % retrieve support for conseq\r\n            supConseq =S(num2str(conseq));\r\n            % calculate confidence\r\n            conf = supFk \/ supAnte;\r\n            % calculate lift\r\n            lift = supFk\/(supAnte*supConseq);\r\n            \r\n            % if the threshold is met\r\n            if conf &gt;= minConf\r\n                % append the conseq to prunedH\r\n                prunedH = [prunedH, conseq];\r\n                % generate a rule\r\n                rule = struct('Ante',ante,'Conseq',conseq,...\r\n                    'Conf',conf,'Lift',lift,'Sup',supFk);\r\n                % append the rule\r\n                if isempty(rules)\r\n                    rules = rule;\r\n                else\r\n                    rules = [rules, rule];\r\n                end             \r\n            end\r\n        end\r\n    end\r\n\r\nend\r\n\r\n<\/pre><p>getData is a script that fetches the congressional voting data using the RESTful API access function webread() that was introduced in R2014b, and preprocesses the data for Market Basket Analysis.<\/p><pre class=\"codeinput\">type <span class=\"string\">getData.m<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n% Let's load the dataset\r\n\r\nclearvars; close all; clc\r\n\r\n% the URL of the source data\r\nurl = 'https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/voting-records\/house-votes-84.data';\r\n\r\n% define the function handle to process the data you read from the web\r\nmyreadtable = @(filename)readtable(filename,'FileType','text','Delimiter',',','ReadVariableNames',false);\r\n% set up the function handle as |ContentReader| for |webread| function\r\noptions = weboptions('ContentReader',myreadtable);\r\n% get the data\r\nvotes = webread(url,options);\r\n% get the variable names\r\nvotes.Properties.VariableNames = 
{'party',...\r\n    'handicapped_infants',...\r\n    'water_project_cost_sharing',...\r\n    'adoption_of_the_budget_resolution',...\r\n    'physician_fee_freeze',...\r\n    'el_salvador_aid',...\r\n    'religious_groups_in_schools',...\r\n    'anti_satellite_test_ban',...\r\n    'aid_to_nicaraguan_contras',...\r\n    'mx_missile',...\r\n    'immigration',...\r\n    'synfuels_corporation_cutback',...\r\n    'education_spending',...\r\n    'superfund_right_to_sue',...\r\n    'crime',...\r\n    'duty_free_exports',...\r\n    'export_administration_act_south_africa'};\r\n\r\n% display the small portion of the data\r\ndisp(votes(1:5,1:4))\r\n\r\n% turn table into matrix\r\nC = table2array(votes);\r\n% create a logical array of yes votes\r\narr = strcmp(C(:,2:end), 'y');\r\n% append logical array of no votes\r\narr = [arr, strcmp(C(:,2:end), 'n')];\r\n% create a logical arrays of party affiliation\r\ndem = strcmp(C(:,1),'democrat');\r\nrep = strcmp(C(:,1),'republican');\r\n% combine them all\r\narr = [dem,rep,arr];\r\n\r\n% create cell to hold the votes of each member of congress\r\nvotes = cell(size(dem));\r\n% for each member, find the indices of the votes\r\nfor i = 1:length(dem)\r\n    votes{i} = find(arr(i,:));\r\nend\r\n\r\n% variable names that maps to the indices\r\nvars = {'Democrat',...\r\n    'Republican',...\r\n    'Handicapped Infants = Yes',...\r\n    'Water Project = Yes',...\r\n    'Budget Resolution = Yes',...\r\n    'Physician Fee Freeze = Yes',...\r\n    'El Salvador = Yes',...\r\n    'Religious Groups = Yes',...\r\n    'Anti Satellite = Yes',...\r\n    'Nicaraguan Contras = Yes',...\r\n    'Mx Missile = Yes',...\r\n    'Immigration = Yes',...\r\n    'Synfuels Corp = Yes',...\r\n    'Education Spending = Yes',...\r\n    'Right To Sue = Yes',...\r\n    'Crime = Yes',...\r\n    'Duty Free = Yes',...\r\n    'South Africa = Yes',...\r\n    'Handicapped Infants = No'...,\r\n    'Water Project = No',...\r\n    'Budget Resolution = No',...\r\n    'Physician 
Fee Freeze = No',...\r\n    'El Salvador = No',...\r\n    'Religious Groups = No',...\r\n    'Anti Satellite = No',...\r\n    'Nicaraguan Contras = No',...\r\n    'Mx Missile = No',...\r\n    'Immigration = No',...\r\n    'Synfuels Corp = No',...\r\n    'Education Spending = No',...\r\n    'Right To Sue = No',...\r\n    'Crime = No',...\r\n    'Duty Free = No',...\r\n    'South Africa = No'};\r\n<\/pre><script language=\"JavaScript\"> <!-- \r\n    function grabCode_0396ce612a234f948ca47e69f2db3a54() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='0396ce612a234f948ca47e69f2db3a54 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 0396ce612a234f948ca47e69f2db3a54';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2015 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_0396ce612a234f948ca47e69f2db3a54()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2014b<br><\/p><\/div><!--\r\n0396ce612a234f948ca47e69f2db3a54 ##### SOURCE BEGIN #####\r\n%% Introduction to Market Basket Analysis\r\n% You probably heard about the \"beer and diapers\" story as the often\r\n% quoted example of what data mining can achieve. 
It goes like this: some\r\n% supermarket placed beer next to diapers and got more business because\r\n% they mined their sales data and found that men often bought those two\r\n% items together.\r\n%\r\n% <<shopping.jpg>>\r\n% \r\n% Today's guest blogger, Toshi Takeuchi, who works in the web marketing\r\n% team here at MathWorks, gives you a quick introduction to how such\r\n% analysis is actually done, and will follow up with how you can scale it\r\n% for larger dataset with\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/mapreduce.html |MapReduce|> (new\r\n% feature in R2014b) in a future post.\r\n%\r\n%% Motivation: \"Introduction to Data Mining\" PDF\r\n% I have been interested in Market Basket Analysis not because I work at a\r\n% supermarket but because it can be used for *web usage pattern mining*\r\n% among many applications. Fortunately, I came across a good introduction\r\n% in <http:\/\/www-users.cs.umn.edu\/~kumar\/dmbook\/ch6.pdf Chapter 6> (sample\r\n% chapter available for free download) of\r\n% <http:\/\/www-users.cs.umn.edu\/~kumar\/dmbook\/index.php Introduction to Data\r\n% Mining>. I obviously had to give it a try with MATLAB!\r\n% \r\n% Let's start by loading the example dataset used in the textbook. 
Raw\r\n% transaction data is turned into a binary matrix representation with 1\r\n% meaning the given item was in the given transaction, but otherwise 0.\r\n\r\ntransactions = {{'Bread','Milk'};...\r\n    {'Bread','Diapers','Beer','Eggs'};...\r\n    {'Milk','Diapers','Beer','Cola'};...\r\n    {'Bread','Milk','Diapers','Beer'};...\r\n    {'Bread','Milk','Diapers','Cola'}};\r\n\r\nitems = unique([transactions{:}]);\r\nT = zeros(size(transactions,1), length(items));\r\nfor i = 1:size(transactions,1)\r\n    T(i,ismember(items,transactions{i,:})) = 1;\r\nend\r\ndisp(array2table(T,'VariableNames',items,'RowNames',{'T1','T2','T3','T4','T5'}))\r\n\r\n%% Itemset and Support\r\n% Market Basket Analysis is also known as Association Analysis or Frequent\r\n% Itemset Mining. Let's go over some basic concepts: itemsets, support,\r\n% confidence, and lift.\r\n%\r\n% The column headers of the table show all the items in this tiny dataset.\r\n% A subset of those items in any combination is an *itemset*. An itemset\r\n% can contain anywhere from zero items to all the items in the dataset.\r\n%\r\n%\r\n%  0 items  REPLACE_WITH_DASH_DASH> null itemset (usually ignored) \r\n%  1 item   REPLACE_WITH_DASH_DASH> 1-itemset \r\n%  2 items  REPLACE_WITH_DASH_DASH> 2-itemset\r\n%        :\r\n%  k items  REPLACE_WITH_DASH_DASH> k-itemset\r\n% \r\n% This is, for example, a 3-itemset:\r\n%\r\n%  {Beer, Diapers, Milk}\r\n%\r\n% A transaction can contain multiple itemsets as its subsets. For example,\r\n% an itemset |{Bread, Diapers}| is contained in T2, but not |{Bread,\r\n% Milk}|.\r\n%\r\n% *Support count* is the count of how often a given itemset appears across\r\n% all the transactions. 
Frequency of its appearance is given by a metric\r\n% called *support*.\r\n%\r\n%  Support = itemset support count \/ number of transactions\r\n%\r\n\r\nitemset = {'Beer','Diapers','Milk'}\r\n% get col indices of items in itemset\r\ncols = ismember(items,itemset); \r\nN = size(T,1); fprintf('Number of transactions = %d\\n',N)\r\n% count rows that include all items\r\nsupportCount = sum(all(T(:,cols),2)); \r\nfprintf('Support count for this itemset = %d\\n',supportCount)\r\nitemSetSupport = supportCount\/size(T,1);\r\nfprintf('Support = %.2f\\n',itemSetSupport)\r\n\r\n%% Association Rules, Confidence and Lift\r\n% Association rules are made up of antecedents (ante) and consequents\r\n% (conseq) and take the following form:\r\n%\r\n%  {ante} => {conseq}\r\n%\r\n% A k-itemset where k > 1 can be arbitrarily divided into ante and conseq to\r\n% form such a rule. Here is an example:\r\n\r\nante = {'Diapers','Milk'}; % pick some subset of the itemset as |ante|\r\n% get items not in |ante|\r\nconseq = setdiff(itemset,ante); \r\nfprintf('Itemset: {%s, %s, %s}\\n', itemset{:})\r\nfprintf('Ante   : {%s, %s}\\n',ante{:})\r\nfprintf('Conseq : {%s}\\n',conseq{:})\r\nfprintf('Rule   : {%s, %s} => {%s}\\n', ante{:},conseq{:})\r\n\r\n%%\r\n% You can think of this rule as \"when diapers and milk appear in the same\r\n% transaction, you also see beer in the same transaction at some\r\n% frequency\". How do you determine which arbitrarily generated association\r\n% rules are actually meaningful?\r\n%\r\n% The most basic measure of rules is *confidence*, which tells us how often\r\n% a given rule applies within the transactions that contain the ante.\r\n% \r\n% A given rule applies when all items from both antes and conseqs are\r\n% present in a transaction, so it is the same thing as an itemset that\r\n% contains the same items. 
We can use the support metric for the itemset to\r\n% compute confidence of the rule.\r\n%\r\n%  Confidence = itemset support \/ ante support\r\n%\r\n\r\ncols = ismember(items,ante);\r\nanteCount = sum(all(T(:,cols),2));\r\nfprintf('Support Count for Ante = %d\\n',anteCount)\r\nanteSupport = anteCount\/N;\r\nfprintf('Support for Ante = %.2f\\n',anteSupport)\r\nconfidence = itemSetSupport\/anteSupport;\r\nfprintf('Confidence = %.2f (= itemset support \/ ante support)\\n',confidence)\r\n\r\n%%\r\n% Another measure of rules is *lift*. It compares the observed frequency of\r\n% ante and conseq appearing together against the frequency expected if they\r\n% occurred independently, as a measure of interestingness. If lift is 1, then\r\n% ante and conseq occur together just as often as expected by chance\r\n% and there is no special relationship. If it is larger than 1, then lift\r\n% tells us how strongly ante and conseq depend on each other. We can\r\n% use the respective support metrics to make this comparison.\r\n%\r\n%  Lift = itemset support \/ (ante support x conseq support)\r\n%\r\n\r\ncols = ismember(items,conseq);\r\nconseqCount = sum(all(T(:,cols),2));\r\nfprintf('Support Count for Conseq = %d\\n',conseqCount)\r\nconseqSupport = conseqCount\/N;\r\nfprintf('Support for Conseq = %.2f\\n',conseqSupport)\r\nlift = itemSetSupport\/(anteSupport*conseqSupport);\r\nfprintf('Lift = %.2f (= itemset support \/ (ante support x conseq support))\\n',lift)\r\n\r\n%% Apriori Algorithm\r\n% Now that we know the basic concepts, we can define the goal of our\r\n% analysis: finding association rules with a sufficient level of support\r\n% (happens often enough) and confidence (association is strong). 
Lift can\r\n% be a secondary measure to score the interestingness of the rules found.\r\n% \r\n% # Generate frequent itemsets that clear the minimum support threshold\r\n% recursively from 1-itemsets to higher level itemsets, pruning candidates\r\n% along the way\r\n% # Generate rules that clear the minimum confidence threshold in a similar\r\n% way\r\n% \r\n% In a brute force method, you would calculate the support and confidence\r\n% of all possible itemset combinations, but that would be computationally\r\n% expensive, because the number of candidates grows exponentially.\r\n%\r\n% The apriori algorithm addresses this issue by generating candidates\r\n% selectively. To get an intuition, think about the frequency of an itemset\r\n% that contains some infrequent items. That itemset will never be more\r\n% frequent than the least frequent item it contains. So if you construct\r\n% your candidates by combining the frequent itemsets only, starting from\r\n% 1-itemset and continue to higher levels, then you avoid creating useless\r\n% candidates.\r\n%\r\n% Let's start with generating frequent itemsets and get their support\r\n% measures. I wrote a couple of MATLAB functions to implement this\r\n% algorithm.\r\n\r\nminSup = 0.6; % minimum support threshold 0.6\r\n[F,S] = findFreqItemsets(transactions,minSup);\r\nfprintf('Minimum Support        : %.2f\\n', minSup)\r\nfprintf('Frequent Itemsets Found: %d\\n', sum(arrayfun(@(x) length(x.freqSets), F)))\r\nfprintf('Max Level Reached      : %d-itemsets\\n', length(F))\r\nfprintf('Number of Support Data : %d\\n', length(S))\r\n\r\n%%\r\n% When we computed support for each itemset we evaluated, we stored the\r\n% result in a\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/overview-of-the-map-data-structure.html\r\n% Map object> |S|. This is used for rule generation in order to avoid\r\n% recalculating support as part of the confidence computation. 
You can now\r\n% retrieve support for a given itemset with this object, by supplying the\r\n% string representation of the itemset as the key. Let's try [2,4,6]:\r\n\r\nitemset = [2,4,6];\r\nfprintf('Support for the itemset {%s %s %s}: %.2f\\n',items{itemset(:)},S(num2str(itemset))) \r\n\r\n%%\r\n% This itemset clearly didn't meet the minimum support criterion of 0.60.\r\n%\r\n%% Rule Generation Algorithm\r\n% We saw earlier that you can generate rule candidates from frequent\r\n% itemsets by splitting their contents into antes and conseqs, and\r\n% computing their confidence.\r\n%\r\n% If we generated every possible candidate by such a brute-force method, it\r\n% would be very time-consuming. The Apriori algorithm is also used to generate\r\n% rules selectively. Let's say that this rule has low confidence:\r\n%\r\n%  {Beer, Diapers} => {Milk}\r\n%\r\n% Then any other rules generated from this itemset that contain {Milk} in\r\n% the rule consequent will have low confidence.\r\n%\r\n%  {Beer} => {Diapers, Milk}\r\n%  {Diapers} => {Beer, Milk}\r\n%\r\n% Why? 
Because support for those antes will always be at least as great as that of\r\n% the initial ante {Beer, Diapers}, while the support for the itemset (hence\r\n% also for the rule) remains the same, and confidence is based on the ratio\r\n% of support between the rule and the ante.\r\n%\r\n% We can take advantage of this intuition by first generating rules with\r\n% only one item in the conseq, dropping those that don't meet the minimum\r\n% criteria, and then merging those conseqs to generate rules with two items\r\n% in the conseq, and so forth.\r\n%\r\n% Now we can generate association rules from the frequent itemsets we\r\n% generated in the previous step.\r\n\r\nminConf = 0.8; % minimum confidence threshold 0.8\r\nrules = generateRules(F,S,minConf);\r\nfprintf('Minimum Confidence     : %.2f\\n', minConf)\r\nfprintf('Rules Found            : %d\\n\\n', length(rules))\r\n\r\nfor i = 1:length(rules)\r\n    disp([sprintf('{%s}',items{rules(i).Ante}),' => ',...\r\n        sprintf('{%s}', items{rules(i).Conseq}),...\r\n        sprintf('     Conf: %.2f ',rules(i).Conf),...\r\n        sprintf('Lift: %.2f ',rules(i).Lift),...\r\n        sprintf('Sup: %.2f',rules(i).Sup)])\r\nend\r\n\r\n%%\r\n% With a minimum support of 0.6 and a minimum confidence of 0.8, we found only\r\n% one rule that clears those thresholds: {Beer} => {Diapers}. The\r\n% confidence is 1.00, which means we see diapers in all transactions that\r\n% include beer, and the lift is 1.25, so this rule is fairly interesting.\r\n%\r\n%% Test Example: Congressional Voting Records\r\n% As a non-supermarket use case of Market Basket Analysis, the textbook\r\n% uses the\r\n% <https:\/\/archive.ics.uci.edu\/ml\/datasets\/Congressional+Voting+Records\r\n% 1984 Congressional Voting Records dataset> from the UCI Machine Learning\r\n% Repository. 
We can test our code against the result in the textbook.\r\n%\r\n% For the purpose of Market Basket Analysis, you can think of individual\r\n% members of congress as transactions and the bills they voted for and their\r\n% party affiliation as items.\r\n\r\ngetData\r\n\r\n%%\r\n% Now we generate frequent itemsets with a minimum support threshold of\r\n% 0.3, but we pass a slightly lower value because support ratios computed in\r\n% double precision can fall just below 0.3.\r\nminSup = 0.2988; \r\n[F,S] = findFreqItemsets(votes,minSup);\r\nfprintf('Minimum Support        : %.2f\\n', minSup)\r\nfprintf('Frequent Itemsets Found: %d\\n', sum(arrayfun(@(x) length(x.freqSets), F)))\r\nfprintf('Max Level Reached      : %d-itemsets\\n', length(F))\r\nfprintf('Number of Support Data : %d\\n', length(S))\r\n\r\n%%\r\n% Then we generate some rules.\r\nminConf = 0.9; % minimum confidence threshold 0.9\r\nrules = generateRules(F,S,minConf);\r\nfprintf('Minimum Confidence     : %.2f\\n', minConf)\r\nfprintf('Rules Found            : %d\\n\\n', length(rules))\r\n\r\n%%\r\n% Finally, let's compare our results with the results from the textbook.\r\n\r\ntestAntes = [7,21,27;5,11,23;6,15,16;22,31,32];\r\ntestConseqs = [2,1,2,1];\r\ntestConf = [0.91,0.975,0.935,1];\r\n\r\n% Get the rules whose 1-item conseq is a party classification\r\ndem = rules(arrayfun(@(x) isequal(x.Conseq,1),rules));\r\nrep = rules(arrayfun(@(x) isequal(x.Conseq,2),rules));\r\n\r\n% Compare the results\r\nfor i = 1:size(testAntes,1)\r\n    rec = dem(arrayfun(@(x) isequal(x.Ante,testAntes(i,:)),dem));\r\n    rec = [rec,rep(arrayfun(@(x) isequal(x.Ante,testAntes(i,:)),rep))];\r\n    disp(['{', sprintf('%s, %s, %s',vars{rec.Ante}),'} => ',...\r\n        sprintf('{%s}', vars{rec.Conseq})])\r\n    fprintf('   Conf: %.2f Lift: %.2f Sup: %.2f\\n',rec.Conf,rec.Lift,rec.Sup)\r\n    if isequal(rec.Conseq,testConseqs(i))\r\n        fprintf('   Correct!  
Expected Conf %.2f\\n\\n',testConf(i))\r\n    end\r\nend\r\n\r\n%% Visualization of Metrics\r\n% MATLAB has strong data visualization capabilities. Let's visualize\r\n% |conf|, |lift| and |support|.\r\n%\r\n\r\nconf = arrayfun(@(x) x.Conf, rules); % get conf as a vector\r\nlift = arrayfun(@(x) x.Lift, rules); % get lift as a vector\r\nsup = arrayfun(@(x) x.Sup, rules); % get support as a vector\r\ncolormap cool\r\nscatter(sup,conf,lift*30, lift, 'filled')\r\nxlabel('Support'); ylabel('Confidence')\r\nt = colorbar('peer',gca);\r\nset(get(t,'ylabel'),'String', 'Lift');\r\ntitle('1984 Congressional Voting Records')\r\n\r\n%%\r\n% You can see that the rules we generated match those in the textbook. The\r\n% plot of support, confidence and lift is useful to identify rules that have\r\n% high support and high confidence (upper right region of the plot) and high\r\n% lift (more magenta). If you type |gname| at the MATLAB prompt, you can\r\n% interactively identify the indices of those points, and hit Enter to end\r\n% the interactive session.\r\n%\r\n% <<gname.png>>\r\n%\r\n\r\n%%\r\n% What was Rule 51? Let's find out.\r\n\r\ndisp('Rule ID = 51')\r\nfprintf('{%s, %s} => {%s}, Conf: %.2f\\n',...\r\n    vars{rules(51).Ante(1)},vars{rules(51).Ante(2)},...\r\n    vars{rules(51).Conseq},rules(51).Conf)\r\n\r\n%% Closing\r\n% We used a small sample dataset in order to learn the basics of Market\r\n% Basket Analysis. However, for real applications such as web usage pattern\r\n% mining, we will typically be dealing with quite a large dataset. In a\r\n% later post, I would like to figure out how to extend what we learned here\r\n% with the help of MapReduce, introduced in R2014b.  Any thoughts\r\n% about your creative use of \"Market Basket Analysis\" with your data? 
Let\r\n% us know <https:\/\/blogs.mathworks.com\/loren\/?p=1089#respond here>.\r\n\r\n%% Appendix: Code used for this post\r\n% Here is the code used for this post.\r\n%\r\n% findFreqItemsets() generates frequent itemsets using the Apriori method.\r\ntype findFreqItemsets.m\r\n\r\n%%\r\n% aprioriGen() is a helper function to findFreqItemsets() and generates\r\n% candidate k-itemsets using the Apriori algorithm.\r\ntype aprioriGen.m\r\n\r\n%%\r\n% generateRules() returns the association rules that meet the minimum\r\n% confidence criteria. It also uses aprioriGen() to generate rule\r\n% candidates.\r\ntype generateRules.m\r\n\r\n%%\r\n% getData is a script that fetches the congressional voting data using the\r\n% RESTful API access function webread() that was introduced in R2014b, and\r\n% preprocesses the data for Market Basket Analysis.\r\n\r\ntype getData.m\r\n##### SOURCE END ##### 0396ce612a234f948ca47e69f2db3a54\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/MarketBasketAnalysis1_01.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>You probably heard about the \"beer and diapers\" story as the often quoted example of what data mining can achieve. It goes like this: some supermarket placed beer next to diapers and got more business because they mined their sales data and found that men often bought those two items together.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/01\/29\/introduction-to-market-basket-analysis\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[63],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1089"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1089"}],"version-history":[{"count":2,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1089\/revisions"}],"predecessor-version":[{"id":1091,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1089\/revisions\/1091"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1089"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1089"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1089"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}