{"id":1105,"date":"2015-02-25T07:52:15","date_gmt":"2015-02-25T12:52:15","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1105"},"modified":"2015-02-05T07:52:53","modified_gmt":"2015-02-05T12:52:53","slug":"scaling-market-basket-analysis-with-mapreduce","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2015\/02\/25\/scaling-market-basket-analysis-with-mapreduce\/","title":{"rendered":"Scaling Market Basket Analysis with MapReduce"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>In <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1089\">an earlier post<\/a>, today's guest blogger Toshi Takeuchi gave us an introduction to Market Basket Analysis. This week, he will discuss how to scale this technique using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/mapreduce.html\"><tt>MapReduce<\/tt><\/a> to deal with larger data.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/bigdata.jpg\" alt=\"\"> <\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#6a746fb2-eeac-44e5-9e85-b2b5a6bc6615\">MapReduce in MATLAB 101<\/a><\/li><li><a href=\"#7e319c76-7e1a-4b01-a997-74c30ec2d1b9\">Step 1: Group items by transaction<\/a><\/li><li><a href=\"#c8b4c663-b14b-463f-917c-5b2f51de7697\">Step 2: Generate 1-itemsets<\/a><\/li><li><a href=\"#a397e0fe-00b4-40dd-8802-a61a15379e0a\">Generate Frequent Itemsets<\/a><\/li><li><a href=\"#29766426-1a44-4c90-ba20-e1509e7748a2\">Generate Rules<\/a><\/li><li><a href=\"#f4642a9d-8674-464d-99cb-24550062ccac\">Visualize rules by support, confidence, and lift<\/a><\/li><li><a href=\"#d7a955ea-f12a-48e1-8122-6b5fc9637a68\">Microsoft Web Data<\/a><\/li><li><a href=\"#fee2448a-e22c-41fd-8dcd-34e50a4ba3af\">Summary and the Next Step<\/a><\/li><li><a href=\"#03c6f8ec-e3d9-46ba-84f0-423d64dcf801\">Your thoughts?<\/a><\/li><li><a href=\"#72caec6c-cd60-4be7-b2bb-3f955947faba\">Appendix - Process Microsoft Web 
Data<\/a><\/li><\/ul><\/div><h4>MapReduce in MATLAB 101<a name=\"6a746fb2-eeac-44e5-9e85-b2b5a6bc6615\"><\/a><\/h4><p>R2014b was a major update to MATLAB core functionality, and one of the new features that excited me most was MapReduce. I was primarily interested in Market Basket Analysis to analyze clickstream data, and I knew web usage data extracted from web server logs would be very large.<\/p><p>MapReduce was developed to process massive datasets in a distributed parallel computation, and it is one of the key technologies that enabled Big Data analytics.<\/p><p>MapReduce is made up of mappers and reducers. Mappers read data from file storage one chunk at a time and parse it to generate key-value pairs. Reducers then receive those key-value pairs grouped by key and process the values associated with each key. Therefore, what you need to do is:<\/p><div><ol><li>Use <tt>datastore<\/tt> to designate data sources<\/li><li>Define mapper and reducer functions<\/li><li>Use <tt>mapreduce<\/tt> with the <tt>datastore<\/tt>, mapper, and reducer to process data<\/li><li>Store the processed data for further analysis<\/li><\/ol><\/div><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/mapreduce.png\" alt=\"\"> <\/p><p>Though mappers and reducers perform fairly simple operations, you can chain them together to handle more complex operations. In the Apriori algorithm, the most time-consuming steps are the generation of transactions and 1-itemset data. So let's use MapReduce to address these bottlenecks.<\/p><p>We start by setting up <tt>datastore<\/tt>. 
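<\/p><p>The four steps above can be sketched as follows. This is only a minimal outline; <tt>data.csv<\/tt>, <tt>myMapper<\/tt>, and <tt>myReducer<\/tt> are hypothetical placeholder names, not files or functions from this post.<\/p><pre class=\"codeinput\">% 1. designate the data source (a placeholder file name)\r\nds = datastore('data.csv');\r\n% 2-3. run the job with your mapper and reducer functions\r\nout_ds = mapreduce(ds, @myMapper, @myReducer);\r\n% 4. read the processed key-value results for further analysis\r\nresult = readall(out_ds);\r\n<\/pre><p>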
In this example, we are going to use a fairly small CSV file on a local drive, but <tt>datastore<\/tt> can<\/p><div><ul><li>read data that is too large to fit in memory on a single computer, or<\/li><li>read files from multiple locations on a cluster, including those on a Hadoop distributed file system (HDFS), with appropriate add-on products.<\/li><\/ul><\/div><p>It is important to start out with a small subset of the data to prototype and test your algorithm before you use it on really big data. MATLAB makes it really easy to prototype your algorithm on your local machine and then scale it up to the cluster or cloud later.<\/p><p>Set up source datastore.<\/p><pre class=\"codeinput\">source_ds = datastore(<span class=\"string\">'sampleData.csv'<\/span>,<span class=\"string\">'ReadVariableNames'<\/span>,false,<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'NumHeaderLines'<\/span>,1,<span class=\"string\">'VariableNames'<\/span>,{<span class=\"string\">'VisitorCookieID'<\/span>,<span class=\"string\">'Page'<\/span>,<span class=\"string\">'Visits'<\/span>});\r\nsource_ds.SelectedVariableNames = {<span class=\"string\">'VisitorCookieID'<\/span>,<span class=\"string\">'Page'<\/span>};\r\n<\/pre><p>The data is a list of Visitor Cookie IDs and the pages associated with the IDs.<\/p><p>Let's review the data.<\/p><pre class=\"codeinput\">disp(preview(source_ds))\r\nreset(source_ds)\r\n<\/pre><pre class=\"codeoutput\">             VisitorCookieID                                                 Page                                    \r\n    __________________________________    ___________________________________________________________________________\r\n    '3821337fdad7a6132253b10b602c4616'    '\/matlabcentral\/answers\/'                                                  \r\n    '3821337fdad7a6132253b10b602c4616'    '\/matlabcentral\/answers\/152931-how-to-translate-the-following-code-from...'\r\n    '3821337fdad7a6132253b10b602c4616'    
'\/matlabcentral\/answers\/153109-number-greater-than-the-largest-positive...'\r\n    '3821337fdad7a6132253b10b602c4616'    '\/help\/matlab\/ref\/realmax.html'                                            \r\n    '3821337fdad7a6132253b10b602c4616'    '\/matlabcentral\/answers\/153201-how-to-evaluate-large-factorials'           \r\n    '3821337fdad7a6132253b10b602c4616'    '\/help\/matlab\/matlab_prog\/floating-point-numbers.html'                     \r\n    '3821337fdad7a6132253b10b602c4616'    '\/matlabcentral\/answers\/contributors\/5560534-tigo\/questions'               \r\n    '3821337fdad7a6132253b10b602c4616'    '\/matlabcentral\/newsreader\/view_thread\/297433'                             \r\n<\/pre><h4>Step 1: Group items by transaction<a name=\"7e319c76-7e1a-4b01-a997-74c30ec2d1b9\"><\/a><\/h4><p>If you think of a visitor as a shopper, you can think of pages visited as items in the shopping cart (a transaction). A visitor can visit the same page multiple times, but such repeated visits are not factored in itemset counting.<\/p><p>One of the important considerations in designing a MapReduce algorithm is to minimize the number of keys you generate. 
For this reason, a good starting point would be to group items by transactions by using <tt>VisitorCookieID<\/tt> as the key, because we have a finite set of visitors but they can visit a larger number of pages.<\/p><pre class=\"codeinput\">type <span class=\"string\">transactionMapper<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction transactionMapper(data, info, intermKVStore)\r\n    tid = data.VisitorCookieID;\r\n    item = data.Page;\r\n    \r\n    % get unique tids\r\n    u_tid = unique(tid);\r\n    \r\n    % iterate over the data chunk to map multiple items to a unique tid\r\n    items = cell(size(u_tid));\r\n    for i = 1:length(tid)\r\n        row = ismember(u_tid,tid{i});\r\n        items{row}{end+1} = item{i};\r\n    end\r\n    % use addmulti to speed up the process\r\n    addmulti(intermKVStore, u_tid, items)\r\nend\r\n<\/pre><p>The mapper will then pass this key-value pair to the reducer.<\/p><pre class=\"codeinput\">type <span class=\"string\">transactionReducer<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction transactionReducer(key, value, outKVStore)\r\n    items = {};\r\n    % concatenate items from different mappers for the same key\r\n    while hasnext(value)\r\n        items  = [items, getnext(value)];\r\n    end\r\n    % eliminate duplicates\r\n    u_items = unique(items);\r\n    % save the data to a key-value store\r\n    add(outKVStore, key, u_items);\r\nend\r\n<\/pre><p>The reducer receives key-value pairs by key, and merges multiple cell arrays with the same key into a single cell array and removes any duplicates. 
We can then store the result in a new <tt>datastore<\/tt>.<\/p><p>Now let's run this job.<\/p><p>Group items by transaction.<\/p><pre class=\"codeinput\">transaction_ds = mapreduce(source_ds, @transactionMapper, @transactionReducer);\r\ntransactions = readall(transaction_ds);\r\n\r\ndisp(transactions(1:5,:))\r\n<\/pre><pre class=\"codeoutput\">Parallel mapreduce execution on the local cluster:\r\n********************************\r\n*      MAPREDUCE PROGRESS      *\r\n********************************\r\nMap   0% Reduce   0%\r\nMap 100% Reduce  50%\r\nMap 100% Reduce 100%\r\n                   Key                       Value   \r\n    __________________________________    ___________\r\n    '00927996b5566e6347e55463dc7c6273'    {1x16 cell}\r\n    '01c5f379d2e475d727eec2c711fb83f8'    {1x11 cell}\r\n    '0717615157c000c4955c21b958b7866d'    {1x1  cell}\r\n    '0f29597bff2a611013a4333f027b4f1a'    {1x12 cell}\r\n    '13027f74a9c7402bf7b8f699b557815f'    {1x12 cell}\r\n<\/pre><h4>Step 2: Generate 1-itemsets<a name=\"c8b4c663-b14b-463f-917c-5b2f51de7697\"><\/a><\/h4><p>We now know that all items in a transaction are unique. All we need to do is to count the number of times an item appears in the transactions to see how many transactions contained that item. 
The pages were stored as a value in the previous step, so we simply need to retrieve just those and count their contents.<\/p><pre class=\"codeinput\">type <span class=\"string\">oneItemsetMapper<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction oneItemsetMapper(data, info, intermKVStore)\r\n    % keys are in a cell array\r\n    keys = data.Value{1};\r\n    % create a cell array of 1's\r\n    values = num2cell(ones(size(keys)));\r\n    % save the data to a key-value store\r\n    addmulti(intermKVStore, keys, values)\r\nend\r\n<\/pre><p>The mapper passes each instance of a page as 1.<\/p><pre class=\"codeinput\">type <span class=\"string\">oneItemsetReducer<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction oneItemsetReducer(key, value, outKVStore)\r\n    count = 0;\r\n    while hasnext(value)\r\n        count = count + getnext(value);\r\n    end\r\n    add(outKVStore, key, count);\r\nend\r\n<\/pre><p>The reducer then collects the counts of a given page and sums them. Now let's run this job and read the completed results into memory.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Get 1-itemsets<\/span>\r\noneItemset_ds = mapreduce(transaction_ds, @oneItemsetMapper, @oneItemsetReducer);\r\n<span class=\"comment\">% Read the result into memory<\/span>\r\noneItemsets = readall(oneItemset_ds);\r\n\r\ndisp(oneItemsets(655:659,:))\r\n<\/pre><pre class=\"codeoutput\">Parallel mapreduce execution on the local cluster:\r\n********************************\r\n*      MAPREDUCE PROGRESS      *\r\n********************************\r\nMap   0% Reduce   0%\r\nMap  50% Reduce   0%\r\nMap 100% Reduce  50%\r\nMap 100% Reduce 100%\r\n                                   Key                                    Value\r\n    __________________________________________________________________    _____\r\n    '\/loren\/'                                                             [2]  \r\n    '\/loren\/2006\/07\/05\/when-is-a-numeric-result-not-a-number\/'            [1]  
\r\n    '\/loren\/2009\/10\/02\/using-parfor-loops-getting-up-and-running\/'        [1]  \r\n    '\/loren\/2011\/11\/14\/generating-c-code-from-your-matlab-algorithms\/'    [1]  \r\n    '\/loren\/2012\/02\/06\/using-gpus-in-matlab\/'                             [1]  \r\n<\/pre><h4>Generate Frequent Itemsets<a name=\"a397e0fe-00b4-40dd-8802-a61a15379e0a\"><\/a><\/h4><p>Now we are ready to feed the transactions and oneItemsets data to <tt>findFreqItemsets<\/tt>, which also accepts a table of 1-itemsets as an optional input. The code is available if you go to <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1089\">the earlier post<\/a>.<\/p><p>Let's generate frequent itemsets based on a minimum support threshold 0.02, which means we see the same pattern among at least 2% of the visitors.<\/p><pre class=\"codeinput\">minSup = 0.02;\r\nfprintf(<span class=\"string\">'Processing dataset with minimum support threshold = %.2f\\n...\\n'<\/span>, minSup)\r\n[F,S,items] = findFreqItemsets(transactions,minSup,oneItemsets);\r\nfprintf(<span class=\"string\">'Frequent Itemsets Found: %d\\n'<\/span>, sum(arrayfun(@(x) size(x.freqSets,1), F)))\r\nfprintf(<span class=\"string\">'Max Level              : k = %d\\n'<\/span>, length(F))\r\nfprintf(<span class=\"string\">'Number of Support Data : %d\\n\\n'<\/span>, length(S))\r\n<\/pre><pre class=\"codeoutput\">Processing dataset with minimum support threshold = 0.02\r\n...\r\nFrequent Itemsets Found: 151\r\nMax Level              : k = 4\r\nNumber of Support Data : 4107\r\n\r\n<\/pre><h4>Generate Rules<a name=\"29766426-1a44-4c90-ba20-e1509e7748a2\"><\/a><\/h4><p>This step is no different from the example in the earlier post. 
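<\/p><p>As a quick reminder, the three metrics reported for each rule follow the standard definitions from the earlier post. For a rule X =&gt; Y:<\/p><pre>sup(X =&gt; Y)  = fraction of transactions that contain both X and Y\r\nconf(X =&gt; Y) = sup(X =&gt; Y) \/ sup(X)\r\nlift(X =&gt; Y) = conf(X =&gt; Y) \/ sup(Y)   (lift &gt; 1: X and Y appear together more often than chance)<\/pre><p>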
Let's use a minimum confidence threshold of 0.6.<\/p><pre class=\"codeinput\">minConf = 0.6;\r\nrules = generateRules(F,S,minConf);\r\nfprintf(<span class=\"string\">'Minimum Confidence     : %.2f\\n'<\/span>, minConf)\r\nfprintf(<span class=\"string\">'Rules Found            : %d\\n\\n'<\/span>, length(rules))\r\n<\/pre><pre class=\"codeoutput\">Minimum Confidence     : 0.60\r\nRules Found            : 99\r\n\r\n<\/pre><h4>Visualize rules by support, confidence, and lift<a name=\"f4642a9d-8674-464d-99cb-24550062ccac\"><\/a><\/h4><p>Now we can visualize the rules we generated. This sample dataset is very tiny, and the number of rules is limited.<\/p><pre class=\"codeinput\">conf = arrayfun(@(x) x.Conf, rules); <span class=\"comment\">% get conf as a vector<\/span>\r\nlift = arrayfun(@(x) x.Lift, rules); <span class=\"comment\">% get lift as a vector<\/span>\r\nsup = arrayfun(@(x) x.Sup, rules); <span class=\"comment\">% get support as a vector<\/span>\r\ncolormap <span class=\"string\">cool<\/span>\r\nscatter(sup,conf,lift*5, lift, <span class=\"string\">'filled'<\/span>)\r\nxlabel(<span class=\"string\">'Support'<\/span>); ylabel(<span class=\"string\">'Confidence'<\/span>)\r\nt = colorbar(<span class=\"string\">'peer'<\/span>,gca);\r\nset(get(t,<span class=\"string\">'ylabel'<\/span>),<span class=\"string\">'String'<\/span>, <span class=\"string\">'Lift'<\/span>);\r\ntitle(<span class=\"string\">'Sample Data'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/MarketBasketAnalysis2LS2_01.png\" alt=\"\"> <h4>Microsoft Web Data<a name=\"d7a955ea-f12a-48e1-8122-6b5fc9637a68\"><\/a><\/h4><p>Unfortunately, we can&#8217;t share this sample data for you to try, but it is easy to adapt this code to a publicly available dataset such as the <a href=\"http:\/\/archive.ics.uci.edu\/ml\/datasets\/Anonymous+Microsoft+Web+Data\">Anonymous Microsoft Web Data Data Set<\/a> from the UCI Machine Learning Repository. 
This data is preprocessed from raw clickstream logs, and we need to process it back to the raw format we used in the sample data. You can see the code used to process this dataset below.<\/p><p>When you apply Market Basket Analysis to this dataset, you should get something like this.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/microsoftweb.png\" alt=\"\"> <\/p><pre>Rule #1\r\n{Windows 95} =&gt; {Windows Family of OSs}\r\nConf: 0.91, Lift: 6.46<\/pre><pre>Rule #2\r\n{Windows95 Support} =&gt; {isapi}\r\nConf: 0.84, Lift: 5.16<\/pre><pre>Rule #3\r\n{SiteBuilder Network Membership} =&gt; {Internet Site Construction for Developers}\r\nConf: 0.80, Lift: 8.17<\/pre><pre>Rule #5\r\n{Knowledge Base, isapi} =&gt; {Support Desktop}\r\nConf: 0.71, Lift: 5.21<\/pre><pre>Rule #18\r\n{Windows Family of OSs, isapi} =&gt; {Windows95 Support}\r\nConf: 0.64, Lift: 11.66<\/pre><p>The sample code also includes MapReduce steps on a reduced dataset. I reduced the dataset because my MapReduce code is not optimal for this dataset and runs extremely slowly. My code assumes that there are more pages than visitors, but we have more visitors than pages in this Microsoft dataset.<\/p><p>If you would like to use MapReduce on the full Microsoft dataset, I strongly recommend that you write your own MapReduce code that is better optimized for this dataset.<\/p><h4>Summary and the Next Step<a name=\"fee2448a-e22c-41fd-8dcd-34e50a4ba3af\"><\/a><\/h4><p>We have now prototyped and tested the algorithm, and I can see whether I am going to get the insight I am looking for with this process. Once I am happy with it, I still need to figure out where to store the larger data on a cluster or cloud. It's more of a business challenge than a technical one, and I am still in the middle of that.<\/p><p>Programming MapReduce in MATLAB is very straightforward. 
The source data was local in this example, but you can also use it for a larger dataset that sits across multiple locations, such as HDFS files. You can prototype your MapReduce algorithm locally, then change the configuration to scale up to a larger dataset.<\/p><p>In this example, MapReduce was used only for the initial data processing, and the rest was still done in memory. As long as the processed result fits in memory, we can use this approach.<\/p><p>However, if the processed data gets larger, then we need to make more use of MapReduce in other steps of the Apriori algorithm.<\/p><p>The key is to adapt the algorithm to parallel processing. In the current incarnation, the candidate pruning stage runs in a single thread.<\/p><p>Instead, we can subdivide the dataset into several chunks and complete the entire process, through rule generation, in each chunk. If we do so, we need to adjust the minimum support to account for the smaller transaction counts in the subsets. Then we can combine the final output. This may not produce exactly the same result as the single-threaded process, but it should be fairly close.<\/p><h4>Your thoughts?<a name=\"03c6f8ec-e3d9-46ba-84f0-423d64dcf801\"><\/a><\/h4><p>Do you see ways in which you might leverage MapReduce to handle larger data analysis projects?  Let us know <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1105#respond\">here<\/a>.<\/p><h4>Appendix - Process Microsoft Web Data<a name=\"72caec6c-cd60-4be7-b2bb-3f955947faba\"><\/a><\/h4><p>Here is the code I used to process the Microsoft dataset.<\/p><pre class=\"codeinput\">type <span class=\"string\">processMicrosoftWebData.m<\/span>\r\n<\/pre><pre class=\"codeoutput\">\r\n%% Let's load the dataset\r\n% The UCI dataset is not in the raw log format, but in a special\r\n% non-tabular ASCII format. We need to process the data into a format we\r\n% can use. 
This is not a typical process you do when you work with actual\r\n% raw clickstream data.\r\n\r\n% clear everything\r\nclearvars; close all; clc\r\n\r\n% the name of the source data file\r\nfilename = 'anonymous-msweb.data';\r\n\r\n% if the file doesn't exist, download it from the UCI website\r\nif exist(filename,'file') == 0\r\n    url = 'http:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/anonymous\/anonymous-msweb.data';\r\n    filepath = websave('anonymous-msweb.data',url);\r\nend\r\n\r\n% accumulators\r\naid = [];   % attribute id\r\nvroot = {}; % vroot - page name\r\nurl = {};   % url of the page\r\nvid = [];   % visitor id\r\nvisits = [];  % aid of vroots visited by a visitor\r\nVisitorCookieID = {};   % same as vid but as string\r\n\r\n% open the file\r\nfid = fopen(filename);\r\n\r\n% read the first line\r\ntline = fgetl(fid);\r\n\r\n% while the line contains a string\r\nwhile ischar(tline)\r\n    % if the line contains an attribute\r\n    if strcmp(tline(1),'A')\r\n        c = textscan(tline,'%*s%d%*d%q%q','Delimiter',',');\r\n        aid = [aid;c{1}];\r\n        vroot = [vroot;c{2}];\r\n        url = [url;c{3}];\r\n    % if the line contains a case\r\n    elseif strcmp(tline(1),'C')\r\n        user = textscan(tline,'%*s%*q%d','Delimiter',',');\r\n    % if the line contains a vote\r\n    elseif strcmp(tline(1),'V')\r\n        vid = [vid;user{1}];\r\n        vote = textscan(tline,'%*s%d%*d','Delimiter',',');\r\n        visits = [visits; vote{1}];\r\n        VisitorCookieID = [VisitorCookieID;['id',num2str(user{1})]];\r\n    end\r\n    \r\n    tline = fgetl(fid);\r\nend\r\n\r\n% close the file\r\nfclose(fid);\r\n\r\n% sort attributes by aid\r\n[~,idx] = sort(aid);\r\naid = aid(idx);\r\nvroot = vroot(idx);\r\nurl = url(idx);\r\n\r\n% populate |Page| with vroot based on aid\r\nPage = cell(size(visits));\r\nfor i = 1:length(aid)\r\n    Page(visits == aid(i)) = vroot(i);\r\nend\r\n\r\n% create table and write it to disk if it doesn't exist\r\ntransactions = 
table(VisitorCookieID,Page);\r\nif exist('msweb.transactions.csv','file') == 0\r\n    % just keep the first 318 rows to use as sample data\r\n    transactions(319:end,:) = []; % comment out to keep the whole thing\r\n    writetable(transactions,'msweb.transactions.csv')\r\nend\r\n\r\n%% Association Rule Mining without MapReduce\r\n% Since we already have all necessary pieces of data in the workspace, we\r\n% might as well do the analysis now.\r\n\r\n% get unique uid\r\nuniq_uid = unique(vid);\r\n\r\n% create a cell array of visits that contains vroots visited\r\ntransactions = cell(size(uniq_uid));\r\nfor i = 1:length(uniq_uid)\r\n    transactions(i) = {visits(vid == uniq_uid(i))'};\r\nend\r\n\r\n% find frequent itemsets of vroots from the visits\r\nminSup = 0.02;\r\nfprintf('Processing dataset with minimum support threshold = %.2f\\n...\\n', minSup)\r\n[F,S,items] = findFreqItemsets(transactions,minSup);\r\nfprintf('Frequent Itemsets Found: %d\\n', sum(arrayfun(@(x) size(x.freqSets,1), F)))\r\nfprintf('Max Level              : k = %d\\n', length(F))\r\nfprintf('Number of Support Data : %d\\n\\n', length(S))\r\n\r\n% generate rules \r\nminConf = 0.6;\r\nrules = generateRules(F,S,minConf);\r\nfprintf('Minimum Confidence     : %.2f\\n', minConf)\r\nfprintf('Rules Found            : %d\\n\\n', length(rules))\r\n\r\n% plot the rules\r\nconf = arrayfun(@(x) x.Conf, rules); % get conf as a vector\r\nlift = arrayfun(@(x) x.Lift, rules); % get lift as a vector\r\nsup = arrayfun(@(x) x.Sup, rules); % get support as a vector\r\ncolormap cool\r\nscatter(sup,conf,lift*5, lift, 'filled')\r\nxlabel('Support'); ylabel('Confidence')\r\nt = colorbar('peer',gca);\r\nset(get(t,'ylabel'),'String', 'Lift');\r\ntitle('Microsoft Web Data')\r\n\r\n% display the selected rules\r\nselected = [1,2,3,5,18];\r\nfor i = 1:length(selected)\r\n    fprintf('Rule #%d\\n', selected(i))\r\n    lenAnte = length(rules(selected(i)).Ante);\r\n    if lenAnte == 1\r\n        fprintf('{%s} =&gt; 
{%s}\\nConf: %.2f, Lift: %.2f\\n\\n',...\r\n            vroot{rules(selected(i)).Ante(1)},vroot{rules(selected(i)).Conseq},...\r\n            rules(selected(i)).Conf,rules(selected(i)).Lift)\r\n    elseif lenAnte == 2\r\n        fprintf('{%s, %s} =&gt; {%s}\\nConf: %.2f, Lift: %.2f\\n\\n',...\r\n            vroot{rules(selected(i)).Ante(1)},vroot{rules(selected(i)).Ante(2)},...\r\n            vroot{rules(selected(i)).Conseq},rules(selected(i)).Conf,rules(selected(i)).Lift)\r\n    end\r\nend\r\n\r\n%% MapReduce\r\n% My MapReduce code is not well suited for this dataset and runs extremely\r\n% slowly if you use the whole dataset. I will use just a subset for\r\n% demonstration purposes. If you want to try this on the full dataset, you\r\n% should write your own MapReduce code optimized for it.\r\n\r\n% clear everything\r\nclearvars; close all; clc\r\n\r\n% set up source datastore\r\nsource_ds = datastore('msweb.transactions.csv');\r\ndisp(preview(source_ds))\r\n\r\n% step 1: Group items by transaction\r\ntransaction_ds = mapreduce(source_ds, @transactionMapper, @transactionReducer);\r\ntransactions = readall(transaction_ds);\r\n\r\n% step 2: Generate 1-itemsets\r\noneItemset_ds = mapreduce(transaction_ds, @oneItemsetMapper, @oneItemsetReducer);\r\noneItemsets = readall(oneItemset_ds);\r\n<\/pre><script language=\"JavaScript\"> <!-- \r\n    function grabCode_0a3f82c66d894960ba549b219009b740() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='0a3f82c66d894960ba549b219009b740 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 0a3f82c66d894960ba549b219009b740';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        
i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2015 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_0a3f82c66d894960ba549b219009b740()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2014b<br><\/p><\/div><!--\r\n0a3f82c66d894960ba549b219009b740 ##### SOURCE BEGIN #####\r\n%% Scaling Market Basket Analysis with MapReduce\r\n% In <https:\/\/blogs.mathworks.com\/loren\/?p=1089 an earlier post>, today's guest\r\n% blogger Toshi Takeuchi gave us an introduction to Market Basket Analysis.\r\n% This week, he will discuss how to scale this technique using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/mapreduce.html |MapReduce|> to deal\r\n% with larger data.\r\n%\r\n% 
<<bigdata.jpg>>\r\n% \r\n%% MapReduce in MATLAB 101\r\n% R2014b was a major update to MATLAB core functionality, and one of the\r\n% new features that excited me most was MapReduce. I was primarily\r\n% interested in Market Basket Analysis to analyze clickstream data, and I\r\n% knew web usage data extracted from web server logs would be very large.\r\n%\r\n% MapReduce was developed to process massive datasets in a distributed\r\n% parallel computation, and it is one of the key technologies that enabled\r\n% Big Data analytics.\r\n%\r\n% MapReduce is made up of mappers and reducers. Mappers read data from file\r\n% storage one chunk at a time and parse it to generate key-value pairs.\r\n% Reducers then receive those key-value pairs grouped by key and process\r\n% the values associated with each key. Therefore, what you need to do is:\r\n%\r\n% # Use |datastore| to designate data sources\r\n% # Define mapper and reducer functions\r\n% # Use |mapreduce| with the |datastore|, mapper, and reducer to process\r\n% data\r\n% # Store the processed data for further analysis\r\n% \r\n% <<mapreduce.png>>\r\n%\r\n% Though mappers and reducers perform fairly simple operations, you can\r\n% chain them together to handle more complex operations. In the Apriori\r\n% algorithm, the most time-consuming steps are the generation of\r\n% transactions and 1-itemset data. So let's use MapReduce to address these\r\n% bottlenecks.\r\n%\r\n% We start by setting up |datastore|. In this example, we are going to\r\n% use a fairly small CSV file on a local drive, but |datastore| can\r\n%\r\n% * read data that is too large to fit in memory on a single computer,\r\n% or\r\n% * read files from multiple locations on a cluster, including those on a\r\n% Hadoop distributed file system (HDFS), with appropriate add-on products.\r\n% \r\n% It is important to start out with a small subset of the data to prototype\r\n% and test your algorithm before you use it on really big data. 
MATLAB\r\n% makes it really easy to prototype your algorithm on your local machine\r\n% and then scale it up to the cluster or cloud later. \r\n% \r\n%%\r\n% Set up source datastore.\r\nsource_ds = datastore('sampleData.csv','ReadVariableNames',false,...\r\n    'NumHeaderLines',1,'VariableNames',{'VisitorCookieID','Page','Visits'});\r\nsource_ds.SelectedVariableNames = {'VisitorCookieID','Page'};\r\n\r\n%%\r\n% The data is a list of Visitor Cookie IDs and the pages associated with\r\n% the IDs.\r\n%\r\n% Let's review the data.\r\ndisp(preview(source_ds))\r\nreset(source_ds)\r\n\r\n%% Step 1: Group items by transaction\r\n% If you think of a visitor as a shopper, you can think of pages visited\r\n% as items in the shopping cart (a transaction). A visitor can visit the\r\n% same page multiple times, but such repeated visits are not factored in\r\n% itemset counting.\r\n%\r\n% One of the important considerations in designing a MapReduce algorithm is\r\n% to minimize the number of keys you generate. For this reason, a good\r\n% starting point would be to group items by transactions by using\r\n% |VisitorCookieID| as the key, because we have a finite set of visitors\r\n% but they can visit a larger number of pages.\r\n\r\ntype transactionMapper\r\n\r\n%%\r\n% The mapper will then pass this key-value pair to the reducer.\r\n\r\ntype transactionReducer\r\n%%\r\n% The reducer receives key-value pairs by key, and merges multiple cell\r\n% arrays with the same key into a single cell array and removes any\r\n% duplicates. We can then store the result in a new |datastore|.\r\n% \r\n% Now let's run this job. \r\n%\r\n% Group items by transaction.\r\ntransaction_ds = mapreduce(source_ds, @transactionMapper, @transactionReducer);\r\ntransactions = readall(transaction_ds);\r\n\r\ndisp(transactions(1:5,:))\r\n\r\n%% Step 2: Generate 1-itemsets\r\n% We now know that all items in a transaction are unique. 
All we need to do\r\n% is to count the number of times an item appears in the transactions\r\n% to see how many transactions contained that item. The pages were stored\r\n% as a value in the previous step, so we simply need to retrieve just those\r\n% and count their contents.\r\n\r\ntype oneItemsetMapper\r\n%%\r\n% The mapper passes each instance of a page as 1.\r\n \r\ntype oneItemsetReducer\r\n%%\r\n% The reducer then collects the counts of a given page and sums them. Now\r\n% let's run this job and read the completed results into memory.\r\n\r\n% Get 1-itemsets\r\noneItemset_ds = mapreduce(transaction_ds, @oneItemsetMapper, @oneItemsetReducer);\r\n% Read the result into memory\r\noneItemsets = readall(oneItemset_ds);\r\n\r\ndisp(oneItemsets(655:659,:))\r\n\r\n%% Generate Frequent Itemsets\r\n% Now we are ready to feed the transactions and oneItemsets data to\r\n% |findFreqItemsets|, which also accepts a table of 1-itemsets as an\r\n% optional input. The code is available if you go to\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=1089 the earlier post>.\r\n% \r\n% Let's generate frequent itemsets based on a minimum support threshold\r\n% 0.02, which means we see the same pattern among at least 2% of the\r\n% visitors.\r\n\r\nminSup = 0.02;\r\nfprintf('Processing dataset with minimum support threshold = %.2f\\n...\\n', minSup)\r\n[F,S,items] = findFreqItemsets(transactions,minSup,oneItemsets);\r\nfprintf('Frequent Itemsets Found: %d\\n', sum(arrayfun(@(x) size(x.freqSets,1), F)))\r\nfprintf('Max Level              : k = %d\\n', length(F))\r\nfprintf('Number of Support Data : %d\\n\\n', length(S))\r\n\r\n%% Generate Rules\r\n% This step is no different from the example in the earlier post. 
Let's use\r\n% a minimum confidence threshold of 0.6.\r\n\r\nminConf = 0.6;\r\nrules = generateRules(F,S,minConf);\r\nfprintf('Minimum Confidence     : %.2f\\n', minConf)\r\nfprintf('Rules Found            : %d\\n\\n', length(rules))\r\n\r\n%% Visualize rules by support, confidence, and lift\r\n% Now we can visualize the rules we generated. This sample dataset is very\r\n% tiny, and the number of rules is limited. \r\n\r\nconf = arrayfun(@(x) x.Conf, rules); % get conf as a vector\r\nlift = arrayfun(@(x) x.Lift, rules); % get lift as a vector\r\nsup = arrayfun(@(x) x.Sup, rules); % get support as a vector\r\ncolormap cool\r\nscatter(sup,conf,lift*5, lift, 'filled')\r\nxlabel('Support'); ylabel('Confidence')\r\nt = colorbar('peer',gca);\r\nset(get(t,'ylabel'),'String', 'Lift');\r\ntitle('Sample Data')\r\n\r\n%% Microsoft Web Data\r\n% Unfortunately we can't share this sample data for you to try, but it is\r\n% easy to adapt this code using a publicly available dataset such as the\r\n% <http:\/\/archive.ics.uci.edu\/ml\/datasets\/Anonymous+Microsoft+Web+Data\r\n% Anonymous Microsoft Web Data Data Set> from the UCI Machine Learning\r\n% Repository. This data is preprocessed from raw clickstream logs, and we\r\n% need to process it back to the raw format we used in the sample data. You\r\n% can see the code used to process this dataset below. \r\n%\r\n% When you apply Market Basket Analysis to this dataset, you should get\r\n% something like this. 
\r\n%\r\n% <<microsoftweb.png>>\r\n%\r\n%  Rule #1\r\n%  {Windows 95} => {Windows Family of OSs}\r\n%  Conf: 0.91, Lift: 6.46\r\n%\r\n%  Rule #2\r\n%  {Windows95 Support} => {isapi}\r\n%  Conf: 0.84, Lift: 5.16\r\n%  \r\n%  Rule #3\r\n%  {SiteBuilder Network Membership} => {Internet Site Construction for Developers}\r\n%  Conf: 0.80, Lift: 8.17\r\n%\r\n%  Rule #5\r\n%  {Knowledge Base, isapi} => {Support Desktop}\r\n%  Conf: 0.71, Lift: 5.21\r\n%\r\n%  Rule #18\r\n%  {Windows Family of OSs, isapi} => {Windows95 Support}\r\n%  Conf: 0.64, Lift: 11.66\r\n%\r\n% The sample code also includes MapReduce steps on a reduced dataset. I\r\n% reduced the dataset because my MapReduce code is not optimal for this\r\n% dataset and runs extremely slowly. My code assumes that there are more\r\n% pages than visitors, but this Microsoft dataset has more visitors than\r\n% pages.\r\n% \r\n% If you would like to use MapReduce on the full Microsoft dataset, I\r\n% strongly recommend that you write your own MapReduce code that is\r\n% optimized for this dataset. \r\n\r\n%% Summary and the Next Step\r\n% We have now prototyped and tested the algorithm, and I can see whether\r\n% this process will give me the insight I am looking for. Once I am happy\r\n% with it, I still need to figure out where to store the larger data on a\r\n% cluster or cloud. It's more of a business challenge than a technical one,\r\n% and I am still in the middle of that.\r\n%\r\n% Programming MapReduce in MATLAB is very straightforward. The\r\n% source data was local in this example, but you can also use it for a\r\n% larger dataset that sits across multiple locations, such as files on\r\n% HDFS. You can prototype your MapReduce algorithm locally, then change the\r\n% configuration to scale up to a larger dataset.\r\n%\r\n% In this example, MapReduce was used only for the initial data processing,\r\n% and the rest was still done in memory. 
As long as the processed result\r\n% can fit in memory, we can use this approach.\r\n%\r\n% However, if the processed data gets larger, then we need to make more use\r\n% of MapReduce in other steps of the Apriori algorithm. \r\n%\r\n% The key is to adapt the algorithm to parallel processing. In the current\r\n% incarnation, you have a single thread in the candidate pruning stage.\r\n% \r\n% Instead, we subdivide the dataset into several chunks and complete the\r\n% entire process through rule generation in each. If we do so, we need to\r\n% adjust the minimum support to account for the reduced transaction counts\r\n% in the subsets. Then we can combine the final output. This may not\r\n% provide the same result as the single-threaded process, but it should be\r\n% fairly close. \r\n%\r\n%% Your thoughts?\r\n% Do you see ways in which you might leverage MapReduce to handle larger\r\n% data analysis projects? Let us know\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=1105#respond here>.\r\n%\r\n%% Appendix - Process Microsoft Web Data\r\n% Here is the code I used to process the Microsoft dataset.\r\n\r\ntype processMicrosoftWebData.m\r\n\r\n\r\n##### SOURCE END ##### 0a3f82c66d894960ba549b219009b740\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2015\/MarketBasketAnalysis2LS2_01.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>In <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1089\">an earlier post<\/a>, today's guest blogger Toshi Takeuchi gave us an introduction to Market Basket Analysis. This week, he will discuss how to scale this technique using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/mapreduce.html\"><tt>MapReduce<\/tt><\/a> to deal with larger data.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/02\/25\/scaling-market-basket-analysis-with-mapreduce\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[63,61],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1105"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1105"}],"version-history":[{"count":2,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1105\/revisions"}],"predecessor-version":[{"id":1107,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1105\/revisions\/1107"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1105"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1105"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1105"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}