Stuart’s MATLAB Videos

Watch and Learn


Using the MapReduce Technique to Process 500GB of Server Logs

Posted by Stuart McGarrity,

Here I’m using the MapReduce functionality in MATLAB, together with Parallel Computing Toolbox, to process several hundred GB of server logs from our website. I want to visualize per-minute counts of certain quantities and also filter the data to look for certain special requests to our website. I start small, getting my algorithm working on a single file without parallel processing. But MapReduce lets you write the code so that it works on data of any size, with or without parallel processing.
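The shape of that workflow can be sketched as follows. This is a minimal sketch, not the actual code from the video: the log file paths, the timestamp format, and the variable name `Var1` are all assumptions.

```matlab
% Sketch of the mapreduce workflow (hypothetical log format and file names).
% A datastore reads the logs in manageable chunks, so the same code runs
% on one small file or on hundreds of GB, serially or in parallel.
ds = tabularTextDatastore('logs/*.txt', 'ReadVariableNames', false);

result = mapreduce(ds, @logMapper, @logReducer);
minuteResults = readall(result);   % table of per-minute totals

% Mapper: count the requests in each minute within one chunk of records.
function logMapper(data, ~, intermKVStore)
    % Assumes the timestamp is the first variable (hypothetical format).
    t = datetime(data.Var1, 'InputFormat', 'dd/MMM/yyyy:HH:mm:ss');
    m = dateshift(t, 'start', 'minute');
    [keys, ~, idx] = unique(m);
    counts = accumarray(idx, 1);
    addmulti(intermKVStore, cellstr(string(keys)), num2cell(counts));
end

% Reducer: sum the per-chunk counts for one minute across all chunks.
function logReducer(key, intermValIter, outKVStore)
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end
```

With Parallel Computing Toolbox available, calling `mapreducer(gcp)` before `mapreduce` runs the same mapper and reducer on a parallel pool without changing them.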

It eventually took 50 minutes to process one day’s worth of data (72GB) and about 14 hours to do 8 days (562GB). I think I’ll profile the small-dataset case to see where it’s spending the time, but I suspect it is mostly file I/O.
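Checking that suspicion could look roughly like this (a sketch: the file name is hypothetical, and `logMapper`/`logReducer` stand in for whatever map and reduce functions are in use):

```matlab
% Profile a single-file run to see where the time goes.
profile on
ds = tabularTextDatastore('logs/one-day-part1.txt');   % one small file (hypothetical name)
mapreduce(ds, @logMapper, @logReducer);
profile viewer   % compare time in the read/parse calls vs. the mapper itself
```

If the top entries in the profiler report are the datastore read and text-parsing calls rather than the mapper logic, that would confirm the run is I/O-bound.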

>> sum(minuteResults.totalRequests)
ans =
>> bar(minuteResults.timeMinute, minuteResults.totalRequests)

Features covered in this video include:

Follow me (@stuartmcgarrity) if you want to be notified via Twitter when I post.

Play the video in full screen mode for a better viewing experience. 

4 Comments

Francisco replied on : 3 of 4
Very useful, thanks. I guess the same analysis could be done with tall arrays + findgroups() + splitapply() + gather(). If so, what do you think would be the advantages and disadvantages of this alternative workflow in terms of performance, code simplicity, etc.? Thanks, Francisco
Stuart McGarrity replied on : 4 of 4
Hi Francisco, In my opinion, it’s a simplicity-versus-control, or high-level-versus-low-level, trade-off. First of all, not all operations are supported with tall arrays. But if you can express what you need to do with tall arrays, it results in a simpler implementation. With MapReduce you have to do more work to code it, but it gives you more control, which is useful for more complex problems. I did originally try this with tall arrays but found I needed to handle a lot of missing data and edge cases that didn’t fit into the tall-array model. I needed the control of MapReduce.
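For comparison, the tall-array version of the per-minute count that Francisco describes would look roughly like this. It is a sketch under the same assumptions (hypothetical file paths, timestamp in `Var1`), and it only applies when the data fits the tall-array model:

```matlab
% Tall-array alternative: same datastore, deferred evaluation until gather.
ds = tabularTextDatastore('logs/*.txt', 'ReadVariableNames', false);
tt = tall(ds);

t = datetime(tt.Var1, 'InputFormat', 'dd/MMM/yyyy:HH:mm:ss');  % hypothetical timestamp field
m = dateshift(t, 'start', 'minute');

% Group the records by minute and count each group.
[grp, timeMinute] = findgroups(m);
totalRequests = splitapply(@numel, grp, grp);

% gather triggers the deferred computation and pulls the (small) result into memory.
[timeMinute, totalRequests] = gather(timeMinute, totalRequests);
```

The grouping logic that the mapper and reducer spell out explicitly is handled here by findgroups/splitapply, which is the simplicity-versus-control trade-off described above in one picture.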