
Using the MapReduce Technique to Process 500GB of Server Logs

Posted by Stuart McGarrity

Here I’m using the MapReduce functionality in Parallel Computing Toolbox to process several hundred GBs of server logs from our website. I want to visualize the per-minute counts of certain quantities and also filter the data to look for certain special requests to our website. I start small, getting my algorithm to work on one file first, without parallel processing. But MapReduce lets you write the code in a way that works on data of any size and can run with parallel processing.
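
The general shape of a mapreduce job like this one is roughly as follows. This is only a minimal sketch, assuming hypothetical log files with a timestamp column that datetime can parse; the file pattern, variable names, and the countMapper/countReducer functions are illustrative, not the actual code from the video.

% Point a datastore at the log files (file pattern and variable name are assumptions).
ds = tabularTextDatastore('access_log_*.txt', 'SelectedVariableNames', {'Timestamp'});

% Uncomment to run the job on a parallel pool with Parallel Computing Toolbox.
% mapreducer(gcp);

outds = mapreduce(ds, @countMapper, @countReducer);
results = readall(outds);   % table of Key/Value pairs, one row per minute

% Mapper: count the requests falling in each minute of one chunk of data.
function countMapper(data, ~, intermKVStore)
    t = dateshift(datetime(data.Timestamp), 'start', 'minute');
    [minutes, ~, idx] = unique(t);
    addmulti(intermKVStore, cellstr(string(minutes)), num2cell(accumarray(idx, 1)));
end

% Reducer: total the per-chunk counts for one minute key.
function countReducer(key, intermValIter, outKVStore)
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end

The Key/Value pairs returned by readall can then be reshaped into a per-minute table like the minuteResults variable used below; filtering for special requests follows the same pattern with a different mapper.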

It eventually took 50 minutes to process one day’s worth of data (72GB) and about 14 hours to do 8 days (562GB). I think I’ll profile the small-dataset case to see where it’s spending the time, but I suspect it is mostly file I/O.
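
One quick way to check that suspicion is the MATLAB profiler (again just a sketch; processOneDay is a hypothetical stand-in for whatever function runs the single-file version):

profile on
processOneDay;     % run the small, serial case under the profiler
profile viewer     % if the suspicion is right, file I/O calls should dominate the report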

>> sum(minuteResults.totalRequests)
ans =
   1.3388e+09
>> bar(minuteResults.timeMinute, minuteResults.totalRequests)


Follow me (@stuartmcgarrity) if you want to be notified via Twitter when I post.


Play the video in full screen mode for a better viewing experience. 
