Stuart’s MATLAB Videos

Watch and Learn

Using the MapReduce Technique to Process 500GB of Server Logs

Here I’m using the MapReduce functionality in Parallel Processing Toolbox to process several hundred GBs of server logs from our web site. I want to be able to visualize the counts per minute of certain quantities and also filter the data to look for certain special requests to our website. I start small, getting my algorithm to work with one file first and without parallel processing. But MapReduce lets you write it in a way that it will work on any size and with parallel processing.

It eventually took 50 min to process one day’s worth of data (72GB) and about 14hrs to do 8 days (562GB). I think I’ll profile the small dataset problem to see where its spending the time, but suspect it is all file I/O.

>> sum(minuteResults.totalRequests)
ans =
   1.3388e+09
>> bar(minuteResults.timeMinute ,minuteResults.totalRequests)


Features covered in this video include:

Follow me (@stuartmcgarrity) if you want to be notified via Twitter when I post.


Play the video in full screen mode for a better viewing experience. 

|

Comments

To leave a comment, please click here to sign in to your MathWorks Account or create a new one.