Stuart’s MATLAB Videos

Watch and Learn


Using the MapReduce Technique to Process 500GB of Server Logs 2

Posted by Stuart McGarrity,

Here I’m using MATLAB’s mapreduce functionality, together with Parallel Computing Toolbox, to process several hundred GB of server logs from our web site. I want to visualize the per-minute counts of certain quantities and also filter the data to look for certain special requests to our website. I start small, getting my algorithm to work on one file first, without parallel processing. But mapreduce lets you write the code in a way that works on data of any size and with parallel processing.
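The overall pattern can be sketched like this. This is a hedged sketch, not the code from the video: the file pattern 'logs/*.txt', the Time variable, and the mapper/reducer names are my assumptions.

```matlab
% Minimal sketch of the mapreduce approach, assuming the logs have been
% parsed into delimited text files with a Time column. The file pattern,
% variable names, and function names are illustrative, not from the video.
ds = datastore('logs/*.txt');           % a datastore scales to any data size
ds.SelectedVariableNames = {'Time'};    % read only what the mapper needs

result = mapreduce(ds, @countPerMinuteMapper, @countPerMinuteReducer);
counts = readall(result);               % table of Key (minute) / Value (count)

% Mapper: count requests per minute within one chunk of the data
function countPerMinuteMapper(data, ~, intermKVStore)
    mins = dateshift(datetime(data.Time), 'start', 'minute');
    [grp, keys] = findgroups(mins);
    n = splitapply(@numel, mins, grp);  % requests per minute in this chunk
    addmulti(intermKVStore, cellstr(string(keys)), num2cell(n));
end

% Reducer: sum the partial counts for one minute across all chunks
function countPerMinuteReducer(key, intermValIter, outKVStore)
    total = 0;
    while hasnext(intermValIter)
        total = total + getnext(intermValIter);
    end
    add(outKVStore, key, total);
end
```

Because the mapper only ever sees one chunk at a time, the same code runs unchanged whether the datastore points at one file or at hundreds of GB, and a parallel pool (with Parallel Computing Toolbox) can run the mappers concurrently.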

It eventually took 50 minutes to process one day’s worth of data (72 GB) and about 14 hours to do 8 days (562 GB). I think I’ll profile the small-dataset run to see where it’s spending the time, but I suspect it is mostly file I/O.
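Profiling the single-file run is straightforward with MATLAB’s profiler. A sketch, where processOneFile stands in for the single-file version of the algorithm (a hypothetical name):

```matlab
profile on
minuteResults = processOneFile('logs/day1.txt');  % hypothetical single-file run
profile viewer   % sort by self time; heavy read/textscan calls suggest file I/O
```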

>> sum(minuteResults.totalRequests)
ans =
>> bar(minuteResults.timeMinute, minuteResults.totalRequests)

Features covered in this video include:

Follow me (@stuartmcgarrity) if you want to be notified via Twitter when I post.

Play the video in full screen mode for a better viewing experience. 
