Using the MapReduce Technique to Process 500GB of Server Logs

Posted by Stuart McGarrity, March 16, 2018

53 views (last 30 days) | 0 Likes | 5 comments

Here I’m using the MapReduce functionality in Parallel Processing Toolbox to process several hundred GBs of server logs from our web site. I want to be able to visualize the counts per minute of certain quantities and also filter the data to look for certain special requests to our website. I start small, getting my algorithm to work with one file first and without parallel processing. But MapReduce lets you write it in a way that it will work on any size and with parallel processing.

It eventually took 50 min to process one day’s worth of data (72GB) and about 14hrs to do 8 days (562GB). I think I’ll profile the small dataset problem to see where its spending the time, but suspect it is all file I/O.

>> sum(minuteResults.totalRequests)
ans =
   1.3388e+09
>> bar(minuteResults.timeMinute ,minuteResults.totalRequests)

Features covered in this video include:

mapreduce
varfun

Play the video in full screen mode for a better viewing experience.

Category:: Format: Video

Comments

To leave a comment, please click here to sign in to your MathWorks Account or create a new one.

Stuart’s MATLAB Videos
Watch and Learn

Watch and Learn

Using the MapReduce Technique to Process 500GB of Server Logs

Comments

Stuart’s MATLAB VideosWatch and Learn

Watch and Learn

See Also

Comments

Stuart’s MATLAB Videos
Watch and Learn