Using the MapReduce Technique to Process 500GB of Server Logs

Posted by Stuart McGarrity, March 16, 2018


Here I’m using MapReduce with Parallel Computing Toolbox to process several hundred GB of server logs from our website. I want to visualize per-minute counts of certain quantities and also filter the data for certain special requests to the site. I start small, getting my algorithm to work on one file first and without parallel processing. MapReduce lets you write the algorithm so that the same code works on data of any size and with parallel processing.
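To make the "start small" step concrete, here is a minimal sketch of a per-minute request count with `mapreduce`, run serially on a single file. It is not the code from the video: the sample file, its `timestamp` and `url` columns, the timestamp format, and the function names `countPerMinuteMapper` and `sumReducer` are all assumptions made up for illustration. It assumes a release that allows local functions in scripts (R2016b or later).

```matlab
% Illustrative stand-in for one log file. The rows and the column names
% ('timestamp', 'url') are made up; the real log format is not shown in the post.
sampleFile = fullfile(tempdir, 'sample_log.csv');
fid = fopen(sampleFile, 'w');
fprintf(fid, 'timestamp,url\n');
fprintf(fid, '2018-03-01 00:00:05,/a\n');
fprintf(fid, '2018-03-01 00:00:40,/b\n');
fprintf(fid, '2018-03-01 00:01:10,/a\n');
fclose(fid);

% The datastore reads the file in chunks; only the timestamp column is needed.
ds = tabularTextDatastore(sampleFile, 'SelectedVariableNames', {'timestamp'});

mapreducer(0);   % serial execution while developing the algorithm
result = mapreduce(ds, @countPerMinuteMapper, @sumReducer);
minuteCounts = readall(result);   % table with Key and Value variables

function countPerMinuteMapper(data, ~, intermKVStore)
    % Emit one (minute, count) pair per distinct minute in this chunk.
    if isempty(data)
        return
    end
    t = data.timestamp;
    if ~isdatetime(t)
        % Assumed timestamp format; adjust to the real logs.
        t = datetime(t, 'InputFormat', 'yyyy-MM-dd HH:mm:ss');
    end
    minutes = dateshift(t, 'start', 'minute');
    [uniqueMinutes, ~, idx] = unique(minutes);
    counts = accumarray(idx, 1);
    keys = cellstr(uniqueMinutes, 'yyyy-MM-dd HH:mm');
    addmulti(intermKVStore, keys, num2cell(counts));
end

function sumReducer(key, valueIter, outKVStore)
    % Add up the partial counts that all chunks emitted for one minute.
    total = 0;
    while hasnext(valueIter)
        total = total + getnext(valueIter);
    end
    add(outKVStore, key, total);
end
```

Once this works on one file, the datastore can point at the full set of log files, and `mapreducer(gcp)` can replace `mapreducer(0)` to run the same mapper and reducer on a parallel pool.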

It eventually took 50 minutes to process one day’s worth of data (72 GB) and about 14 hours to process 8 days (562 GB). I plan to profile the small-dataset problem to see where it’s spending the time, but I suspect it is all file I/O.

>> sum(minuteResults.totalRequests)
ans =
   1.3388e+09
>> bar(minuteResults.timeMinute, minuteResults.totalRequests)

Features covered in this video include:

mapreduce
varfun
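As a rough illustration of where `varfun` can fit into this kind of workflow, the sketch below sums partial counts per minute using a grouping variable. How the video actually uses `varfun` is not shown in this text, so treat this as one plausible use. The table values are made up; the variable names `timeMinute` and `totalRequests` are borrowed from the `minuteResults` table above.

```matlab
% Made-up partial counts: the same minute appears in more than one row.
timeMinute = datetime(2018, 3, 1, 0, [0; 0; 1; 1], 0);
totalRequests = [2; 3; 4; 3];
partialCounts = table(timeMinute, totalRequests);

% Sum totalRequests within each minute. The result has the variables
% timeMinute, GroupCount and sum_totalRequests.
perMinute = varfun(@sum, partialCounts, ...
    'GroupingVariables', 'timeMinute', ...
    'InputVariables', 'totalRequests');
```

`varfun` names the output variable by prefixing the function name, so the summed column is `perMinute.sum_totalRequests`.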

Play the video in full screen mode for a better viewing experience.

Format: Video
