Using the MapReduce Technique to Process 500GB of Server Logs

By Stuart McGarrity, March 16, 2018

Here I’m using the MapReduce functionality in Parallel Computing Toolbox to process several hundred gigabytes of server logs from our website. I want to visualize per-minute counts of certain quantities and also filter the data for certain special requests to the site. I start small, getting my algorithm to work on one file first, without parallel processing. MapReduce lets you write the algorithm so that the same code works on data of any size and with parallel processing.
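To make the "start small" step concrete, here is a minimal sketch of counting requests per minute with mapreduce, run serially on one tiny file. It is not the code from the video: the sample file, its two-column layout, the column names timestamp and url, and the function names countMapper and countReducer are all assumptions made for illustration.

```matlab
% Sketch only: the log layout and all names below are hypothetical.
% Write a tiny sample log so the script is self-contained.
logFile = fullfile(tempdir, 'sample_access_log.csv');
fid = fopen(logFile, 'w');
fprintf(fid, 'timestamp,url\n');
fprintf(fid, '2018-03-01 00:00:05,/index.html\n');
fprintf(fid, '2018-03-01 00:00:40,/products\n');
fprintf(fid, '2018-03-01 00:01:10,/index.html\n');
fclose(fid);

% Read both columns as text; the mapper converts the timestamps itself.
ds = tabularTextDatastore(logFile, ...
    'ReadVariableNames', true, ...
    'TextscanFormats', {'%q', '%q'});

% mapreducer(0) runs serially in the local MATLAB session.
mr = mapreducer(0);
result = mapreduce(ds, @countMapper, @countReducer, mr, ...
    'OutputFolder', tempdir);
tbl = readall(result);   % table with Key and Value variables
disp(tbl)

function countMapper(data, ~, intermKVStore)
    % Emit one (minute, count) pair per distinct minute in this chunk.
    t = datetime(data.timestamp, 'InputFormat', 'yyyy-MM-dd HH:mm:ss');
    t.Format = 'yyyy-MM-dd HH:mm';      % dropping seconds gives minute keys
    keys = cellstr(t);
    [uniqueKeys, ~, idx] = unique(keys);
    counts = accumarray(idx, 1);
    addmulti(intermKVStore, uniqueKeys, num2cell(counts));
end

function countReducer(key, valueIter, outKVStore)
    % Sum the partial counts that the mappers produced for one minute.
    total = 0;
    while hasnext(valueIter)
        total = total + getnext(valueIter);
    end
    add(outKVStore, key, total);
end
```

The map and reduce functions only ever see one chunk or one key at a time, which is why the same code can be pointed at a datastore covering many files. Passing mapreducer(gcp) instead of mapreducer(0) should run it on a parallel pool, assuming Parallel Computing Toolbox is installed.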

It eventually took 50 minutes to process one day’s worth of data (72 GB) and about 14 hours to process 8 days (562 GB). I think I’ll profile the small-dataset problem to see where it’s spending the time, but I suspect it is all file I/O.
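As a rough check on those timings, the implied throughput can be worked out from the figures quoted above. This sketch assumes 1 GB = 1000 MB, and the 14-hour figure is approximate, so the second rate is only indicative.

```matlab
% Figures taken from the post; 1 GB is assumed to be 1000 MB.
oneDayMBps   = 72e3  / (50 * 60);     % 72 GB in 50 minutes
eightDayMBps = 562e3 / (14 * 3600);   % 562 GB in about 14 hours
fprintf('One day:    %.1f MB/s\n', oneDayMBps);
fprintf('Eight days: %.1f MB/s\n', eightDayMBps);
```

That works out to roughly 24 MB/s for the one-day run and roughly 11 MB/s for the eight-day run, so the larger job was slower per gigabyte than the single-day timing would predict.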

>> sum(minuteResults.totalRequests)
ans =
   1.3388e+09
>> bar(minuteResults.timeMinute, minuteResults.totalRequests)

Features covered in this video include:

mapreduce
varfun

Play the video in full screen mode for a better viewing experience.
