Here I’m using the MapReduce functionality in MATLAB, together with Parallel Computing Toolbox, to process several hundred GBs of server logs from our website. I want to visualize the counts per minute of certain quantities and also filter the data to look for certain special requests to our website. I start small, getting my algorithm to work with one file first and without parallel processing. But MapReduce lets you write the algorithm in a way that scales to any data size and can then run with parallel processing.
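The basic shape of a mapreduce job like this is sketched below. This is only a minimal sketch, not the actual code from the video: the file path, the Time variable name, and the helper names countMapper and countReducer are placeholder assumptions, and the table returned by readall is a list of Key/Value pairs that would still need a little post-processing to produce the timeMinute and totalRequests variables used later.

% Point a datastore at the log files; assume a timestamp column named Time
ds = tabularTextDatastore('logs/*.txt', 'SelectedVariableNames', {'Time'});

mapreducer(gcp);   % run the map and reduce phases on a parallel pool (Parallel Computing Toolbox)
result = mapreduce(ds, @countMapper, @countReducer);
minuteResults = readall(result);   % table of Key/Value pairs

function countMapper(data, ~, intermKVStore)
    % Bucket each request into its minute and emit one count per minute seen
    % in this chunk (assumes Time was parsed as datetime by the datastore)
    minuteKeys = dateshift(data.Time, 'start', 'minute');
    [uniqueMinutes, ~, idx] = unique(minuteKeys);
    counts = accumarray(idx, 1);
    for k = 1:numel(uniqueMinutes)
        add(intermKVStore, char(uniqueMinutes(k)), counts(k));
    end
end

function countReducer(minuteKey, countIter, outKVStore)
    % Sum the per-chunk counts for each minute
    total = 0;
    while hasnext(countIter)
        total = total + getnext(countIter);
    end
    add(outKVStore, minuteKey, total);
end

Because the map function only ever sees one chunk of one file at a time, the same two helper functions work unchanged whether the datastore points at a single test file or at eight days of logs.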
It eventually took 50 minutes to process one day’s worth of data (72 GB) and about 14 hours to do 8 days (562 GB). I think I’ll profile the small-dataset run to see where it’s spending the time, but I suspect it is mostly file I/O.
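If I do profile it, a minimal sketch would look something like the following, where runOneDay is a hypothetical wrapper around the single-file, non-parallel version of the algorithm:

profile on
minuteResults = runOneDay('logs/oneDayLog.txt');   % hypothetical single-file run
profile viewer   % inspect how much time goes to reading files vs. processing them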
>> sum(minuteResults.totalRequests)

ans =

   1.3388e+09

>> bar(minuteResults.timeMinute, minuteResults.totalRequests)
Follow me (@stuartmcgarrity) if you want to be notified via Twitter when I post.
Play the video in full screen mode for a better viewing experience.