{"id":4874,"date":"2013-10-25T09:00:09","date_gmt":"2013-10-25T13:00:09","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/?p=4874"},"modified":"2013-10-25T09:57:03","modified_gmt":"2013-10-25T13:57:03","slug":"visualizing-the-frequency-distribution-of-2-dimensional-data","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2013\/10\/25\/visualizing-the-frequency-distribution-of-2-dimensional-data\/","title":{"rendered":"Visualizing the frequency distribution of 2-Dimensional Data"},"content":{"rendered":"\r\n<div class=\"content\"><!--introduction--><!--\/introduction--><p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/911\">Brett<\/a>'s Pick this week is <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/23238-cloudplot\">\"cloudPlot\"<\/a>, by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/25696\">Daniel Armyr<\/a>.<\/p><p>As data acquisition and storage capacities continue to expand, we are constantly being bombarded with increasingly large datasets. Analyzing--or even just visualizing--these data represents one of the most pressing challenges of our time.<\/p><p>Whenever I present \"Speeding Up MATLAB Applications,\" I always make a point of saying that there are \"better\" (i.e., efficient) ways to use MATLAB, and \"worse\" (inefficient) ways to use it. (This is true of all languages, of course, but the \"cost\" of doing things inefficiently in an interpreted language like MATLAB can be more pronounced than when compared with poorly implemented compiled code. I find that people who tell me that MATLAB is slow, often don't use it to its full potential.) I like to make the point, too, that writing better code entails keeping tabs on memory management in addition to performance. 
(This becomes ever more important with the aforementioned \"Big Data challenges.\") In fact, one of the topics of that \"Speeding Up\" presentation deals with efficient visualization of data--and with recognition that visualizations may contain full copies of your data.<\/p><p>Daniel's cloudPlot provides a very clever, and very well implemented, way of visualizing large 2-dimensional data. You can see this clearly in the following code section. First, we create some data. We'll create <tt>x<\/tt> and <tt>y<\/tt> as 1 million-by-one vectors of normally distributed random doubles; as created, each variable occupies 8 megabytes of memory:<\/p><pre class=\"language-matlab\">x = randn(1000000,1);\r\ny = randn(1000000,1);\r\n<\/pre><p>Now how would we best visualize <tt>x<\/tt> versus <tt>y<\/tt>? We could plot them, of course:<\/p><pre class=\"language-matlab\">h = plot( x, y, <span class=\"string\">'b.'<\/span> );\r\naxis <span class=\"string\">equal<\/span>\r\ntitle ( <span class=\"string\">'Plotting all data'<\/span> ,<span class=\"string\">'fontsize'<\/span>,12,<span class=\"string\">'fontweight'<\/span>,<span class=\"string\">'bold'<\/span>);\r\n<\/pre><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/pick\/files\/plotall.png\" alt=\"\"> <\/p><p>As I see it, there are two significant problems with this visualization. First, we have lost all the subtleties of the data. We have a big mass of points, from which we can tell very little about the distribution of our data. And secondly, that graphic contains <i>full copies<\/i> of <tt>x<\/tt> and <tt>y<\/tt>! 
We can see that readily when we get the properties of the plot:<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/pick\/files\/getH.png\" alt=\"\"> <\/p><p>Plotting a small subset--say, 2 percent--of the data is a good start; we lose very little information, but the plot contains only about 320 kilobytes of data instead of 16 megabytes:<\/p><pre class=\"language-matlab\">pct = 2;\r\nstepsize = 100\/pct;\r\nh = plot(x(1:stepsize:end),y(1:stepsize:end),<span class=\"string\">'r.'<\/span>);\r\ntitle ( <span class=\"string\">'Plotting 2% of data'<\/span> ,<span class=\"string\">'fontsize'<\/span>,12,<span class=\"string\">'fontweight'<\/span>,<span class=\"string\">'bold'<\/span>);\r\n<\/pre><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/pick\/files\/plotsome.png\" alt=\"\"> <\/p><p>However, we still have the significant problem that we can't really tell what's going on with our data.<\/p><p>Enter Daniel's <tt>cloudPlot<\/tt>. After cleverly binning the 2-dimensional data, Daniel's function creates a great visualization that yields a lot more information than do the plots we created above:<\/p><pre class=\"language-matlab\">subplot ( 2, 2, 1 );\r\ncolormap(jet);\r\ncloudPlot( x, y, [-5 5 -5 5] );\r\ntitle ( <span class=\"string\">'Bins exactly one pixel large'<\/span> );\r\nsubplot ( 2, 2, 2 );\r\ncloudPlot( x, y, [-5 5 -5 5], [], [100 100] );\r\ntitle ( <span class=\"string\">'Bins larger than one pixel'<\/span> );\r\nsubplot ( 2, 2, 3 );\r\ncloudPlot( x, y, [-5 5 -5 5], [], [1000 1000] );\r\ntitle ( <span class=\"string\">'Bins smaller than one pixel'<\/span> );\r\n<\/pre><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/pick\/files\/cloudPlots.png\" alt=\"\"> <\/p><p>Now we have a great deal more insight into the distribution of those data. 
<i>And<\/i>, the image of the data in the upper left--arguably the most illustrative of the visualizations-- occupies only about a half of a megabyte. (The upper right image is smaller, the lower left, larger.)<\/p><p>Very useful indeed!<\/p><p>As always, I welcome your <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=4874#respond\">thoughts and comments<\/a>. Or leave feedback for Daniel <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/23238-cloudplot#comments\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_01d09b08cedb4eea8ac6189e33a4cf49() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='01d09b08cedb4eea8ac6189e33a4cf49 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 01d09b08cedb4eea8ac6189e33a4cf49';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2013 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_01d09b08cedb4eea8ac6189e33a4cf49()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2013b<br><\/p><p class=\"footer\"><br>\r\n      Published with MATLAB&reg; R2013b<br><\/p><\/div><!--\r\n01d09b08cedb4eea8ac6189e33a4cf49 ##### SOURCE BEGIN #####\r\n%% Visualizing the frequency distribution of 2-Dimensional Data\r\n%% \r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/911 Brett>'s Pick this week is\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/23238-cloudplot \"cloudPlot\">, by\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/25696 Daniel Armyr>.\r\n\r\n%% \r\n% As data acquisition and storage capacities continue to expand, we are\r\n% constantly being bombarded with increasingly large datasets. AnalyzingREPLACE_WITH_DASH_DASHor\r\n% even just visualizingREPLACE_WITH_DASH_DASHthese data represents one of the most pressing challenges\r\n% of our time. 
\r\n\r\n%%\r\n% Whenever I present \"Speeding Up MATLAB Applications,\" I always make a point of saying\r\n% that there are \"better\" (i.e., efficient) ways to use MATLAB, and \"worse\"\r\n% (inefficient) ways to use it. (This is true of all languages, of course,\r\n% but the \"cost\" of doing things inefficiently in an interpreted language\r\n% like MATLAB can be more pronounced than when compared with poorly\r\n% implemented compiled code. I find that people who tell me that MATLAB is\r\n% slow, often don't use it to its full potential.) I like to make the\r\n% point, too, that writing better code entails keeping tabs on memory\r\n% management in addition to performance. (This becomes ever more important\r\n% with the aforementioned \"Big Data challenges.\") In fact, one of the\r\n% topics of that \"Speeding Up\" presentation deals with efficient\r\n% visualization of dataREPLACE_WITH_DASH_DASHand with recognition that visualizations may\r\n% contain full copies of your data.\r\n\r\n%% \r\n% Daniel's cloudPlot provides a very clever, and very well implemented, way\r\n% of visualizing large 2-dimensional data. You can see this clearly in the\r\n% following code section. First, we create some data. We'll create |x| and\r\n% |y| as 1 million-by-one vectors of normally distributed random doubles;\r\n% as created, each variable occupies 8 megabytes of memory:\r\n\r\n%%\r\n%   x = randn(1000000,1);\r\n%   y = randn(1000000,1);\r\n\r\n%% \r\n% Now how would we best visualize |x| versus |y|? We could plot them, of\r\n% course:\r\n\r\n%%\r\n%   h = plot( x, y, 'b.' );\r\n%   axis equal\r\n%   title ( 'Plotting all data' ,'fontsize',12,'fontweight','bold');\r\n\r\n%%\r\n% \r\n% <<https:\/\/blogs.mathworks.com\/pick\/files\/plotall.png>>\r\n% \r\n\r\n%%\r\n% As I see it, there are two significant problems with this visualization.\r\n% First, we have lost all the subtleties of the data. 
We have a big mass of\r\n% points, from which we can tell very little about the distribution of our\r\n% data. And secondly, that graphic contains _full copies_ of |x| and |y|!\r\n% We can see that readily when we get the properties of the plot:\r\n% \r\n%%\r\n% \r\n% <<https:\/\/blogs.mathworks.com\/pick\/files\/getH.png>>\r\n% \r\n\r\n%% \r\n% Plotting a small subsetREPLACE_WITH_DASH_DASHsay, 2 percentREPLACE_WITH_DASH_DASHof the data is a\r\n% good start; we lose very little information, but the plot contains only\r\n% about 320 kilobytes of data instead of 16 megabytes:\r\n\r\n%%\r\n%   pct = 2;\r\n%   stepsize = 100\/pct;\r\n%   h = plot(x(1:stepsize:end),y(1:stepsize:end),'r.');\r\n%   title ( 'Plotting 2% of data' ,'fontsize',12,'fontweight','bold');\r\n\r\n%%\r\n% \r\n% <<https:\/\/blogs.mathworks.com\/pick\/files\/plotsome.png>>\r\n% \r\n\r\n%%\r\n% However, we still have the significant problem that we can't really tell\r\n% what's going on with our data.\r\n\r\n%%\r\n% Enter Daniel's |cloudPlot|. After cleverly binning the 2-dimensional\r\n% data, Daniel's function creates a great visualization that yields a lot\r\n% more information than do the plots we created above:\r\n\r\n%%\r\n%   subplot ( 2, 2, 1 );\r\n%   colormap(jet);\r\n%   cloudPlot( x, y, [-5 5 -5 5] );\r\n%   title ( 'Bins exactly one pixel large' );\r\n%   subplot ( 2, 2, 2 );\r\n%   cloudPlot( x, y, [-5 5 -5 5], [], [100 100] );\r\n%   title ( 'Bins larger than one pixel' );\r\n%   subplot ( 2, 2, 3 );\r\n%   cloudPlot( x, y, [-5 5 -5 5], [], [1000 1000] );\r\n%   title ( 'Bins smaller than one pixel' );\r\n\r\n%%\r\n% \r\n% <<https:\/\/blogs.mathworks.com\/pick\/files\/cloudPlots.png>>\r\n% \r\n\r\n%%\r\n% Now we have a great deal more insight into the distribution of those data. _And_,\r\n% the image of the data in the upper leftREPLACE_WITH_DASH_DASHarguably the most illustrative of the visualizationsREPLACE_WITH_DASH_DASH\r\n% occupies only about a half of a megabyte. 
(The upper right image is\r\n% smaller, the lower left, larger.)\r\n\r\n%%\r\n% Very useful indeed!\r\n\r\n%%\r\n% As always, I welcome your\r\n% <https:\/\/blogs.mathworks.com\/pick\/?p=4874#respond thoughts and comments>.\r\n% Or leave feedback for Daniel\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/23238-cloudplot#comments here>.\r\n##### SOURCE END ##### 01d09b08cedb4eea8ac6189e33a4cf49\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/pick\/files\/plotall.png\" onError=\"this.style.display ='none';\" \/><\/div><p>\r\nBrett's Pick this week is \"cloudPlot\", by Daniel Armyr.As data acquisition and storage capacities continue to expand, we are constantly being bombarded with increasingly large datasets.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2013\/10\/25\/visualizing-the-frequency-distribution-of-2-dimensional-data\/\">read more 
>><\/a><\/p>","protected":false},"author":34,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/4874"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/34"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=4874"}],"version-history":[{"count":9,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/4874\/revisions"}],"predecessor-version":[{"id":4889,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/4874\/revisions\/4889"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=4874"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=4874"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=4874"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}