# File Exchange Pick of the Week

## Visualizing the frequency distribution of 2-Dimensional Data

Brett's Pick this week is "cloudPlot", by Daniel Armyr.

As data acquisition and storage capacities continue to expand, we are constantly being bombarded with increasingly large datasets. Analyzing--or even just visualizing--these data represents one of the most pressing challenges of our time.

Whenever I present "Speeding Up MATLAB Applications," I make a point of saying that there are "better" (i.e., efficient) ways to use MATLAB, and "worse" (inefficient) ways to use it. (This is true of all languages, of course, but the "cost" of doing things inefficiently in an interpreted language like MATLAB can be more pronounced than it is with poorly implemented compiled code. I find that people who tell me that MATLAB is slow often don't use it to its full potential.) I like to make the point, too, that writing better code entails keeping tabs on memory management in addition to performance. (This becomes ever more important with the aforementioned "Big Data" challenges.) In fact, one of the topics of that "Speeding Up" presentation deals with efficient visualization of data--and with the recognition that visualizations may contain full copies of your data.

Daniel's cloudPlot provides a very clever, and very well implemented, way of visualizing large 2-dimensional data. You can see this clearly in the following code section. First, we create some data. We'll create x and y as 1 million-by-one vectors of normally distributed random doubles; as created, each variable occupies 8 megabytes of memory:

x = randn(1000000,1);
y = randn(1000000,1);


Now how would we best visualize x versus y? We could plot them, of course:

h = plot( x, y, 'b.' );
axis equal
title ( 'Plotting all data' ,'fontsize',12,'fontweight','bold');


As I see it, there are two significant problems with this visualization. First, we have lost all the subtleties of the data. We have a big mass of points, from which we can tell very little about the distribution of our data. And secondly, that graphic contains full copies of x and y! We can see that readily in the XData and YData properties of the plot handle h, each of which holds all one million values.
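One way to check this is sketched below. It assumes a fresh session and re-creates the data and the plot; the variable names xd, yd, and plotBytes are illustrative only.

```matlab
% Sketch: measure how much coordinate data a line object holds.
x = randn(1000000,1);
y = randn(1000000,1);
h = plot(x, y, 'b.');

% The line object stores its own copies of the coordinates.
xd = get(h, 'XData');
yd = get(h, 'YData');

% whos reports the size in bytes of each retrieved array.
xInfo = whos('xd');
yInfo = whos('yd');
plotBytes = xInfo.bytes + yInfo.bytes;   % two vectors, 8 bytes per double

fprintf('Plot holds %d points, about %.0f MB of coordinate data\n', ...
    numel(xd), plotBytes/1e6);
```

The byte count follows directly from the data type: one million doubles at 8 bytes each, once for x and once for y.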

Plotting a small subset--say, 2 percent--of the data is a good start; we lose very little information, but the plot contains only about 320 kilobytes of data instead of 16 megabytes:

pct = 2;
stepsize = 100/pct;
h = plot(x(1:stepsize:end),y(1:stepsize:end),'r.');
title ( 'Plotting 2% of data' ,'fontsize',12,'fontweight','bold');
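The memory saving can be worked out without plotting anything. The sketch below is only arithmetic on the sizes used above; nPoints, nKept, and bytesKept are names introduced here for illustration. It also shows a random subsample as one possible alternative to strided indexing, which may be preferable if the data are ordered in some way.

```matlab
% Sketch: estimate the coordinate data held by the 2% subset plot.
nPoints  = 1000000;
pct      = 2;
stepsize = 100/pct;                      % keep every 50th point

nKept     = numel(1:stepsize:nPoints);   % number of points actually plotted
bytesKept = nKept * 2 * 8;               % x and y, 8 bytes per double
bytesAll  = nPoints * 2 * 8;             % full data set, for comparison

fprintf('%d of %d points kept: %d bytes instead of %d bytes\n', ...
    nKept, nPoints, bytesKept, bytesAll);

% Alternative (an assumption, not part of the original post):
% draw the same number of points at random instead of at a fixed stride.
idx = randperm(nPoints, nKept);
```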


However, we still have the significant problem that we can't really tell what's going on with our data.

Enter Daniel's cloudPlot. After cleverly binning the 2-dimensional data, Daniel's function creates a great visualization that yields a lot more information than do the plots we created above:

% Upper left: default binning, one bin per pixel
subplot ( 2, 2, 1 );
colormap(jet);
cloudPlot( x, y, [-5 5 -5 5] );
title ( 'Bins exactly one pixel large' );
% Upper right: 100-by-100 bins
subplot ( 2, 2, 2 );
cloudPlot( x, y, [-5 5 -5 5], [], [100 100] );
title ( 'Bins larger than one pixel' );
% Lower left: 1000-by-1000 bins
subplot ( 2, 2, 3 );
cloudPlot( x, y, [-5 5 -5 5], [], [1000 1000] );
title ( 'Bins smaller than one pixel' );


Now we have a great deal more insight into the distribution of those data. And the image of the data in the upper left--arguably the most illustrative of the visualizations--occupies only about half a megabyte. (The upper-right image is smaller; the lower-left one is larger.)
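For readers curious about the general idea of binning, here is a minimal sketch using accumarray and imagesc. It is not Daniel's implementation; the 100-by-100 bin count, the variable names, and the handling of points outside the limits are all assumptions made for illustration.

```matlab
% Sketch of 2-D binning with accumarray; NOT cloudPlot's actual code.
x = randn(1000000,1);
y = randn(1000000,1);

lims  = [-5 5 -5 5];     % [xmin xmax ymin ymax], same limits as above
nBins = [100 100];       % assumed number of bins in x and y

% Keep only the points that fall inside the limits.
inside = x >= lims(1) & x < lims(2) & y >= lims(3) & y < lims(4);

% Map each coordinate to a bin index in 1..nBins (clamped against round-off).
ix = floor((x(inside) - lims(1)) / (lims(2) - lims(1)) * nBins(1)) + 1;
iy = floor((y(inside) - lims(3)) / (lims(4) - lims(3)) * nBins(2)) + 1;
ix = min(ix, nBins(1));
iy = min(iy, nBins(2));

% Count points per bin; rows index y so the image is oriented like a plot.
counts = accumarray([iy ix], 1, [nBins(2) nBins(1)]);

% Display the counts as an image, using bin centers as axis coordinates.
xc = lims(1) + ((1:nBins(1)) - 0.5) * (lims(2) - lims(1)) / nBins(1);
yc = lims(3) + ((1:nBins(2)) - 0.5) * (lims(4) - lims(3)) / nBins(2);
imagesc(xc, yc, counts);
axis xy
axis equal
axis tight
colorbar
title('Point counts per bin');
```

The point of the design is that the counts matrix has a fixed size (here 100 by 100 doubles, or 80,000 bytes) no matter how many points go into it, so the graphic no longer scales with the size of the data.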

Very useful indeed!

As always, I welcome your thoughts and comments. Or leave feedback for Daniel here.

Published with MATLAB® R2013b

### 4 Responses to “Visualizing the frequency distribution of 2-Dimensional Data”

1. Andy replied on :

I think a hexagonal scatter plot would’ve been very appropriate for this data.

2. Eric replied on :

I’ve found cloudPlot very handy and use it quite a bit. I think that something like it could be a nice addition to MATLAB :) The only thing I’ve found is that the plot becomes inaccurate for bins < 1 pixel, however this is expected behavior for the function – I think it gives the user a warning in this situation?

Since the general topic is about visualizing large data, I’d also add that there’s another very useful entry that was covered a few months ago in a previous blog topic. Link below for anyone who hasn’t seen it before.

http://blogs.mathworks.com/pick/2013/06/07/plot-real-big/

4. bshoelso replied on :

@Andy:
I presume you’re talking about Salman Mashayekh’s File Exchange submission called hexagonalScatterplot? I guess that you may be right, but that function doesn’t have a lot of details about how to use it. (Otherwise, it might make a nice Pick of the Week, too!)

@Eric:
Yes, cloudPlot will issue a warning when subpixel bins are requested.

Thanks for the reminder, Adam. There are so many great files out there, it’s easy to forget.

Brett & Jiro share their favorite user-contributed submissions from the File Exchange.

These postings are the author's and don't necessarily represent the opinions of MathWorks.