Outlier Detection and Robust Regression in MATLAB with the FSDA Toolbox
In 1857, Scottish physicist James David Forbes published a paper on the relationship between atmospheric pressure and the boiling point of water. To gather data, he traveled around the Alps measuring the boiling point of water and atmospheric pressure at various altitudes. The idea was that climbers could simply measure the boiling point of water to estimate atmospheric pressure, and hence altitude, instead of trying to use fragile and expensive barometers.
The data he collected are included in the open source MATLAB toolbox FSDA (Flexible Statistics Data Analysis). If you want to run the code examples in this blog post, you'll need to install it from the File Exchange or from MATLAB Add-ons.
Let's take a look at the data along with the traditional least squares linear regression fit provided by fitlm from MathWorks' Statistics and Machine Learning Toolbox.
load('forbes.txt'); % Requires FSDA toolbox to be installed
y=forbes(:,2);
X=forbes(:,1);
X = (X - 32) * 5/9; % Convert to Celsius
plot(X,y,'o');
tradFit = fitlm(X,y); % Fit the data with fitlm (Statistics and Machine Learning Toolbox)
hold on
plot(X,tradFit.predict(X)); % plot the fitted line
xlabel('Boiling point', 'Fontsize',16);
ylabel('100 x log(pressure)','Fontsize',16);
title("Forbes data with Linear Regression Fit")
legend({"Points","Least squares fit"},Location="Best");
hold off
Now, I want you to imagine that Forbes took a student with him and allowed him to take one additional measurement. Imagine, too, that this student was very careless when he took it. Forbes, however, trusted him completely, so we'll add this measurement to the data and see how it affects things.
%% Contaminated data (just 1 outlier)
yc = y;
yc(end+1) = 140;
Xc = X;
Xc(end+1) = 110;
plot(Xc,yc,'o');
hold on % Show the outlier in red
plot(Xc(end),yc(end),'o','MarkerFaceColor','auto','Color','r')
contaminatedTradFit = fitlm(Xc,yc);
plot(Xc,contaminatedTradFit.predict(Xc)); % plot the fitted line
xlabel('Boiling point', 'Fontsize',16);
ylabel('100 x log(pressure)','Fontsize',16);
legend({"Forbes' points","Outlier","Least squares fit"},Location="Best");
title("Contaminated Forbes data with Linear Regression Fit")
hold off
That one outlier has changed everything. Standard linear regression is not robust to this type of contamination.
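To put a number on the change, one option is to compare the fitted coefficients directly. The sketch below is self-contained and assumes, as in the rest of this post, that FSDA (for forbes.txt) and fitlm are available. The names cleanFit, contaminatedFit and comparison are purely illustrative.

```matlab
% Reload the data so this snippet runs on its own (forbes.txt ships with FSDA)
load('forbes.txt');
X = (forbes(:,1) - 32) * 5/9;   % boiling point, converted to Celsius
y = forbes(:,2);                % 100 x log(pressure)

% Append the careless student's measurement
Xc = [X; 110];
yc = [y; 140];

% Ordinary least squares on the clean and the contaminated data
cleanFit = fitlm(X, y);
contaminatedFit = fitlm(Xc, yc);

% Rows: intercept and slope. Columns: clean vs contaminated data
comparison = table(cleanFit.Coefficients.Estimate, ...
                   contaminatedFit.Coefficients.Estimate, ...
                   'VariableNames', {'Clean','Contaminated'}, ...
                   'RowNames', {'Intercept','Slope'});
disp(comparison)
```

The size of the shift in the slope row shows how much leverage a single bad point has over a least squares fit.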
A robust fit using FSDA toolbox
Enter the LXS function from the FSDA toolbox, which performs a more robust fit. By default, it uses an algorithm called Least Median of Squares regression, first described in a 1984 paper by Peter Rousseeuw that had been cited over 5,000 times at the time of writing.
[outLXS]=LXS(yc,Xc);
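As background, the textbook form of the Least Median of Squares criterion from Rousseeuw (1984) can be written as follows for a straight-line fit. The symbols here are introduced only for illustration, and the details of what LXS computes internally (for example, exactly which ordered squared residual it minimizes) may differ from this idealized statement.

```latex
\hat{\beta}_{\mathrm{LMS}}
  = \arg\min_{\beta}\; \operatorname*{med}_{i=1,\dots,n}\; r_i^2(\beta),
\qquad
r_i(\beta) = y_i - \beta_0 - \beta_1 x_i
```

Ordinary least squares minimizes the sum of the squared residuals instead, so a single very large residual can dominate the criterion. The median is unaffected by the size of the largest residuals, as long as fewer than about half of the observations are contaminated.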
Let's get the coefficients of the robust fit and compare the result with the traditional least squares fit.
b = outLXS.beta; % Fit coefficients
outliers = outLXS.outliers;
plot(Xc,yc,'o'); % Plot data
hold on % Show the outlier in red
plot(Xc(end),yc(end),'o','MarkerFaceColor','auto','Color','r')
plot(Xc,contaminatedTradFit.predict(Xc)); % Plot traditional least squares fit
plot(Xc,b(1)+b(2)*Xc, 'r','LineWidth' ,1); % Plot robust fit
legend({"Forbes' points","Outlier","Least squares fit","Robust Fit"},Location="Best");
hold off
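The plot makes the comparison visually. For a numerical view, the coefficients of the fits can be put side by side. This is a minimal, self-contained sketch that assumes FSDA and fitlm are installed. The variable names lsClean, lsContaminated and coeffs are hypothetical, and LXS may print messages to the Command Window depending on its options.

```matlab
% Rebuild the clean and contaminated data (forbes.txt ships with FSDA)
load('forbes.txt');
X = (forbes(:,1) - 32) * 5/9;   % boiling point, converted to Celsius
y = forbes(:,2);                % 100 x log(pressure)
Xc = [X; 110];                  % add the single outlier
yc = [y; 140];

lsClean = fitlm(X, y);          % least squares, original data
lsContaminated = fitlm(Xc, yc); % least squares, contaminated data
outLXS = LXS(yc, Xc);           % robust fit, contaminated data

% One column per fit; rows are intercept and slope
coeffs = [lsClean.Coefficients.Estimate, ...
          lsContaminated.Coefficients.Estimate, ...
          outLXS.beta(:)];
disp(array2table(coeffs, ...
    'VariableNames', {'LS_clean','LS_contaminated','LXS_contaminated'}, ...
    'RowNames', {'Intercept','Slope'}))
```

If the robust fit is doing its job, the third column should sit much closer to the first than the second does.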
Not only is the fit returned by LXS robust to our deliberate contamination, but it also identifies the contamination as an outlier. In addition, it flags one of the points in the original data as a potential outlier. We can get the indices of the outliers from the structure returned by LXS:
outLXS.outliers
Here's the plot once again, this time with the outliers clearly marked:
plot(Xc,yc,'o'); %Plot data
hold on
plot(Xc,contaminatedTradFit.predict(Xc)); % Plot traditional least squares fit
plot(Xc,b(1)+b(2)*Xc, 'r','LineWidth' ,1); % Plot robust fit
plot(Xc(outliers),yc(outliers),'.', MarkerSize=20)
legend({"Points","Least squares fit","Robust Fit","Outliers"},Location="Best");
hold off
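For readers wondering what a least-median-of-squares fit involves, here is a rough sketch in base MATLAB. It tries the line through every pair of points and keeps the one with the smallest median squared residual. This is only an illustration on made-up, noise-free data. It is not how LXS is implemented, and all variable names are hypothetical.

```matlab
% Made-up illustrative data: points exactly on y = 1 + 2x, plus one gross outlier
x = (1:10)';
y = 1 + 2*x;
y(10) = -20;                       % deliberately corrupt the last observation

n = numel(x);
pairs = nchoosek(1:n, 2);          % every subset of two points
bestCrit = Inf;                    % smallest median squared residual so far
bestBeta = [NaN; NaN];             % [intercept; slope] of the best line so far

for k = 1:size(pairs, 1)
    idx = pairs(k, :);
    A = [ones(2,1), x(idx)];
    beta = A \ y(idx);             % exact line through the two chosen points
    res2 = (y - beta(1) - beta(2)*x).^2;
    crit = median(res2);           % the least-median-of-squares criterion
    if crit < bestCrit
        bestCrit = crit;
        bestBeta = beta;
    end
end

lsBeta = [ones(n,1), x] \ y;       % ordinary least squares, for comparison

fprintf('Pairwise LMS sketch: intercept %.3f, slope %.3f\n', bestBeta);
fprintf('Least squares:       intercept %.3f, slope %.3f\n', lsBeta);
```

Because nine of the ten points lie on the same line, any pair drawn from them gives a median squared residual of essentially zero, so the search recovers that line, while the least squares slope is pulled towards the corrupted point. An exhaustive search like this is only practical for small problems.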
This is an example of robust linear regression, which is just one of the areas of statistics and data analysis covered by the FSDA toolbox. For a more in-depth discussion and an introduction to a suite of algorithms, refer to the FSDA documentation page "Introduction to robust estimators in linear regression". Further analysis of this dataset is contained in Atkinson and Riani (2000). A comparison of different forms of robust regression estimators is given in the forthcoming book Atkinson et al. (2024).
About the FSDA toolbox
Developed at Università di Parma and the Joint Research Centre of the European Commission, the FSDA toolbox is one of the most popular toolboxes on the MathWorks File Exchange and contains over 300 functions covering areas such as Robust Regression Analysis, Robust Multivariate Analysis and Robust Cluster Analysis.
References
Atkinson, A.C. and Riani, M. (2000). Robust Diagnostic Regression Analysis. Springer-Verlag, New York.
Atkinson, A.C., Riani, M., Corbellini, A., Perrotta, D., and Todorov, V. (2024). Applied Robust Statistics through the Monitoring Approach. Springer Nature, Heidelberg. https://github.com/UniprJRC/FigMonitoringBook
Rousseeuw, P.J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79(388), 871-880.
Forbes, J. (1857). Further experiments and remarks on the measurement of heights and boiling point of water. Transactions of the Royal Society of Edinburgh, 21, 235-243.