In 1857, Scottish physicist James David Forbes published a paper on the relationship between atmospheric pressure and the boiling point of water. To collect data, he traveled around the Alps measuring both quantities at various altitudes. The idea was that climbers could measure the boiling point of water to estimate atmospheric pressure, and hence altitude, instead of carrying fragile and expensive barometers.
The data he collected are included in the open-source MATLAB toolbox FSDA (Flexible Statistics Data Analysis). To run the code examples in this blog post, install it from the File Exchange or from MATLAB Add-Ons. Let's take a look at the data along with the traditional least squares linear regression fit provided by fitlm from MathWorks' Statistics and Machine Learning Toolbox.
load('forbes.txt'); % Requires FSDA toolbox to be installed
X = forbes(:,1); % Boiling point in Fahrenheit (assumes it is column 1 of forbes.txt)
y = forbes(:,2); % 100 x log(pressure) (assumes it is column 2 of forbes.txt)
X = (X - 32) * 5/9; % Convert to Celsius
tradFit = fitlm(X,y); % Use Statistics toolbox to fit the data
plot(X,y,'o'); % Plot the data points
hold on
plot(X,tradFit.predict(X)); % Plot the fitted line
xlabel('Boiling point', 'Fontsize',16);
ylabel('100 x log(pressure)','Fontsize',16);
title("Forbes data with Linear Regression Fit")
legend({"Points","Least squares fit"},Location="Best");
Now, I want you to imagine that Forbes took a student with him whom he allowed to take one additional measurement. Imagine that this student was very careless when he took his measurement. Forbes, however, trusted him completely so we'll add this measurement to the data and see how it affects things.
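%% Contaminated data (just 1 outlier)
% Xc and yc hold X and y with the student's measurement appended as the last element
figure
plot(Xc,yc,'o'); % Plot data
hold on % Show the outlier in red
plot(Xc(end),yc(end),'o','MarkerFaceColor','auto','Color','r')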
contaminatedTradFit = fitlm(Xc,yc);
plot(Xc,contaminatedTradFit.predict(Xc)); % plot the fitted line
xlabel('Boiling point', 'Fontsize',16);
ylabel('100 x log(pressure)','Fontsize',16);
legend({"Forbes' points","Outlier","Least squares fit"},Location="Best");
title("Contaminated Forbes data with Linear Regression Fit")
That one outlier has changed everything. Standard linear regression is not robust to this type of contamination.
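To see why a single point can have this much influence, here is a minimal sketch on synthetic data (not the Forbes measurements). Least squares minimizes the sum of squared residuals, so one very large residual can dominate the criterion and drag the line toward it. The data, the outlier position, and the variable names below are illustrative assumptions; only base MATLAB is used.

```matlab
% Synthetic illustration (NOT the Forbes data): y = 2*x + 1 plus small noise
rng(0);                               % reproducible noise
x = (1:20)';
y = 2*x + 1 + 0.1*randn(20,1);

pClean = polyfit(x, y, 1);            % [slope intercept] from least squares

% Append one careless "measurement" far below the trend (hypothetical values)
xc = [x; 21];
yc = [y; -40];
pContaminated = polyfit(xc, yc, 1);   % refit with the extra point

fprintf('Slope without outlier:  %.3f\n', pClean(1));
fprintf('Slope with one outlier: %.3f\n', pContaminated(1));
```

The outlier sits at the edge of the x range, where it has the most leverage, which is why one point out of 21 is enough to change the slope noticeably.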
A robust fit using the FSDA toolbox
Enter the LXS function from the FSDA toolbox, which performs a more robust fit. By default, it uses an algorithm called Least Median of Squares regression, first described in a 1984 paper by Peter Rousseeuw that had been cited over 5,000 times at the time of writing.
[outLXS]=LXS(yc,Xc);
Total estimated time to complete LMS: 0.20 seconds
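The idea behind Least Median of Squares can be sketched in a few lines of base MATLAB. Instead of minimizing the sum of squared residuals, it looks for the line that minimizes their median, so up to roughly half of the points can be arbitrarily bad without moving the fit. The sketch below approximates this by trying lines through random pairs of points; it is an illustration on synthetic data under assumed settings (such as the number of trials), not a description of how LXS is implemented.

```matlab
% Synthetic data (NOT the Forbes data): a line plus one gross outlier
rng(1);
x = (1:20)';
y = 2*x + 1 + 0.1*randn(20,1);
x(end+1) = 21;  y(end+1) = -40;        % hypothetical careless measurement
n = numel(x);

bestCrit = Inf;
bestBeta = [NaN; NaN];
nTrials = 500;                         % number of random subsets (arbitrary choice)
for k = 1:nTrials
    idx = randperm(n, 2);              % two distinct points define a candidate line
    A = [ones(2,1) x(idx)];
    beta = A \ y(idx);                 % exact fit through the two points
    r2 = (y - beta(1) - beta(2)*x).^2; % squared residuals over ALL points
    crit = median(r2);                 % LMS criterion: median instead of sum
    if crit < bestCrit                 % keep the candidate with the smallest median
        bestCrit = crit;
        bestBeta = beta;
    end
end
fprintf('LMS-style intercept %.3f, slope %.3f\n', bestBeta(1), bestBeta(2));
```

Any candidate line that passes through the outlier fits most of the other points badly, so its median squared residual is large and it is never selected.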
Let's get the coefficients of the robust fit and compare them with the traditional least squares fit:
b = outLXS.beta; % Fit coefficients
outliers = outLXS.outliers; % Indices of the points flagged as outliers
figure
plot(Xc,yc,'o'); % Plot data
hold on % Show the outlier in red
plot(Xc(end),yc(end),'o','MarkerFaceColor','auto','Color','r')
plot(Xc,contaminatedTradFit.predict(Xc)); % Plot traditional least squares fit
plot(Xc,b(1)+b(2)*Xc, 'r','LineWidth' ,1); % Plot robust fit
legend({"Forbes' points","Outlier","Least squares fit","Robust Fit"},Location="Best");
Not only is the fit returned by LXS robust to our deliberate contamination, but it also identifies the contaminating point as an outlier. In addition, it flags one of the points in the original data as a potential outlier. The indices of the flagged points are in the outliers field of the structure returned by LXS, which we stored in the variable outliers above.
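One common way to turn a robust fit into outlier flags is to compare each residual with a robust estimate of the noise scale. The sketch below does this on synthetic data: it reuses a simple random-pair Least Median of Squares search, estimates the scale as 1.4826 times the square root of the minimized median squared residual, and flags points whose residuals exceed 2.5 times that scale. The threshold, the omission of any finite-sample correction, and all variable names are assumptions for illustration; the rule that LXS applies may differ.

```matlab
% Synthetic data (NOT the Forbes data): a line plus one gross outlier at index 21
rng(2);
x = (1:20)';
y = 2*x + 1 + 0.1*randn(20,1);
x(end+1) = 21;  y(end+1) = -40;        % hypothetical careless measurement
n = numel(x);

% Simple LMS-style search over lines through random pairs of points
bestCrit = Inf;  bestBeta = [NaN; NaN];
for k = 1:500
    idx = randperm(n, 2);
    beta = [ones(2,1) x(idx)] \ y(idx);
    crit = median((y - beta(1) - beta(2)*x).^2);
    if crit < bestCrit
        bestCrit = crit;  bestBeta = beta;
    end
end

% Robust scale: 1.4826 makes the median absolute residual consistent
% with the standard deviation under normally distributed errors
res = y - bestBeta(1) - bestBeta(2)*x;
scale = 1.4826 * sqrt(bestCrit);

% Flag points whose residual is large relative to the robust scale
flagged = find(abs(res) > 2.5*scale);  % 2.5 is an assumed cutoff
disp(flagged')
```

Because the scale is estimated from the median rather than from all residuals, the outlier cannot inflate it and hide itself. A rule like this can also flag genuine observations that happen to sit far from the line, which matches what happens with one of Forbes' original points.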
Here's the plot once again, this time with the outliers clearly marked:
figure
plot(Xc,yc,'o'); % Plot data
hold on
plot(Xc,contaminatedTradFit.predict(Xc)); % Plot traditional least squares fit
plot(Xc,b(1)+b(2)*Xc, 'r','LineWidth' ,1); % Plot robust fit
plot(Xc(outliers),yc(outliers),'.', MarkerSize=20)
legend({"Points","Least squares fit","Robust Fit","Outliers"},Location="Best");
This is an example of robust linear regression, which is just one of the areas of statistics and data analysis covered by the FSDA toolbox. For a more in-depth discussion and an introduction to a suite of algorithms, refer to the documentation at Introduction to robust estimators in linear regression. Further analysis of this dataset is contained in Atkinson and Riani (2000). A comparison of different forms of robust regression estimators is given in the forthcoming book Atkinson et al. (2024).
References
Atkinson, A.C. and Riani, M. (2000). Robust Diagnostic Regression Analysis. Springer-Verlag, New York.
Atkinson, A.C., Riani, M., Corbellini, A., Perrotta, D., and Todorov, V. (2024). Applied Robust Statistics through the Monitoring Approach. Heidelberg: Springer Nature. https://github.com/UniprJRC/FigMonitoringBook
Rousseeuw, P.J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79:388, 871-880.
Forbes, J. (1857). Further experiments and remarks on the measurement of heights by the boiling point of water. Transactions of the Royal Society of Edinburgh, 21, 235-243.