In 1857, Scottish physicist James David Forbes published a paper on the relationship between atmospheric pressure and the boiling point of water. To collect data, he traveled around the Alps, measuring the boiling point of water and the atmospheric pressure at various altitudes. The idea was that climbers could measure the boiling point of water to estimate atmospheric pressure, and hence altitude, instead of carrying fragile and expensive barometers.

The data he collected are included in the open-source MATLAB toolbox FSDA (Flexible Statistics and Data Analysis). If you want to run the code examples in this blog post, you'll need to install it from the File Exchange or from MATLAB Add-ons.

Let's take a look at the data, along with the traditional least squares linear regression fit provided by fitlm from MathWorks' Statistics and Machine Learning Toolbox.

load('forbes.txt'); % Requires FSDA toolbox to be installed

y=forbes(:,2);

X=forbes(:,1);

X = (X - 32) * 5/9; % Convert to Celsius

plot(X,y,'o');

tradFit = fitlm(X,y); % Use Statistics toolbox to fit the data

hold on

plot(X,tradFit.predict(X)); % plot the fitted line

xlabel('Boiling point', 'Fontsize',16);

ylabel('100 x log(pressure)','Fontsize',16);

title("Forbes data with Linear Regression Fit")

legend({"Points","Least squares fit"},Location="Best");

hold off

Now, imagine that Forbes took a student with him and allowed him to take one additional measurement, and that the student was very careless when taking it. Forbes, however, trusted him completely, so we'll add this measurement to the data and see how it affects things.

%% Contaminated data (just 1 outlier)

yc = y;

yc(end+1) = 140;

Xc = X;

Xc(end+1) = 110;

plot(Xc,yc,'o');

hold on % Show the outlier in red

plot(Xc(end),yc(end),'o','MarkerFaceColor','auto','Color','r')

contaminatedTradFit = fitlm(Xc,yc);

plot(Xc,contaminatedTradFit.predict(Xc)); % plot the fitted line

xlabel('Boiling point', 'Fontsize',16);

ylabel('100 x log(pressure)','Fontsize',16);

legend({"Forbes' points","Outlier","Least squares fit"},Location="Best");

title("Contaminated Forbes data with Linear Regression Fit")

hold off

That one outlier has pulled the fitted line well away from the rest of the data. Standard least squares regression is not robust to this type of contamination.
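
To put a number on that effect rather than judging it from a plot, you can compare the fitted coefficients before and after adding a single bad point. The sketch below is only an illustration: it uses made-up data that lie exactly on a line (not Forbes' measurements) and base MATLAB's polyfit instead of fitlm, so it runs without any toolbox. The variable names are hypothetical and chosen so they do not clash with the ones used in this post.

```matlab
% Synthetic illustration only: these are NOT Forbes' measurements.
xClean = (1:10)';              % hypothetical predictor values
yClean = 2*xClean + 1;         % points lying exactly on y = 2x + 1

pClean = polyfit(xClean, yClean, 1);   % returns [slope intercept]

% Append one grossly wrong observation (the line would give 23 at x = 11)
xBad = [xClean; 11];
yBad = [yClean; -20];

pCont = polyfit(xBad, yBad, 1);        % least squares fit with the outlier included

fprintf('Clean fit:        slope = %.3f, intercept = %.3f\n', pClean(1), pClean(2));
fprintf('Contaminated fit: slope = %.3f, intercept = %.3f\n', pCont(1), pCont(2));
```

Because least squares minimizes the sum of squared residuals, a single point with a very large residual can dominate the objective, which is why the slope moves so much.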

Enter the LXS function from the FSDA toolbox, which performs a more robust fit. By default, it uses an algorithm called Least Median of Squares regression, first described in a 1984 paper by Peter Rousseeuw that has been cited over 5,000 times at the time of writing.
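
The idea behind Least Median of Squares is to choose the line that minimizes the median of the squared residuals instead of their sum, so that a minority of wild points cannot dominate the fit. The following is a minimal sketch of that idea for a straight line, using random two-point subsets on synthetic data. It is an illustration under my own assumptions (number of trials, plain median as the criterion), not the FSDA implementation, which is considerably more careful about subset sampling and the quantile it uses.

```matlab
% Minimal sketch of Least Median of Squares for a straight line.
% Synthetic data; an illustration only, not the LXS algorithm from FSDA.
rng(0);                                  % reproducible random subsets
xs = (1:20)';
ys = 3 + 0.5*xs;                         % exact line: intercept 3, slope 0.5
ys(end) = 40;                            % one gross outlier

nTrials  = 500;                          % number of random subsets (assumed value)
n        = numel(xs);
bestCrit = Inf;
betaLMS  = [NaN; NaN];
for k = 1:nTrials
    idx = randperm(n, 2);                % two distinct points define a candidate line
    slope     = (ys(idx(2)) - ys(idx(1))) / (xs(idx(2)) - xs(idx(1)));
    intercept = ys(idx(1)) - slope*xs(idx(1));
    res2 = (ys - (intercept + slope*xs)).^2;
    crit = median(res2);                 % LMS objective: median squared residual
    if crit < bestCrit
        bestCrit = crit;
        betaLMS  = [intercept; slope];
    end
end

betaLS = [ones(n,1) xs] \ ys;            % ordinary least squares for comparison
disp(table(betaLS, betaLMS, 'VariableNames', {'LeastSquares','LMS'}, ...
    'RowNames', {'Intercept','Slope'}));
```

Any candidate line through two uncontaminated points fits at least half of the data exactly, so its median squared residual is zero and the outlier has no influence on which line is selected.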

[outLXS]=LXS(yc,Xc);

Let's get the coefficients of the robust fit and compare it with the traditional least squares fit.

b = outLXS.beta; % Fit coefficients

outliers = outLXS.outliers;

plot(Xc,yc,'o'); % Plot data

hold on % Show the outlier in red

plot(Xc(end),yc(end),'o','MarkerFaceColor','auto','Color','r')

plot(Xc,contaminatedTradFit.predict(Xc)); % Plot traditional least squares fit

plot(Xc,b(1)+b(2)*Xc, 'r','LineWidth' ,1); % Plot robust fit

legend({"Forbes' points","Outlier","Least squares fit","Robust Fit"},Location="Best");

hold off

Not only is the fit returned by LXS robust to our deliberate contamination, but it also identifies the contaminated point as an outlier. In addition, it flags one of the points in the original data as a potential outlier. We can get the indices of the outliers from the structure returned by LXS.

outLXS.outliers
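
For intuition about how a robust fit can flag points, a common rule of thumb is to scale the residuals by a robust estimate of spread and mark those that exceed a cutoff. The sketch below uses hypothetical residuals, the median absolute deviation (MAD) as the scale, and an assumed cutoff of 2.5. I am not claiming this is the exact rule LXS applies; see the FSDA documentation for the details of its outlier detection.

```matlab
% Sketch of a common outlier rule; not necessarily the rule used by LXS.
% Hypothetical residuals from some robust fit (made-up values).
res = [0.1; -0.2; 0.05; 0.15; -0.1; 0.0; 0.2; -0.15; 6.0];

% Robust scale: median absolute deviation, rescaled so that it estimates
% the standard deviation when the errors are normally distributed
s = 1.4826 * median(abs(res - median(res)));

cutoff  = 2.5;                               % assumed rule-of-thumb threshold
flagged = find(abs(res) ./ s > cutoff);      % indices of suspected outliers
disp(flagged)
```

Using the MAD rather than the standard deviation matters here: the standard deviation would itself be inflated by the outlier, which can hide the very point you are trying to detect.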

Here's the plot once again, this time with the outliers clearly marked.

plot(Xc,yc,'o'); % Plot data

hold on

plot(Xc,contaminatedTradFit.predict(Xc)); % Plot traditional least squares fit

plot(Xc,b(1)+b(2)*Xc, 'r','LineWidth' ,1); % Plot robust fit

plot(Xc(outliers),yc(outliers),'.', MarkerSize=20)

legend({"Points","Least squares fit","Robust Fit","Outliers"},Location="Best");

hold off

This is an example of robust linear regression, which is just one of the areas of statistics and data analysis covered by the FSDA toolbox. For a more in-depth discussion and an introduction to a suite of algorithms, refer to the FSDA documentation page "Introduction to robust estimators in linear regression". Further analysis of this dataset is contained in Atkinson and Riani (2000). A comparison of different forms of robust regression estimators is given in the forthcoming book by Atkinson et al. (2024).

Developed at Università di Parma and the Joint Research Centre of the European Commission, the FSDA toolbox is one of the most popular toolboxes on the MathWorks File Exchange. It contains over 300 functions covering areas such as Robust Regression Analysis, Robust Multivariate Analysis and Robust Cluster Analysis.

References

Atkinson, A.C. and Riani, M. (2000). Robust Diagnostic Regression Analysis. Springer-Verlag, New York.

Atkinson, A.C., Riani, M., Corbellini, A., Perrotta, D., and Todorov, V. (2024). Applied Robust Statistics through the Monitoring Approach. Springer Nature, Heidelberg. https://github.com/UniprJRC/FigMonitoringBook

Rousseeuw, P.J. (1984). Least Median of Squares Regression. Journal of the American Statistical Association, 79(388), 871-880.

Forbes, J. (1857). Further experiments and remarks on the measurement of heights and boiling point of water. Transactions of the Royal Society of Edinburgh, 21, 235-243.
