Wikia

Psychology Wiki

Changes: Least squares

Edit

Back to page

(History)
(update wp +see also)
 
Line 1: Line 1:
 
{{StatsPsy}}
 
{{StatsPsy}}
  +
{{PsyPerspective}}
   
'''Least squares''' is a [[mathematics|mathematical]] [[optimization (mathematics)|optimization]] technique which, when given a series of measured data, attempts to find a [[function (mathematics)|function]] which closely approximates the data (a "best fit"). It attempts to [[maxima and minima|minimize]] the sum of the squares of the ordinate differences (called ''[[errors and residuals in statistics|residuals]]'') between points generated by the function and corresponding points in the data. Specifically, it is called ''[[least mean squares]]'' (LMS) when the number of measured data is 1 and the [[gradient descent]] method is used to minimize the squared residual. LMS is known to minimize the expectation of the squared residual, with the smallest operations (per iteration). But it requires a large number of iterations to converge.
+
In [[regression analysis]], '''least squares''', also known as '''ordinary least squares''' analysis, is a method for [[linear regression]] that determines the values of unknown quantities in a statistical model by minimizing the sum of the squared [[errors and residuals in statistics|residuals]] (the difference between the predicted and observed values). This method was first described by [[Carl Friedrich Gauss]] around 1794.<ref name=brertscher>{{cite book|author = Bretscher, Otto|title = Linear Algebra With Applications, 3rd ed.|publisher = Prentice Hall|year = 1995|location = Upper Saddle River NJ}}</ref> Today, this method is available in most [[statistical software]] packages. The least-squares approach to regression analysis has been shown to be optimal in the sense that it satisfies the [[Gauss-Markov theorem]].
   
An implicit requirement for the least squares method to work is that errors in each measurement be randomly distributed. The [[Gauss-Markov theorem]] proves that least square estimators are unbiased and that the sample data do not have to comply with, for instance, a [[normal distribution]]. It is also important that the collected data be well chosen, so as to allow visibility into the variables to be solved for (for giving more weight to particular data, refer to [[weighted least squares]]).
+
A related method is the '''[[least mean squares]]''' (LMS) method. It occurs when the number of measured data is 1 and the [[gradient descent]] method is used to minimize the squared residual. LMS is known to minimize the expectation of the squared residual, with the smallest number of operations per iteration). However, it requires a large number of iterations to converge.
   
The least squares technique is commonly used in [[curve fitting]]. Many other optimization problems can also be expressed in a least squares form, by either minimizing [[energy]] or maximizing [[entropy]].
+
Many other types of [[optimization (mathematics)|optimization]] problems can be expressed in a least squares form, by either minimizing [[energy]] or maximizing [[Information entropy|entropy]].
   
 
==History==
 
==History==
   
  +
===Context===
  +
  +
The method of least squares grew out of the fields of [[astronomy]] and [[geodesy]] as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth's oceans during the [[Age of Exploration]]. The accurate description of the behavior of celestial bodies was key to enabling ships to sail in open seas where before sailors had to rely on land sightings to determine the positions of their ships.
  +
  +
The method was the culmination of several advances that took place during the course of the [[18th century|eighteenth century]]<ref name=stigler>{{cite book
  +
| author = Stigler, Stephen M.
  +
| title = The History of Statistics: The Measurement of Uncertainty Before 1900
  +
| publisher = Belknap Press of Harvard University Press
  +
| year = 1986
  +
| location = Cambridge, MA
  +
}}</ref>:
  +
  +
*The combination of different observations taken under the ''same'' conditions as opposed to simply trying one's best to observe and record a single observation accurately. This approach was notably used by [[Tobias Mayer]] while studying the [[libration]]s of the moon.
  +
*The combination of different observations as being the best estimate of the true value; errors decrease with aggregation rather than increase, perhaps first expressed by [[Roger Cotes]].
  +
*The combination of different observations taken under ''different'' conditions as notably performed by [[Roger Joseph Boscovich]] in his work on the shape of the earth and [[Pierre-Simon Laplace]] in his work in explaining the differences in motion of [[Jupiter]] and [[Saturn]].
  +
*The development of a criterion that can be evaluated to determine when the solution with the minimum error has been achieved, developed by Laplace in his Method of Situation.
  +
  +
===The method itself===
 
[[Image:Carl Friedrich Gauss.jpg|120px|thumb|right|[[Carl Friedrich Gauss]]]]
 
[[Image:Carl Friedrich Gauss.jpg|120px|thumb|right|[[Carl Friedrich Gauss]]]]
   
On New Year's Day in 1801, the Italian astronomer Giuseppe Piazzi discovered the asteroid Ceres. He was able to track its path for 40 days. In the course of the year, many scientists tried to estimate the trajectory on the basis of Piazzi's observations (solving Kepler's nonlinear equations of motion is very difficult). Most evaluations were useless; the only calculation precise enough to allow Zach, a German astronomer, to recover Ceres at the end of the year, was that of 24-year-old [[Carl Friedrich Gauss]] (the fundamentals of his approach had already been accomplished by him in 1795, when he was still 18 years old). But his method of least squares was not published until 1809, when it appeared in volume two of his work on celestial mechanics, ''Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium''. The French Adrien-Marie Legendre independently developed the same method in 1805.
+
[[Carl Friedrich Gauss]] is credited with developing the fundamentals of the basis for least-squares analysis in [[1795]] at the age of eighteen.
   
In 1829 Gauss was able to state the reason for this procedure's outstanding success: The method of least squares is simply optimal in many respects. The precise argument is known as the [[Gauss-Markov theorem]].
+
An early demonstration of the strength of Gauss's method came when it was used to predict the future location of the newly discovered asteroid [[Ceres (asteroid)|Ceres]]. On [[January 1st]], [[1801]], the Italian astronomer [[Giuseppe Piazzi]] discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on this data, it was desired to determine the location of Ceres after it emerged from behind the sun without solving the complicated [[Kepler's laws of planetary motion|Kepler's nonlinear equations]] of planetary motion. The only predictions that successfully allowed Hungarian astronomer [[Franz Xaver von Zach]] to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.
   
==Formulation of the problem==
+
Gauss did not publish the method until [[1809]], when it appeared in volume two of his work on celestial mechanics, ''Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium''.
  +
In [[1829]], Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimators of the coefficients is the least-squares estimators. This result is known as the [[Gauss-Markov theorem]].
   
Suppose that the data set consists of the points <math>(x_i, y_i)</math> with <math>i = 1, 2,\dots, n</math>. We want to find a function <math>f</math> such that <math>f(x_i)\approx y_i.</math>
+
The idea of least-squares analysis was also independently formulated by the Frenchman [[Adrien-Marie Legendre]] in [[1805]] and the American [[Robert Adrain]] in [[1808]].
   
To attain this goal, we suppose that the function ''f'' is of a particular form containing some parameters which need to be determined. For instance, suppose that it is [[quadratic function|quadratic]], meaning that <math>f(x) = ax^2+bx+c</math><!--''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c''-->, where <math>a</math>, <math>b</math> and <math>c</math> are not yet known. We now seek the values of <math>a</math>, <math>b</math> and <math>c</math> that minimize the sum of the squares of the residuals (<math>S</math>):
+
==Problem statement==
:<math> S = \sum_{i=1}^n (y_i - f(x_i))^2. </math>
 
This explains the name ''least squares''.
 
   
==Solving the least squares problem==
+
The objective consists of adjusting a model function to best fit a data set. The chosen model function has adjustable parameters. The data set consist of n points <math>(y_i,\vec{x}_i)</math> with <math>i = 1, 2,\dots, n</math>. The model function has the form <math>y=f(\vec{x},\vec{a})</math>, where <math> y </math> is the dependent variable, <math>\vec{x}</math> are the independent variables, and <math>\vec{a}</math> are the model adjustable parameters. We wish to find the parameter values such that the model best fits the data according to a defined error criterion. The least sum square method minimizes the sum square error equation
  +
<math> S = \sum_{i=1}^n (y_i - f(\vec{x}_i,\vec{a}))^2 </math> with respect to the adjustable parameters <math>\vec{a}</math>.
   
In the above example, ''f'' is [[linear function|linear]] in the parameters ''a'', ''b'' and ''c''. The problem simplifies considerably in this case and essentially reduces to a [[system of linear equations]]. This is explained in the article on [[linear least squares]].
+
For an example, the data is height measurements over a surface. We choose to model the data by a plane with parameters for plane mean height, plane tip angle, and plane tilt angle. The model equation is then <math> y = f ( x_1, x_2 ) = a_1 + a_2 x_1 + a_3 x_2 </math>, the independent variables are <math> \vec{x}=(x_1,x_2)</math>, and the adjustable parameters are <math>\vec{a}=(a_1,a_2,a_3)</math>.
   
The problem is more difficult if ''f'' is not linear in the parameters to be determined. We then need to solve a general (unconstrained) [[optimization (mathematics)|optimization]] problem. Any algorithm for such problems, like [[Newton's method in optimization|Newton's method]] and [[gradient descent]], can be used. Another possibility is to apply an algorithm that is developed especially to tackle least squares problems, for instance the [[Gauss-Newton algorithm]] or the [[Levenberg-Marquardt algorithm]].
+
==Solving the least squares problem==
  +
  +
Least square optimization problems can be divided into linear and non-linear problems. The linear problem has a closed form solution. The optimization problem is said to be a linear optimization problem if the first order partial derivatives of ''S'' with respect to the parameters <math> \vec{a} </math> results in a set of equations that is linear in the parameter variables. The general, non-linear, unconstrained [[optimization (mathematics)|optimization]] problem has no closed form solution. In this case recursive methods, such as [[Newton's method in optimization|Newton's method]], combined with the [[gradient descent]] method, or specialized methods for least squares analysis, such as the [[Gauss-Newton algorithm]] or the [[Levenberg-Marquardt algorithm]] can be used.
   
 
==Least squares and regression analysis==
 
==Least squares and regression analysis==
Line 37: Line 56:
 
:<math>f(x_i) = y_i + \varepsilon_i,</math>
 
:<math>f(x_i) = y_i + \varepsilon_i,</math>
   
where the noise term &epsilon; is a [[random variable]] with mean zero. Note that we are assuming that the <math>x</math> values are exact, and all the errors are in the <math>y</math> values. Again, we distinguish between [[linear regression]], in which case the function ''f'' is linear in the parameters to be determined (e.g., ''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c''), and [[nonlinear regression]]. As before, linear regression is much simpler than nonlinear regression. (It is tempting to think that the reason for the name ''linear regression'' is that the graph of the function ''f''(''x'') = ''ax'' + ''b'' is a line. Fitting a curve ''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c'', estimating ''a'', ''b'', and ''c'' by least squares, is an instance of ''linear'' regression because the vector of least-square estimates of ''a'', ''b'', and ''c'' is a [[linear transformation]] of the vector whose components are ''f''(''x''<sub>''i''</sub>)&nbsp;+&nbsp;&epsilon;<sub>''i''</sub>.)
+
where the noise term ε is a [[random variable]] with mean zero. Note that we are assuming that the <math>x</math> values are exact, and all the errors are in the <math>y</math> values. Again, we distinguish between [[linear regression]], in which case the function ''f'' is linear in the parameters to be determined (e.g., ''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c''), and [[nonlinear regression]]. As before, linear regression is much simpler than nonlinear regression. (It is tempting to think that the reason for the name ''linear regression'' is that the graph of the function ''f''(''x'') = ''ax'' + ''b'' is a line. But fitting a curve like ''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c'' when estimating ''a'', ''b'', and ''c'' by least squares, is an instance of ''linear'' regression because the vector of least-square estimates of ''a'', ''b'', and ''c'' is a [[linear transformation]] of the vector whose components are ''f''(''x''<sub>''i''</sub>)&nbsp;+&nbsp;ε<sub>''i''</sub>.
   
One frequently estimates the parameters (''a'', ''b'' and ''c'' in the above example) by least squares: those values are taken that minimize the sum ''S''. The [[Gauss–Markov theorem]] states that the least squares estimates are optimal in a certain sense, if we take ''f''(''x'') = ''ax'' + ''b'' with ''a'' and ''b'' to be determined and the noise terms &epsilon; are independent and identically distributed (see the [[Gauss–Markov theorem|article]] for a more precise statement and less restrictive conditions on the noise terms).
+
===Parameter estimates===
  +
By recognizing that the <math>y_i = \alpha + \beta x_i + \varepsilon_i </math> regression model is a system of linear equations we can express the model using data matrix '''X''', ''target'' vector '''Y''' and parameter vector <math>\delta</math>. The ''i''th row of '''X''' and '''Y''' will contain the ''x'' and ''y'' value for the ''i''th data sample. Then the model can be written as
   
Least squares estimation for linear models is notoriously non-robust to outliers. If the distribution of the outliers is skewed, the estimates can be biased. In the presence of any outliers, the least squares estimates are inefficient and can be extremely so. When outliers occur in the data, methods of [[robust regression]] are more appropriate.
+
: <math> \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}= \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} </math>
  +
which when using pure matrix notation becomes
  +
: <math>Y = X \delta + \varepsilon \,</math>
  +
  +
where ε is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance σ<sup>2</sup> ''I''<sub>''n''</sub>, where ''I<sub>n</sub>'' is the ''n''&times;''n'' identity matrix.
  +
  +
The [[least-squares estimation of linear regression coefficients|least-squares estimator]] for <math>\delta</math> is
  +
  +
: <math>\widehat{\delta} = (X^T X)^{-1}\; X^T Y \,</math>
  +
  +
(where ''X''<sup>T</sup> is the transpose of ''X'') and the sum of squares of residuals is
  +
  +
: <math>Y^T (I_n - X (X^T X)^{-1} X^T)\, Y.</math>
  +
  +
One of the properties of least-squares is that the matrix <math>X\widehat{\delta}</math> is the orthogonal projection of ''Y'' onto the column space of ''X''.
  +
  +
The fact that the matrix ''X''(''X''<sup>T</sup>''X'')<sup>&minus;1</sup>''X''<sup>T</sup> is a [[symmetric matrix|symmetric]] [[idempotent]] matrix is incessantly relied on in proofs of theorems. The linearity of <math>\widehat{\delta}</math> as a function of the vector ''Y'', expressed above by saying
  +
  +
:<math>\widehat{\delta} = (X^TX)^{-1}X^TY,\,</math>
  +
  +
is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.
  +
  +
The matrix ''I<sub>n</sub>''&nbsp;&minus;&nbsp;''X'' (''X''<sup>T</sup> ''X'')<sup>&minus;1</sup> ''X''<sup>T</sup> that appears above is a symmetric idempotent matrix of rank ''n''&nbsp;&minus;&nbsp;2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional [[spectral theorem]] of [[linear algebra]] says that any real symmetric matrix ''M'' can be diagonalized by an [[orthogonal matrix]] ''G'', i.e., the matrix ''G''&prime;''MG'' is a diagonal matrix. If the matrix ''M'' is also idempotent, then the diagonal entries in ''G''&prime;''MG'' must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. So ''I''<sub>''n''</sub>&nbsp;&minus;&nbsp;''X''(''X''<sup>T</sup>''X'')<sup>&nbsp;-1</sup>''X''<sup>T</sup>, after diagonalization, has ''n'' &minus; 2 1s and two 0s on the diagonal. That is most of the work in showing that the sum of squares of residuals has a [[chi-square distribution]] with ''n''&minus;2 degrees of freedom.
  +
  +
Regression parameters can also be estimated by [[Bayesian]] methods. This has the advantages that
  +
  +
* [[confidence interval]]s can be produced for parameter estimates without the use of asymptotic approximations,
  +
* prior information can be incorporated into the analysis.
  +
  +
Suppose that in the linear regression
  +
  +
:<math> y = \alpha + \beta x + \varepsilon \, </math>
  +
  +
we know from domain knowledge that alpha can only take one of the values {&minus;1,&nbsp;+1} but we do not know which. We can build this information into the analysis by choosing a prior for alpha which is a discrete distribution with a probability of 0.5 on &minus;1 and 0.5 on +1. The posterior for alpha will also be a discrete distribution on {&minus;1,&nbsp;+1}, but the probability weights will change to reflect the evidence from the data.
  +
  +
In modern computer applications, the actual value of <math> \beta</math> is calculated using the [[QR decomposition]] or slightly more robust methods when <math>X^TX</math> is near singular. The code for the [[MATLAB]] backslash function, "<code>\</code>", is an excellent example of a robust method.
  +
  +
==== Summarizing the data ====
  +
  +
We sum the observations, the squares of the ''X''s and the products ''XY'' to obtain the following quantities.
  +
  +
: <math>S_X = x_1 + x_2 + \cdots + x_n \,</math>
  +
  +
: <math>S_Y = y_1 + y_2 + \cdots + y_n \,</math>
  +
  +
: <math>S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2 \,</math>
  +
  +
: <math>S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \,</math>
  +
  +
==== Estimating beta (the slope) ====
  +
  +
We use the summary statistics above to calculate <math>\widehat\beta</math>, the estimate of β.
  +
  +
: <math>\widehat\beta = {n S_{XY} - S_X S_Y \over n S_{XX} - S_X S_X}. \,</math>
  +
  +
==== Estimating alpha (the intercept) ====
  +
  +
We use the estimate of β and the other statistics to estimate α by:
  +
  +
: <math>\widehat\alpha = {S_Y - \widehat\beta S_X \over n}. \,</math>
  +
  +
A consequence of this estimate is that the regression line will always pass through the "center" <math>(\bar{x},\bar{y}) = (S_X/n, S_Y/n)</math>.
  +
  +
==Limitations==
  +
Least squares estimation for linear models is notoriously non-robust to [[outliers]]. If the distribution of the outliers is skewed, the estimates can be biased. In the presence of any outliers, the least squares estimates are inefficient and can be extremely slow. When outliers occur in the data, methods of [[robust regression]] are more appropriate.
   
 
==References==
 
==References==
*{{cite paper | author=Abdi, H | title = [http://www.utdallas.edu/~herve/Abdi-LeastSquares-pretty.pdf] (2003). Least-squares. In M. Lewis-Beck, A. Bryman, T. Futing (Eds): Encyclopedia for research methods for the social sciences. Thousand Oaks (CA): Sage. pp. 792-795.] | year = 2003 |}}
+
{{Refimprove|date=November 2007}}
==See also==
+
{{reflist}}
   
  +
==See also==
  +
<div style="-moz-column-count:2; column-count:2;">
  +
* [[Error of measurement]]
 
* [[Isotonic regression]]
 
* [[Isotonic regression]]
 
* [[Least mean squares filter]]
 
* [[Least mean squares filter]]
 
* [[Least-squares estimation of linear regression coefficients]]
 
* [[Least-squares estimation of linear regression coefficients]]
  +
* [[Linear least squares]]
 
* [[Linear regression]]
 
* [[Linear regression]]
  +
* [[Segmented regression]]
  +
* [[Measurement uncertainty]]
 
* [[Moving least squares]]
 
* [[Moving least squares]]
  +
* [[Numerical smoothing and differentiation]]
  +
* [[Recursive least squares]]
 
* [[Regression analysis]]
 
* [[Regression analysis]]
 
* [[Robust regression]]
 
* [[Robust regression]]
 
* [[Root mean square]]
 
* [[Root mean square]]
  +
* [[Statistical regression]]
 
* [[Total least squares]] or [[errors-in-variables model]]
 
* [[Total least squares]] or [[errors-in-variables model]]
 
* [[Weighted least squares]]
 
* [[Weighted least squares]]
  +
</div>
   
 
==External links==
 
==External links==
   
  +
* [http://video.google.com/videoplay?docid=1257250980743213153 MIT Linear Algebra Lecture on Least Squares] at Google Video, from MIT OpenCourseWare
 
* http://www.physics.csbsju.edu/stats/least_squares.html
 
* http://www.physics.csbsju.edu/stats/least_squares.html
 
* [http://www.zunzun.com Zunzun.com] - Online curve and surface fitting
 
* [http://www.zunzun.com Zunzun.com] - Online curve and surface fitting
 
* http://www.orbitals.com/self/least/least.htm
 
* http://www.orbitals.com/self/least/least.htm
  +
* {{MathWorld | urlname= LeastSquaresFittingPolynomial | title= Least Squares Fitting Polynomial }}
  +
*[http://math.fullerton.edu/mathews/n2003/LeastSqPolyMod.html Module for Least Squares Polynomials]
 
* {{planetmath reference|id=1195|title=Least squares}}
 
* {{planetmath reference|id=1195|title=Least squares}}
  +
* [http://www.personal.psu.edu/faculty/j/h/jhm/f90/lectures/lsq2.html Derivation of quadratic least squares]
  +
* [http://www.xuru.org/rt/TOC.asp xuru.org] Online least squares and regression tools
  +
  +
[[Category:Optimization]]
  +
[[Category:Statistical estimation]]
   
[[Category:Optimization]][[Category:Statistics]]
 
[[Category:Single equation methods (econometrics)]]
 
[[Category:Mathematical and quantitative methods (economics)]]
 
   
  +
<!--
 
{{Link FA|de}}
 
{{Link FA|de}}
   
 
[[cs:Metoda nejmenších čtverců]]
 
[[cs:Metoda nejmenších čtverců]]
 
[[de:Methode der kleinsten Quadrate]]
 
[[de:Methode der kleinsten Quadrate]]
  +
[[es:Mínimos cuadrados]]
 
[[fr:Méthode des moindres carrés]]
 
[[fr:Méthode des moindres carrés]]
 
[[gl:Mínimos cadrados]]
 
[[gl:Mínimos cadrados]]
 
[[it:Metodo dei minimi quadrati]]
 
[[it:Metodo dei minimi quadrati]]
  +
[[he:שיטת הריבועים הפחותים]]
  +
[[hu:Legkisebb négyzetek módszere]]
 
[[nl:Kleinste-kwadratenmethode]]
 
[[nl:Kleinste-kwadratenmethode]]
 
[[ja:最小二乗法]]
 
[[ja:最小二乗法]]
Line 82: Line 120:
 
[[ru:Метод наименьших квадратов]]
 
[[ru:Метод наименьших квадратов]]
 
[[su:Kuadrat leutik]]
 
[[su:Kuadrat leutik]]
  +
[[fi:Pienimmän neliösumman menetelmä]]
 
[[sv:Minstakvadratmetoden]]
 
[[sv:Minstakvadratmetoden]]
 
[[vi:Bình phương tối thiểu]]
 
[[vi:Bình phương tối thiểu]]
  +
[[tr:En küçük kareler yöntemi]]
 
[[zh:最小二乘法]]
 
[[zh:最小二乘法]]
  +
-->
 
{{enWP|Least_squares}}
 
{{enWP|Least_squares}}

Latest revision as of 20:52, January 21, 2008

Assessment | Biopsychology | Comparative | Cognitive | Developmental | Language | Individual differences | Personality | Philosophy | Social |
Methods | Statistics | Clinical | Educational | Industrial | Professional items | World psychology |

Statistics: Scientific method · Research methods · Experimental design · Undergraduate statistics courses · Statistical tests · Game theory · Decision theory


This article needs rewriting to enhance its relevance to psychologists..
Please help to improve this page yourself if you can..


In regression analysis, least squares, also known as ordinary least squares analysis, is a method for linear regression that determines the values of unknown quantities in a statistical model by minimizing the sum of the squared residuals (the difference between the predicted and observed values). This method was first described by Carl Friedrich Gauss around 1794.[1] Today, this method is available in most statistical software packages. The least-squares approach to regression analysis has been shown to be optimal in the sense that it satisfies the Gauss-Markov theorem.

A related method is the least mean squares (LMS) method. It occurs when the number of measured data is 1 and the gradient descent method is used to minimize the squared residual. LMS is known to minimize the expectation of the squared residual, with the smallest number of operations per iteration). However, it requires a large number of iterations to converge.

Many other types of optimization problems can be expressed in a least squares form, by either minimizing energy or maximizing entropy.

HistoryEdit

ContextEdit

The method of least squares grew out of the fields of astronomy and geodesy as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth's oceans during the Age of Exploration. The accurate description of the behavior of celestial bodies was key to enabling ships to sail in open seas where before sailors had to rely on land sightings to determine the positions of their ships.

The method was the culmination of several advances that took place during the course of the eighteenth century[2]:

  • The combination of different observations taken under the same conditions as opposed to simply trying one's best to observe and record a single observation accurately. This approach was notably used by Tobias Mayer while studying the librations of the moon.
  • The combination of different observations as being the best estimate of the true value; errors decrease with aggregation rather than increase, perhaps first expressed by Roger Cotes.
  • The combination of different observations taken under different conditions as notably performed by Roger Joseph Boscovich in his work on the shape of the earth and Pierre-Simon Laplace in his work in explaining the differences in motion of Jupiter and Saturn.
  • The development of a criterion that can be evaluated to determine when the solution with the minimum error has been achieved, developed by Laplace in his Method of Situation.

The method itselfEdit

Carl Friedrich Gauss

Carl Friedrich Gauss

Carl Friedrich Gauss is credited with developing the fundamentals of the basis for least-squares analysis in 1795 at the age of eighteen.

An early demonstration of the strength of Gauss's method came when it was used to predict the future location of the newly discovered asteroid Ceres. On January 1st, 1801, the Italian astronomer Giuseppe Piazzi discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on this data, it was desired to determine the location of Ceres after it emerged from behind the sun without solving the complicated Kepler's nonlinear equations of planetary motion. The only predictions that successfully allowed Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.

Gauss did not publish the method until 1809, when it appeared in volume two of his work on celestial mechanics, Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium. In 1829, Gauss was able to state that the least-squares approach to regression analysis is optimal in the sense that in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimators of the coefficients is the least-squares estimators. This result is known as the Gauss-Markov theorem.

The idea of least-squares analysis was also independently formulated by the Frenchman Adrien-Marie Legendre in 1805 and the American Robert Adrain in 1808.

Problem statementEdit

The objective consists of adjusting a model function to best fit a data set. The chosen model function has adjustable parameters. The data set consist of n points (y_i,\vec{x}_i) with i = 1, 2,\dots, n. The model function has the form y=f(\vec{x},\vec{a}), where  y is the dependent variable, \vec{x} are the independent variables, and \vec{a} are the model adjustable parameters. We wish to find the parameter values such that the model best fits the data according to a defined error criterion. The least sum square method minimizes the sum square error equation  S = \sum_{i=1}^n (y_i - f(\vec{x}_i,\vec{a}))^2 with respect to the adjustable parameters \vec{a}.

For an example, the data is height measurements over a surface. We choose to model the data by a plane with parameters for plane mean height, plane tip angle, and plane tilt angle. The model equation is then  y = f ( x_1, x_2 ) = a_1 + a_2 x_1 + a_3 x_2 , the independent variables are  \vec{x}=(x_1,x_2), and the adjustable parameters are \vec{a}=(a_1,a_2,a_3).

Solving the least squares problemEdit

Least square optimization problems can be divided into linear and non-linear problems. The linear problem has a closed form solution. The optimization problem is said to be a linear optimization problem if the first order partial derivatives of S with respect to the parameters  \vec{a} results in a set of equations that is linear in the parameter variables. The general, non-linear, unconstrained optimization problem has no closed form solution. In this case recursive methods, such as Newton's method, combined with the gradient descent method, or specialized methods for least squares analysis, such as the Gauss-Newton algorithm or the Levenberg-Marquardt algorithm can be used.

Least squares and regression analysisEdit

In regression analysis, one replaces the relation

f(x_i)\approx y_i

by

f(x_i) = y_i + \varepsilon_i,

where the noise term ε is a random variable with mean zero. Note that we are assuming that the x values are exact, and all the errors are in the y values. Again, we distinguish between linear regression, in which case the function f is linear in the parameters to be determined (e.g., f(x) = ax2 + bx + c), and nonlinear regression. As before, linear regression is much simpler than nonlinear regression. (It is tempting to think that the reason for the name linear regression is that the graph of the function f(x) = ax + b is a line. But fitting a curve like f(x) = ax2 + bx + c when estimating a, b, and c by least squares, is an instance of linear regression because the vector of least-square estimates of a, b, and c is a linear transformation of the vector whose components are f(xi) + εi.

Parameter estimatesEdit

By recognizing that the y_i = \alpha + \beta x_i + \varepsilon_i regression model is a system of linear equations we can express the model using data matrix X, target vector Y and parameter vector \delta. The ith row of X and Y will contain the x and y value for the ith data sample. Then the model can be written as

 \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}= \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix}

which when using pure matrix notation becomes

Y = X \delta + \varepsilon \,

where ε is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance σ2 In, where In is the n×n identity matrix.

The least-squares estimator for \delta is

\widehat{\delta} = (X^T X)^{-1}\; X^T Y \,

(where XT is the transpose of X) and the sum of squares of residuals is

Y^T (I_n - X (X^T X)^{-1} X^T)\, Y.

One of the properties of least-squares is that the matrix X\widehat{\delta} is the orthogonal projection of Y onto the column space of X.

The fact that the matrix X(XTX)−1XT is a symmetric idempotent matrix is incessantly relied on in proofs of theorems. The linearity of \widehat{\delta} as a function of the vector Y, expressed above by saying

\widehat{\delta} = (X^TX)^{-1}X^TY,\,

is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.

The matrix In − X (XT X)−1 XT that appears above is a symmetric idempotent matrix of rank n − 2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix M can be diagonalized by an orthogonal matrix G, i.e., the matrix GMG is a diagonal matrix. If the matrix M is also idempotent, then the diagonal entries in GMG must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. So In − X(XTX) -1XT, after diagonalization, has n − 2 1s and two 0s on the diagonal. That is most of the work in showing that the sum of squares of residuals has a chi-square distribution with n−2 degrees of freedom.

Regression parameters can also be estimated by Bayesian methods. This has the advantages that

  • confidence intervals can be produced for parameter estimates without the use of asymptotic approximations,
  • prior information can be incorporated into the analysis.

Suppose that in the linear regression

 y = \alpha + \beta x + \varepsilon \,

we know from domain knowledge that alpha can only take one of the values {−1, +1} but we do not know which. We can build this information into the analysis by choosing a prior for alpha which is a discrete distribution with a probability of 0.5 on −1 and 0.5 on +1. The posterior for alpha will also be a discrete distribution on {−1, +1}, but the probability weights will change to reflect the evidence from the data.

In modern computer applications, the actual value of  \beta is calculated using the QR decomposition or slightly more robust methods when X^TX is near singular. The code for the MATLAB backslash function, "\", is an excellent example of a robust method.

Summarizing the data Edit

We sum the observations, the squares of the Xs and the products XY to obtain the following quantities.

S_X = x_1 + x_2 + \cdots + x_n \,
S_Y = y_1 + y_2 + \cdots + y_n \,
S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2 \,
S_{XY} =  x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \,

Estimating beta (the slope) Edit

We use the summary statistics above to calculate \widehat\beta, the estimate of β.

\widehat\beta = {n S_{XY} - S_X S_Y \over n S_{XX} - S_X S_X}. \,

Estimating alpha (the intercept) Edit

We use the estimate of β and the other statistics to estimate α by:

\widehat\alpha = {S_Y - \widehat\beta S_X \over n}. \,

A consequence of this estimate is that the regression line will always pass through the "center" (\bar{x},\bar{y}) = (S_X/n, S_Y/n).

LimitationsEdit

Least squares estimation for linear models is notoriously non-robust to outliers. If the distribution of the outliers is skewed, the estimates can be biased. In the presence of any outliers, the least squares estimates are inefficient and can be extremely slow. When outliers occur in the data, methods of robust regression are more appropriate.

ReferencesEdit

  1. Bretscher, Otto (1995). Linear Algebra With Applications, 3rd ed., Upper Saddle River NJ: Prentice Hall.
  2. Stigler, Stephen M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900, Cambridge, MA: Belknap Press of Harvard University Press.

See alsoEdit

External linksEdit


This page uses Creative Commons Licensed content from Wikipedia (view authors).

Around Wikia's network

Random Wiki