

Line 1: 
Line 1: 

{{StatsPsy}} 

{{StatsPsy}} 

+ 
{{PsyPerspective}} 




− 
'''Least squares''' is a [[mathematicsmathematical]] [[optimization (mathematics)optimization]] technique which, when given a series of measured data, attempts to find a [[function (mathematics)function]] which closely approximates the data (a "best fit"). It attempts to [[maxima and minimaminimize]] the sum of the squares of the ordinate differences (called ''[[errors and residuals in statisticsresiduals]]'') between points generated by the function and corresponding points in the data. Specifically, it is called ''[[least mean squares]]'' (LMS) when the number of measured data is 1 and the [[gradient descent]] method is used to minimize the squared residual. LMS is known to minimize the expectation of the squared residual, with the smallest operations (per iteration). But it requires a large number of iterations to converge. 
+ 
In [[regression analysis]], '''least squares''', also known as '''ordinary least squares''' analysis, is a method for [[linear regression]] that determines the values of unknown quantities in a statistical model by minimizing the sum of the squared [[errors and residuals in statisticsresiduals]] (the difference between the predicted and observed values). This method was first described by [[Carl Friedrich Gauss]] around 1794.<ref name=brertscher>{{cite bookauthor = Bretscher, Ottotitle = Linear Algebra With Applications, 3rd ed.publisher = Prentice Hallyear = 1995location = Upper Saddle River NJ}}</ref> Today, this method is available in most [[statistical software]] packages. The leastsquares approach to regression analysis has been shown to be optimal in the sense that it satisfies the [[GaussMarkov theorem]]. 




− 
An implicit requirement for the least squares method to work is that errors in each measurement be randomly distributed. The [[GaussMarkov theorem]] proves that least square estimators are unbiased and that the sample data do not have to comply with, for instance, a [[normal distribution]]. It is also important that the collected data be well chosen, so as to allow visibility into the variables to be solved for (for giving more weight to particular data, refer to [[weighted least squares]]). 
+ 
A related method is the '''[[least mean squares]]''' (LMS) method. It occurs when the number of measured data is 1 and the [[gradient descent]] method is used to minimize the squared residual. LMS is known to minimize the expectation of the squared residual, with the smallest number of operations per iteration). However, it requires a large number of iterations to converge. 




− 
The least squares technique is commonly used in [[curve fitting]]. Many other optimization problems can also be expressed in a least squares form, by either minimizing [[energy]] or maximizing [[entropy]]. 
+ 
Many other types of [[optimization (mathematics)optimization]] problems can be expressed in a least squares form, by either minimizing [[energy]] or maximizing [[Information entropyentropy]]. 





==History== 

==History== 





+ 
===Context=== 

+ 


+ 
The method of least squares grew out of the fields of [[astronomy]] and [[geodesy]] as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth's oceans during the [[Age of Exploration]]. The accurate description of the behavior of celestial bodies was key to enabling ships to sail in open seas where before sailors had to rely on land sightings to determine the positions of their ships. 

+ 


+ 
The method was the culmination of several advances that took place during the course of the [[18th centuryeighteenth century]]<ref name=stigler>{{cite book 

+ 
 author = Stigler, Stephen M. 

+ 
 title = The History of Statistics: The Measurement of Uncertainty Before 1900 

+ 
 publisher = Belknap Press of Harvard University Press 

+ 
 year = 1986 

+ 
 location = Cambridge, MA 

+ 
}}</ref>: 

+ 


+ 
*The combination of different observations taken under the ''same'' conditions as opposed to simply trying one's best to observe and record a single observation accurately. This approach was notably used by [[Tobias Mayer]] while studying the [[libration]]s of the moon. 

+ 
*The combination of different observations as being the best estimate of the true value; errors decrease with aggregation rather than increase, perhaps first expressed by [[Roger Cotes]]. 

+ 
*The combination of different observations taken under ''different'' conditions as notably performed by [[Roger Joseph Boscovich]] in his work on the shape of the earth and [[PierreSimon Laplace]] in his work in explaining the differences in motion of [[Jupiter]] and [[Saturn]]. 

+ 
*The development of a criterion that can be evaluated to determine when the solution with the minimum error has been achieved, developed by Laplace in his Method of Situation. 

+ 


+ 
===The method itself=== 

[[Image:Carl Friedrich Gauss.jpg120pxthumbright[[Carl Friedrich Gauss]]]] 

[[Image:Carl Friedrich Gauss.jpg120pxthumbright[[Carl Friedrich Gauss]]]] 




− 
On New Year's Day in 1801, the Italian astronomer Giuseppe Piazzi discovered the asteroid Ceres. He was able to track its path for 40 days. In the course of the year, many scientists tried to estimate the trajectory on the basis of Piazzi's observations (solving Kepler's nonlinear equations of motion is very difficult). Most evaluations were useless; the only calculation precise enough to allow Zach, a German astronomer, to recover Ceres at the end of the year, was that of 24yearold [[Carl Friedrich Gauss]] (the fundamentals of his approach had already been accomplished by him in 1795, when he was still 18 years old). But his method of least squares was not published until 1809, when it appeared in volume two of his work on celestial mechanics, ''Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium''. The French AdrienMarie Legendre independently developed the same method in 1805. 
+ 
[[Carl Friedrich Gauss]] is credited with developing the fundamentals of the basis for leastsquares analysis in [[1795]] at the age of eighteen. 




− 
In 1829 Gauss was able to state the reason for this procedure's outstanding success: The method of least squares is simply optimal in many respects. The precise argument is known as the [[GaussMarkov theorem]]. 
+ 
An early demonstration of the strength of Gauss's method came when it was used to predict the future location of the newly discovered asteroid [[Ceres (asteroid)Ceres]]. On [[January 1st]], [[1801]], the Italian astronomer [[Giuseppe Piazzi]] discovered Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on this data, it was desired to determine the location of Ceres after it emerged from behind the sun without solving the complicated [[Kepler's laws of planetary motionKepler's nonlinear equations]] of planetary motion. The only predictions that successfully allowed Hungarian astronomer [[Franz Xaver von Zach]] to relocate Ceres were those performed by the 24yearold Gauss using leastsquares analysis. 




− 
==Formulation of the problem== 
+ 
Gauss did not publish the method until [[1809]], when it appeared in volume two of his work on celestial mechanics, ''Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium''. 

+ 
In [[1829]], Gauss was able to state that the leastsquares approach to regression analysis is optimal in the sense that in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimators of the coefficients is the leastsquares estimators. This result is known as the [[GaussMarkov theorem]]. 




− 
Suppose that the data set consists of the points <math>(x_i, y_i)</math> with <math>i = 1, 2,\dots, n</math>. We want to find a function <math>f</math> such that <math>f(x_i)\approx y_i.</math> 
+ 
The idea of leastsquares analysis was also independently formulated by the Frenchman [[AdrienMarie Legendre]] in [[1805]] and the American [[Robert Adrain]] in [[1808]]. 




− 
To attain this goal, we suppose that the function ''f'' is of a particular form containing some parameters which need to be determined. For instance, suppose that it is [[quadratic functionquadratic]], meaning that <math>f(x) = ax^2+bx+c</math><!''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c''>, where <math>a</math>, <math>b</math> and <math>c</math> are not yet known. We now seek the values of <math>a</math>, <math>b</math> and <math>c</math> that minimize the sum of the squares of the residuals (<math>S</math>): 
+ 
==Problem statement== 
− 
:<math> S = \sum_{i=1}^n (y_i  f(x_i))^2. </math> 

− 
This explains the name ''least squares''. 





− 
==Solving the least squares problem== 
+ 
The objective consists of adjusting a model function to best fit a data set. The chosen model function has adjustable parameters. The data set consist of n points <math>(y_i,\vec{x}_i)</math> with <math>i = 1, 2,\dots, n</math>. The model function has the form <math>y=f(\vec{x},\vec{a})</math>, where <math> y </math> is the dependent variable, <math>\vec{x}</math> are the independent variables, and <math>\vec{a}</math> are the model adjustable parameters. We wish to find the parameter values such that the model best fits the data according to a defined error criterion. The least sum square method minimizes the sum square error equation 

+ 
<math> S = \sum_{i=1}^n (y_i  f(\vec{x}_i,\vec{a}))^2 </math> with respect to the adjustable parameters <math>\vec{a}</math>. 




− 
In the above example, ''f'' is [[linear functionlinear]] in the parameters ''a'', ''b'' and ''c''. The problem simplifies considerably in this case and essentially reduces to a [[system of linear equations]]. This is explained in the article on [[linear least squares]]. 
+ 
For an example, the data is height measurements over a surface. We choose to model the data by a plane with parameters for plane mean height, plane tip angle, and plane tilt angle. The model equation is then <math> y = f ( x_1, x_2 ) = a_1 + a_2 x_1 + a_3 x_2 </math>, the independent variables are <math> \vec{x}=(x_1,x_2)</math>, and the adjustable parameters are <math>\vec{a}=(a_1,a_2,a_3)</math>. 




− 
The problem is more difficult if ''f'' is not linear in the parameters to be determined. We then need to solve a general (unconstrained) [[optimization (mathematics)optimization]] problem. Any algorithm for such problems, like [[Newton's method in optimizationNewton's method]] and [[gradient descent]], can be used. Another possibility is to apply an algorithm that is developed especially to tackle least squares problems, for instance the [[GaussNewton algorithm]] or the [[LevenbergMarquardt algorithm]]. 
+ 
==Solving the least squares problem== 

+ 


+ 
Least square optimization problems can be divided into linear and nonlinear problems. The linear problem has a closed form solution. The optimization problem is said to be a linear optimization problem if the first order partial derivatives of ''S'' with respect to the parameters <math> \vec{a} </math> results in a set of equations that is linear in the parameter variables. The general, nonlinear, unconstrained [[optimization (mathematics)optimization]] problem has no closed form solution. In this case recursive methods, such as [[Newton's method in optimizationNewton's method]], combined with the [[gradient descent]] method, or specialized methods for least squares analysis, such as the [[GaussNewton algorithm]] or the [[LevenbergMarquardt algorithm]] can be used. 





==Least squares and regression analysis== 

==Least squares and regression analysis== 
Line 37: 
Line 56: 

:<math>f(x_i) = y_i + \varepsilon_i,</math> 

:<math>f(x_i) = y_i + \varepsilon_i,</math> 




− 
where the noise term ε is a [[random variable]] with mean zero. Note that we are assuming that the <math>x</math> values are exact, and all the errors are in the <math>y</math> values. Again, we distinguish between [[linear regression]], in which case the function ''f'' is linear in the parameters to be determined (e.g., ''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c''), and [[nonlinear regression]]. As before, linear regression is much simpler than nonlinear regression. (It is tempting to think that the reason for the name ''linear regression'' is that the graph of the function ''f''(''x'') = ''ax'' + ''b'' is a line. Fitting a curve ''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c'', estimating ''a'', ''b'', and ''c'' by least squares, is an instance of ''linear'' regression because the vector of leastsquare estimates of ''a'', ''b'', and ''c'' is a [[linear transformation]] of the vector whose components are ''f''(''x''<sub>''i''</sub>) + ε<sub>''i''</sub>.) 
+ 
where the noise term ε is a [[random variable]] with mean zero. Note that we are assuming that the <math>x</math> values are exact, and all the errors are in the <math>y</math> values. Again, we distinguish between [[linear regression]], in which case the function ''f'' is linear in the parameters to be determined (e.g., ''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c''), and [[nonlinear regression]]. As before, linear regression is much simpler than nonlinear regression. (It is tempting to think that the reason for the name ''linear regression'' is that the graph of the function ''f''(''x'') = ''ax'' + ''b'' is a line. But fitting a curve like ''f''(''x'') = ''ax''<sup>2</sup> + ''bx'' + ''c'' when estimating ''a'', ''b'', and ''c'' by least squares, is an instance of ''linear'' regression because the vector of leastsquare estimates of ''a'', ''b'', and ''c'' is a [[linear transformation]] of the vector whose components are ''f''(''x''<sub>''i''</sub>) + ε<sub>''i''</sub>. 




− 
One frequently estimates the parameters (''a'', ''b'' and ''c'' in the above example) by least squares: those values are taken that minimize the sum ''S''. The [[Gauss–Markov theorem]] states that the least squares estimates are optimal in a certain sense, if we take ''f''(''x'') = ''ax'' + ''b'' with ''a'' and ''b'' to be determined and the noise terms ε are independent and identically distributed (see the [[Gauss–Markov theoremarticle]] for a more precise statement and less restrictive conditions on the noise terms). 
+ 
===Parameter estimates=== 

+ 
By recognizing that the <math>y_i = \alpha + \beta x_i + \varepsilon_i </math> regression model is a system of linear equations we can express the model using data matrix '''X''', ''target'' vector '''Y''' and parameter vector <math>\delta</math>. The ''i''th row of '''X''' and '''Y''' will contain the ''x'' and ''y'' value for the ''i''th data sample. Then the model can be written as 




− 
Least squares estimation for linear models is notoriously nonrobust to outliers. If the distribution of the outliers is skewed, the estimates can be biased. In the presence of any outliers, the least squares estimates are inefficient and can be extremely so. When outliers occur in the data, methods of [[robust regression]] are more appropriate. 
+ 
: <math> \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}= \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} </math> 

+ 
which when using pure matrix notation becomes 

+ 
: <math>Y = X \delta + \varepsilon \,</math> 

+ 


+ 
where ε is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance σ<sup>2</sup> ''I''<sub>''n''</sub>, where ''I<sub>n</sub>'' is the ''n''×''n'' identity matrix. 

+ 


+ 
The [[leastsquares estimation of linear regression coefficientsleastsquares estimator]] for <math>\delta</math> is 

+ 


+ 
: <math>\widehat{\delta} = (X^T X)^{1}\; X^T Y \,</math> 

+ 


+ 
(where ''X''<sup>T</sup> is the transpose of ''X'') and the sum of squares of residuals is 

+ 


+ 
: <math>Y^T (I_n  X (X^T X)^{1} X^T)\, Y.</math> 

+ 


+ 
One of the properties of leastsquares is that the matrix <math>X\widehat{\delta}</math> is the orthogonal projection of ''Y'' onto the column space of ''X''. 

+ 


+ 
The fact that the matrix ''X''(''X''<sup>T</sup>''X'')<sup>−1</sup>''X''<sup>T</sup> is a [[symmetric matrixsymmetric]] [[idempotent]] matrix is incessantly relied on in proofs of theorems. The linearity of <math>\widehat{\delta}</math> as a function of the vector ''Y'', expressed above by saying 

+ 


+ 
:<math>\widehat{\delta} = (X^TX)^{1}X^TY,\,</math> 

+ 


+ 
is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation. 

+ 


+ 
The matrix ''I<sub>n</sub>'' − ''X'' (''X''<sup>T</sup> ''X'')<sup>−1</sup> ''X''<sup>T</sup> that appears above is a symmetric idempotent matrix of rank ''n'' − 2. Here is an example of the use of that fact in the theory of linear regression. The finitedimensional [[spectral theorem]] of [[linear algebra]] says that any real symmetric matrix ''M'' can be diagonalized by an [[orthogonal matrix]] ''G'', i.e., the matrix ''G''′''MG'' is a diagonal matrix. If the matrix ''M'' is also idempotent, then the diagonal entries in ''G''′''MG'' must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. So ''I''<sub>''n''</sub> − ''X''(''X''<sup>T</sup>''X'')<sup> 1</sup>''X''<sup>T</sup>, after diagonalization, has ''n'' − 2 1s and two 0s on the diagonal. That is most of the work in showing that the sum of squares of residuals has a [[chisquare distribution]] with ''n''−2 degrees of freedom. 

+ 


+ 
Regression parameters can also be estimated by [[Bayesian]] methods. This has the advantages that 

+ 


+ 
* [[confidence interval]]s can be produced for parameter estimates without the use of asymptotic approximations, 

+ 
* prior information can be incorporated into the analysis. 

+ 


+ 
Suppose that in the linear regression 

+ 


+ 
:<math> y = \alpha + \beta x + \varepsilon \, </math> 

+ 


+ 
we know from domain knowledge that alpha can only take one of the values {−1, +1} but we do not know which. We can build this information into the analysis by choosing a prior for alpha which is a discrete distribution with a probability of 0.5 on −1 and 0.5 on +1. The posterior for alpha will also be a discrete distribution on {−1, +1}, but the probability weights will change to reflect the evidence from the data. 

+ 


+ 
In modern computer applications, the actual value of <math> \beta</math> is calculated using the [[QR decomposition]] or slightly more robust methods when <math>X^TX</math> is near singular. The code for the [[MATLAB]] backslash function, "<code>\</code>", is an excellent example of a robust method. 

+ 


+ 
==== Summarizing the data ==== 

+ 


+ 
We sum the observations, the squares of the ''X''s and the products ''XY'' to obtain the following quantities. 

+ 


+ 
: <math>S_X = x_1 + x_2 + \cdots + x_n \,</math> 

+ 


+ 
: <math>S_Y = y_1 + y_2 + \cdots + y_n \,</math> 

+ 


+ 
: <math>S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2 \,</math> 

+ 


+ 
: <math>S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \,</math> 

+ 


+ 
==== Estimating beta (the slope) ==== 

+ 


+ 
We use the summary statistics above to calculate <math>\widehat\beta</math>, the estimate of β. 

+ 


+ 
: <math>\widehat\beta = {n S_{XY}  S_X S_Y \over n S_{XX}  S_X S_X}. \,</math> 

+ 


+ 
==== Estimating alpha (the intercept) ==== 

+ 


+ 
We use the estimate of β and the other statistics to estimate α by: 

+ 


+ 
: <math>\widehat\alpha = {S_Y  \widehat\beta S_X \over n}. \,</math> 

+ 


+ 
A consequence of this estimate is that the regression line will always pass through the "center" <math>(\bar{x},\bar{y}) = (S_X/n, S_Y/n)</math>. 

+ 


+ 
==Limitations== 

+ 
Least squares estimation for linear models is notoriously nonrobust to [[outliers]]. If the distribution of the outliers is skewed, the estimates can be biased. In the presence of any outliers, the least squares estimates are inefficient and can be extremely slow. When outliers occur in the data, methods of [[robust regression]] are more appropriate. 





==References== 

==References== 
− 
*{{cite paper  author=Abdi, H  title = [http://www.utdallas.edu/~herve/AbdiLeastSquarespretty.pdf] (2003). Leastsquares. In M. LewisBeck, A. Bryman, T. Futing (Eds): Encyclopedia for research methods for the social sciences. Thousand Oaks (CA): Sage. pp. 792795.]  year = 2003 }} 
+ 
{{Refimprovedate=November 2007}} 
− 
==See also== 
+ 
{{reflist}} 





+ 
==See also== 

+ 
<div style="mozcolumncount:2; columncount:2;"> 

+ 
* [[Error of measurement]] 

* [[Isotonic regression]] 

* [[Isotonic regression]] 

* [[Least mean squares filter]] 

* [[Least mean squares filter]] 

* [[Leastsquares estimation of linear regression coefficients]] 

* [[Leastsquares estimation of linear regression coefficients]] 

+ 
* [[Linear least squares]] 

* [[Linear regression]] 

* [[Linear regression]] 

+ 
* [[Segmented regression]] 

+ 
* [[Measurement uncertainty]] 

* [[Moving least squares]] 

* [[Moving least squares]] 

+ 
* [[Numerical smoothing and differentiation]] 

+ 
* [[Recursive least squares]] 

* [[Regression analysis]] 

* [[Regression analysis]] 

* [[Robust regression]] 

* [[Robust regression]] 

* [[Root mean square]] 

* [[Root mean square]] 

+ 
* [[Statistical regression]] 

* [[Total least squares]] or [[errorsinvariables model]] 

* [[Total least squares]] or [[errorsinvariables model]] 

* [[Weighted least squares]] 

* [[Weighted least squares]] 

+ 
</div> 





==External links== 

==External links== 





+ 
* [http://video.google.com/videoplay?docid=1257250980743213153 MIT Linear Algebra Lecture on Least Squares] at Google Video, from MIT OpenCourseWare 

* http://www.physics.csbsju.edu/stats/least_squares.html 

* http://www.physics.csbsju.edu/stats/least_squares.html 

* [http://www.zunzun.com Zunzun.com]  Online curve and surface fitting 

* [http://www.zunzun.com Zunzun.com]  Online curve and surface fitting 

* http://www.orbitals.com/self/least/least.htm 

* http://www.orbitals.com/self/least/least.htm 

+ 
* {{MathWorld  urlname= LeastSquaresFittingPolynomial  title= Least Squares Fitting Polynomial }} 

+ 
*[http://math.fullerton.edu/mathews/n2003/LeastSqPolyMod.html Module for Least Squares Polynomials] 

* {{planetmath referenceid=1195title=Least squares}} 

* {{planetmath referenceid=1195title=Least squares}} 

+ 
* [http://www.personal.psu.edu/faculty/j/h/jhm/f90/lectures/lsq2.html Derivation of quadratic least squares] 

+ 
* [http://www.xuru.org/rt/TOC.asp xuru.org] Online least squares and regression tools 

+ 


+ 
[[Category:Optimization]] 

+ 
[[Category:Statistical estimation]] 




− 
[[Category:Optimization]][[Category:Statistics]] 

− 
[[Category:Single equation methods (econometrics)]] 

− 
[[Category:Mathematical and quantitative methods (economics)]] 






+ 
<! 

{{Link FAde}} 

{{Link FAde}} 





[[cs:Metoda nejmenších čtverců]] 

[[cs:Metoda nejmenších čtverců]] 

[[de:Methode der kleinsten Quadrate]] 

[[de:Methode der kleinsten Quadrate]] 

+ 
[[es:Mínimos cuadrados]] 

[[fr:Méthode des moindres carrés]] 

[[fr:Méthode des moindres carrés]] 

[[gl:Mínimos cadrados]] 

[[gl:Mínimos cadrados]] 

[[it:Metodo dei minimi quadrati]] 

[[it:Metodo dei minimi quadrati]] 

+ 
[[he:שיטת הריבועים הפחותים]] 

+ 
[[hu:Legkisebb négyzetek módszere]] 

[[nl:Kleinstekwadratenmethode]] 

[[nl:Kleinstekwadratenmethode]] 

[[ja:最小二乗法]] 

[[ja:最小二乗法]] 
Line 82: 
Line 120: 

[[ru:Метод наименьших квадратов]] 

[[ru:Метод наименьших квадратов]] 

[[su:Kuadrat leutik]] 

[[su:Kuadrat leutik]] 

+ 
[[fi:Pienimmän neliösumman menetelmä]] 

[[sv:Minstakvadratmetoden]] 

[[sv:Minstakvadratmetoden]] 

[[vi:Bình phương tối thiểu]] 

[[vi:Bình phương tối thiểu]] 

+ 
[[tr:En küçük kareler yöntemi]] 

[[zh:最小二乘法]] 

[[zh:最小二乘法]] 

+ 
> 

{{enWPLeast_squares}} 

{{enWPLeast_squares}} 