Changes: Linear regression
 
{{StatsPsy}}
   
In [[statistics]], '''linear regression''' is a [[Regression analysis|regression method]] of [[model (abstract)|modeling]] the conditional [[expected value]] of one variable ''y'' given the values of some other variable or variables ''x''.
+
In statistics, '''linear regression''' is used for two things:
  +
:*to construct a simple formula that will predict a value or values for a variable given the value of another variable.
  +
:*to test whether and how a given variable is related to another variable or variables.
   
Linear regression is called "linear" because the relation of the response to the explanatory variables is assumed to be a [[linear function]] of some parameters. It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of <math>y = ax+b </math> is a line. But in fact, if the model is (for example)
+
Linear regression is a form of [[regression analysis]] in which the relationship between one or more [[independent variable]]s and another variable, called the dependent variable, is modelled by a [[least squares]] function, called a linear regression equation. This function is a [[linear combination]] of one or more model parameters, called regression coefficients. A linear regression equation with one independent variable represents a straight line when the predicted value (i.e. the dependent variable from the regression equation) is plotted against the independent variable: this is called a [[simple linear regression]]. However, note that "linear" does not refer to this straight line, but rather to the way in which the regression coefficients occur in the regression equation. The results are subject to statistical analysis.
  +
[[Image:Linear regression.png|thumb|right|400px|Example of linear regression with one independent variable.]]
   
: <math>y_i = \alpha + \beta x_i + \gamma x_i^2 + \varepsilon_i</math>
+
== Introduction ==
  +
===Theoretical model===
   
(in which case we have put the vector <math>(x_i, x_i^2)</math> in the role formerly played by <math>x_i</math> and the vector <math>(\beta, \gamma)</math> in the role formerly played by <math>\beta</math>), then the problem is still one of '''linear''' regression, even though the graph is not a straight line.
+
A linear regression model assumes, given a random sample <math> (Y_i, X_{i1}, \ldots, X_{ip}), \, i = 1, \ldots, n </math>, a possibly imperfect relationship between <math> Y_i </math>, the regressand, and the regressors <math> X_{i1}, \ldots, X_{ip} </math>. A disturbance term <math> \varepsilon_i </math>, which is a random variable too, is added to this assumed relationship to capture the influence of everything else on <math> Y_i </math> other than <math> X_{i1}, \ldots, X_{ip} </math>. Hence, the multiple linear regression model takes the following form:
   
Regression models which are not a linear function of the parameters are called [[nonlinear regression]] models (for example, a multi-layer [[artificial neural network]]).
+
:<math> Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n </math>
   
Still more generally, regression may be viewed as a special case of [[density estimation]]. The [[joint probability|joint distribution]] of the response and explanatory variables can be constructed from the [[conditional probability|conditional distribution]] of the response variable and the [[marginal probability|marginal distribution]] of the explanatory variables. In some problems, it is convenient to work in the other direction: from the joint distribution, the conditional distribution of the response variable can be derived. Regression lines can be extrapolated, where the line is extended to fit the model for values of the explanatory variables outside their original range. However, extrapolation may be very inaccurate and can only be used reliably in certain instances.
+
Note that the regressors (the ''X''-s) are also called independent variables, exogenous variables, covariates, input variables or predictor variables. Similarly, regressands (the ''Y''-s) are also called dependent variables, response variables, measured variables, or predicted variables. There are ''p'' + 1 unknown parameters β<sub>0</sub>, β<sub>1</sub>, ..., β<sub>''p''</sub>. One of the purposes of linear regression is to determine these unknown parameters and their statistical significance (in other words: the mean value and standard deviation of each β<sub>''j''</sub>, ''j'' = 0, 1, ..., ''p'').
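A minimal numerical sketch of this model, written in Python with NumPy: the sample size, coefficient values and regressor distributions below are arbitrary assumptions used only to make the equation concrete.

<source lang="python">
import numpy as np

rng = np.random.default_rng(0)

n = 100                              # number of observations
beta = np.array([2.0, 1.5, -0.7])    # assumed beta_0, beta_1, beta_2

x1 = rng.uniform(0, 10, size=n)      # regressor X_i1
x2 = rng.uniform(0, 5, size=n)       # regressor X_i2
eps = rng.normal(0, 1.0, size=n)     # disturbance term epsilon_i

# Y_i = beta_0 + beta_1 X_i1 + beta_2 X_i2 + epsilon_i
y = beta[0] + beta[1] * x1 + beta[2] * x2 + eps

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones(n), x1, x2])
print(X.shape)                       # (100, 3): n rows, p + 1 = 3 columns
</source>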
   
==Historical remarks==
+
Models which do not conform to this specification may be treated by [[nonlinear regression]]. A linear regression model need not be a linear function of the independent variable: ''linear'' in this context means that the conditional mean of <math> Y_i </math> is linear in the parameters <math>\beta</math>. For example, the model <math> Y_i = \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i </math> is linear in the parameters <math> \beta_1 </math> and <math> \beta_2 </math>, but it is nonlinear in the variable <math> X_i </math>, since it contains the term <math> X_i^2 </math>, a nonlinear function of <math> X_i </math>. An illustration of this model is shown in the [[#Example|example]], below.
The earliest form of linear regression was the [[method of least squares]], which was published by [[Adrien Marie Legendre|Legendre]] in 1805,<ref name="Legendre">[[Adrien-Marie Legendre|A.M. Legendre]]. ''Nouvelles méthodes pour la détermination des orbites des comètes'' (1805). "Sur la Méthode des moindres quarrés" appears as an appendix.</ref> and by [[Carl Friedrich Gauss|Gauss]] in 1809.<ref name="Gauss">[[Carl Friedrich Gauss|C.F. Gauss]]. ''Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum''. (1809)</ref> The term "least squares" is from Legendre's term, ''moindres carrés''. However, Gauss claimed that he had known the method since 1795.
 
   
Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun. [[Leonhard Euler|Euler]] had worked on the same problem (1748) without success.{{fact}} Gauss published a further development of the theory of least squares in 1821,<ref name="Gauss2">C.F. Gauss. ''Theoria combinationis observationum erroribus minimis obnoxiae''. (1821/1823)</ref> including a version of the [[Gauss-Markov theorem]].
+
===Data and estimation===
   
The term "reversion" was used in the [[nineteenth century]] to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant ancestors. [[Francis Galton]] studied this phenomenon, and applied the slightly misleading term "[[regression toward the mean|regression towards mediocrity]]" to it (parents of exceptional individuals also tend on average to be less exceptional than their children). For Galton, regression had only this biological meaning, but his work (1877, 1885)<ref name="Galton1">[[Francis Galton]]. "Typical laws of heredity", Nature 15 (1877), 492-495, 512-514, 532-533. ''(Galton uses the term "reversion" in this paper, which discusses the size of peas.)''</ref><ref name="Galton2">Francis Galton. Presidential address, Section H, Anthropology. (1885) ''(Galton uses the term "regression" in this paper, which discusses the height of humans.)''</ref> was extended by [[Udny Yule]] and [[Karl Pearson]] to a more general statistical context (1897,{{fact}} 1903).<ref name="Yule">[[G. Udny Yule]]. "On the Theory of Correlation", J. Royal Statist. Soc., 1897, p. 812-54.</ref><ref name="Pearson">[[Karl Pearson]], G. U. Yule, Norman Blanchard, and Alice Lee. "The Law of Ancestral Heredity", ''[[Biometrika]]'' (1903)</ref> In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by [[Ronald A. Fisher|R.A. Fisher]] in his works of 1922 and 1925.<ref name="Fisher1">[[Ronald Fisher|R.A. Fisher]]. "The goodness of fit of regression formulae, and the distribution of regression coefficients", J. Royal Statist. Soc., 85, 597-612 (1922)</ref><ref name="Fisher2">R.A. Fisher. ''[[Statistical Methods for Research Workers]]'' (1925)</ref> Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.
+
It is important to distinguish the model formulated in terms of random variables and the observed values of these random variables. Typically, the observed values, or data, denoted by lower case letters, consist of ''n'' values <math> (y_i, x_{i1}, \ldots, x_{ip}), \, i = 1, \ldots, n </math>.
   
==Naming conventions==
+
In general there are <math> p + 1 </math> parameters to be determined, <math>\beta_0, \ldots, \beta_p</math>. In order to estimate the parameters it is often useful to use the matrix notation
The measured variable, ''y'', is conventionally called the "[[response variable]]". The terms "endogenous variable," "output variable," "criterion variable," and "dependent variable" are also used. The controlled or manipulated variables, ''x'', are called the [[explanatory variable]]s. The terms "exogenous variables," "input variables," "predictor variables" and "independent variables" are also used.
 
   
The terms ''independent'' or ''explanatory'' variable suggest that the variables are [[statistical independence|statistically independent]], which is not what the terms describe. These terms may also convey that the value of the independent/explanatory variable can be chosen at will. The ''response'' variable is then seen as an effect, that is, causally dependent on the independent/explanatory variable, as in a [[stimulus-response model]]. Although many linear regression models are formulated as models of cause and effect, the direction of causation may just as well go the other way, or indeed there need not be any [[causality|causal]] relation at all. For that reason, one may prefer the terms "predictor / response" or "endogenous / exogenous," which do not imply causality. (See [[correlation does not imply causation]] for more on the topic of cause and effect relationships in correlative designs.)
+
:<math>\mathbf{ Y = X \boldsymbol\beta + \boldsymbol\varepsilon} \,</math> <!--- Do not delete "\,", it improves the display of the formula on certain browsers --->
   
The explanatory and response variables may be [[scalar (mathematics)|scalar]]s or [[coordinate vector|vector]]s. [[Linear regression#Multiple linear regression|Multiple linear regression]] includes cases with more than one explanatory variable.
+
where '''Y''' is a column vector that includes the observed values of <math> Y_1, \ldots, Y_n </math>, <math> \varepsilon </math> includes the unobserved stochastic components <math> \varepsilon_1, \ldots, \varepsilon_n </math> and the matrix '''X''' the observed values of the regressors
   
==Statement of the simple linear regression model==
+
:<math> \mathbf {X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.</math>
   
The simple linear regression model is typically stated in the form
+
'''X''' includes, typically, a constant column, that is, a column which does not vary across observations, which is used to represent the intercept term <math> \beta_0 </math>. The matrix '''X''' is sometimes called the [[design matrix]].
   
:<math> y = \alpha + \beta x + \varepsilon. \, </math>
+
If there is any [[linear dependence|''linear'' dependence]] among the columns of '''X''', then the vector of parameters <math>\beta</math> cannot be estimated by least squares unless <math>\beta</math> is constrained, as, for example, by requiring the sum of some of its components to be 0. However, some linear combinations of the components of <math>\boldsymbol\beta</math> may still be uniquely estimable in such cases. For example, the model
   
The right hand side may take more general forms, but generally comprises a [[linear combination]] of the parameters, here denoted &alpha; and &beta;. The term &epsilon; represents the unpredicted or unexplained variation in the response variable; it is conventionally called the "error" whether it is really a [[measurement error]] or not, and is assumed to be independent of <math>x</math>. The error term is conventionally assumed to have [[expected value]] equal to zero (a nonzero expected value could be absorbed into &alpha;). See also [[errors and residuals in statistics]]; the difference between an error and a residual is also dealt with below.
+
:<math> Y_i = \beta_1 X_i + \beta_2 2 X_i + \varepsilon_i </math>
   
An equivalent formulation which explicitly shows the linear regression as a model of conditional expectation is
+
cannot be solved for <math> \beta_1 </math> and <math> \beta_2 </math> independently as the matrix of observations has the reduced [[Rank (linear algebra)|rank]] 2. In this case the model can be rewritten as
   
:<math> \mbox{E}(y|x) = \alpha + \beta x \, </math>
+
:<math> Y_i = (\beta_1 + 2\beta_2)X_i + \varepsilon_i </math>
   
with the [[conditional probability|conditional distribution]] of ''y'' given ''x'' essentially the same as the distribution of the error term.
+
and solved to give a value for the composite entity <math>\beta_1+2\beta_2</math>.
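A small NumPy sketch of this situation, using made-up data: the two-column matrix with columns <math>X</math> and <math>2X</math> has rank 1, so <math>\beta_1</math> and <math>\beta_2</math> cannot be separated, but the composite <math>\beta_1 + 2\beta_2</math> obtained from the minimum-norm least-squares solution agrees with the slope found by regressing on <math>X</math> alone.

<source lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=50)
y = 3.0 * x + rng.normal(0, 0.5, size=50)    # assumed composite slope beta_1 + 2*beta_2 = 3

X_def = np.column_stack([x, 2 * x])          # linearly dependent columns
print(np.linalg.matrix_rank(X_def))          # 1, not 2

# lstsq returns the minimum-norm least-squares solution for rank-deficient systems
b, *_ = np.linalg.lstsq(X_def, y, rcond=None)
slope, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# beta_1 and beta_2 separately are arbitrary, but beta_1 + 2*beta_2 is estimable
print(b[0] + 2 * b[1], slope[0])             # these two numbers agree
</source>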
   
A linear regression model need not be [[affine transformation|affine]], let alone linear, in the explanatory variables ''x''. For example,
+
Note that to only perform a least squares estimation of <math> \beta_0, \beta_1, \ldots, \beta_p </math> it is not necessary to consider the sample as random variables. It may even be conceptually simpler to consider the sample as fixed, observed values, as we have done thus far. However in the context of hypothesis testing and confidence intervals, it will be necessary to interpret the sample as random variables <math> (Y_i, X_{i1}, \ldots, X_{ip}), \, i = 1, \ldots, n </math> that will produce estimators which are themselves random variables. Then it will be possible to study the distribution of the estimators and draw inferences.
   
: <math>y = \alpha + \beta x + \gamma x^2 + \varepsilon</math>
+
===Classical assumptions===
   
is a linear regression model, for the right-hand side is a linear combination of the parameters &alpha;, &beta;, and &gamma;. In this case it is useful to think of ''x''<sup>2</sup> as a new explanatory variable, formed by modifying the original variable ''x''. Indeed, any linear combination of functions ''f''(''x''), ''g''(''x''), ''h''(''x''), ..., is a linear regression model,
+
Classical assumptions for linear regression include the assumptions that the sample is selected at random from the population of interest, that the dependent variable is continuous on the real line, and that the error terms follow identical and independent [[normal distribution]]s, that is, that the errors are [[i.i.d.]] and [[normal distribution|Gaussian]]. Note that these assumptions imply that the error term does not statistically depend on the values of the independent variables, that is, that <math> \varepsilon_i </math> is statistically independent of the predictor variables. This article adopts these assumptions unless otherwise stated. Note that all of these assumptions may be relaxed, depending on the nature of the true probabilistic model of the problem at hand. The issue of choosing which assumptions to relax, which functional form to adopt, and other choices related to the underlying probabilistic model are known as ''specification searches''. In particular note that the assumption that the error terms are normally distributed is of no consequence unless the sample is very small because [[central limit theorem]]s imply that, so long as the error terms have finite variance and are not too strongly correlated, the parameter estimates will be approximately normally distributed even when the underlying errors are not.
so long as these functions do not have any free parameters (otherwise the model is generally a nonlinear regression model). The least-squares estimates of &alpha;, &beta;, and &gamma; are linear in the response variable ''y'', and nonlinear in ''x'' (they are nonlinear in ''x'' even if the &gamma; and &alpha; terms are absent; if only &beta; were present then doubling all observed ''x'' values would multiply the least-squares estimate of &beta; by 1/2).
 
   
Often in linear regression problems statisticians rely on the [[Carl Friedrich Gauss|Gauss]]-[[Andrei Andreevich Markov|Markov]] assumptions:
+
Under these assumptions, an equivalent formulation of [[simple linear regression]] that explicitly shows the linear regression as a model of conditional expectation can be given as
* The random errors &epsilon;<sub>''i''</sub> have expected value 0.
 
* The random errors &epsilon;<sub>''i''</sub> are uncorrelated (this is weaker than an assumption of [[statistical independence|probabilistic independence]]).
 
* The random errors &epsilon;<sub>''i''</sub> are [[homoscedastic]], i.e., they all have the same [[variance]].
 
(See also [[Gauss-Markov theorem]]. That result says that under the assumptions above, least-squares estimators are in a certain sense optimal.)
 
   
Sometimes a stronger set of assumptions is relied on:
+
:<math> \mbox{E}(Y_i \mid X_i = x_i) = \alpha + \beta x_i \, </math>
* The random errors &epsilon;<sub>''i''</sub> have expected value 0.
 
* They are [[statistical independence|independent]].
 
* They are [[normal distribution|normally distributed]].
 
* They are [[homoscedastic]].
 
   
If ''x<sub>i</sub>'' is a [[coordinate vector|vector]] we can take the product &beta;''x<sub>i</sub>'' to be a scalar product (see "[[dot product]]").
+
The [[conditional expected value]] of ''Y''<sub>''i''</sub> given ''X''<sub>''i''</sub> is an [[affine function]] of ''X''<sub>''i''</sub>. Note that this expression follows from the assumption that the mean of <math>\varepsilon_i </math> is zero conditional on ''X''<sub>''i''</sub>.
   
A statistician will usually '''estimate''' the unobservable values of the parameters &alpha; and &beta; by the [[least squares|'''method of least squares''']], which consists of finding the values of ''a'' and ''b'' that minimize the sum of squares of the '''residuals'''
+
==Least-squares analysis==
  +
===Least squares estimates===
   
: <math>\widehat\varepsilon_i = y_i - (\widehat{\alpha} + \widehat{\beta}x_i). \,</math>
+
The first objective of regression analysis is to best-fit the data by estimating the parameters of the model. Of the different criteria that can be used to define what constitutes a best fit, the least squares criterion is a very powerful one. The estimate (or estimator, in the context of a random sample) is given by
   
Those values of <math>\widehat{\alpha}</math> and <math>\widehat{\beta}</math> are the "least-squares estimates" of &alpha; and &beta; respectively. The residuals may be regarded as estimates of the errors; see also [[errors and residuals in statistics]].
+
:<math> \boldsymbol\hat\beta = (\mathbf {X}^T \mathbf {X})^{-1}\mathbf {X}^T \mathbf {y} \, </math>
   
Notice that, whereas the errors are independent, the residuals cannot be independent because the use of least-squares estimates implies that the sum of the residuals must be 0, and the scalar product of the vector of residuals with the vector of ''x''-values must be 0, i.e., we must have
+
For a full derivation see [[Linear least squares]].
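As a hedged numerical sketch of this estimate (all data and coefficients below are assumed for illustration), <math>\hat\beta</math> can be computed either directly from the normal equations or, usually preferably for numerical stability, with a least-squares solver:

<source lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 1.0, n)   # assumed coefficients
X = np.column_stack([np.ones(n), x1, x2])

# beta_hat = (X^T X)^{-1} X^T y, written with solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically better-behaved route via a least-squares solver
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)         # close to the assumed (2.0, 1.5, -0.7)
print(beta_hat_lstsq)   # essentially identical
</source>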
   
: <math>\widehat{\varepsilon}_1 + \cdots + \widehat{\varepsilon}_n = 0 \,</math>
+
=== Regression inference ===
   
and
+
The estimates can be used to test various hypotheses.
   
: <math>\widehat{\varepsilon}_1 x_1 + \cdots + \widehat{\varepsilon}_n x_n = 0. \,</math>
+
Denote by <math> \sigma^2 </math> the variance of the error term <math> \varepsilon </math> (recall we assume that <math> \varepsilon_i \sim N(0,\sigma^2)\, </math> for every <math> i=1,\ldots,n </math>). An unbiased estimate of <math> \sigma ^2 </math> is given by
   
These two linear constraints imply that the vector of residuals must lie within a certain (''n''&nbsp;&minus;&nbsp;2)-dimensional subspace of R<sup>''n''</sup>; hence we say that there are "''n'' &minus; 2 degrees of freedom for error". If one assumes the errors are normally distributed and independent, then it can be shown to follow that 1) the sum of squares of residuals
+
:<math>\hat \sigma^2 = \frac {S} {n-p-1} , </math>
   
: <math>\widehat{\varepsilon}_1^2 + \cdots + \widehat{\varepsilon}_n^2 \,</math>
+
where <math> S := \sum_{i=1}^n \hat{\varepsilon}_i^2 </math> is the sum of squared residuals.
  +
The relation between the estimate and the true value is:
  +
:<math>\hat\sigma^2 \cdot \frac{n-p-1}{\sigma^2} \sim \chi_{n-p-1}^2 </math>
  +
where <math> \chi_{n-p-1}^2 </math> denotes a [[chi-square distribution]] with ''n''&nbsp;&minus;&nbsp;''p''&nbsp;&minus;&nbsp;1 degrees of freedom.
   
is distributed as
+
The solution to the normal equations can be written as
  +
<ref>The parameter estimators should be obtained by solving the normal equations as simultaneous linear equations. The inverse normal equations matrix need only be calculated in order to obtain the standard deviations on the parameters. </ref>
   
: <math>\sigma^2 \chi^2_{n - 2}, \,</math>
+
:<math>\hat\boldsymbol\beta=(\mathbf X ^T \mathbf X)^{-1} \mathbf X ^T \mathbf y.</math>
   
So we have:
+
This shows that the parameter estimators are linear combinations of the dependent variable. It follows that, if the observational errors are normally distributed, the parameter estimators will follow a joint normal distribution. Under the assumptions here, the estimated parameter vector is exactly distributed,
* the sum of squares divided by the error-variance &sigma;<sup>2</sup>, has a [[chi-square distribution]] with ''n'' &minus; 2 degrees of freedom,
 
* the sum of squares of residuals is actually probabilistically independent of the estimates <math>\widehat{\alpha},\,\widehat{\beta}</math> of the parameters &alpha; and &beta;.
 
   
These facts make it possible to use [[Student's t-distribution]] with ''n'' &minus; 2 degrees of freedom (so named in honor of the pseudonymous "[[William Sealey Gosset|Student]]") to find confidence intervals for &alpha; and &beta;.
+
:<math> \hat\boldsymbol\beta \sim N ( \boldsymbol\beta, \sigma^2 (\mathbf X^T \mathbf X)^{-1} ) </math>
   
== Parameter estimation using least-squares ==
+
where ''N'' denotes the [[multivariate normal distribution]].
By recognizing that the <math>y_i = \alpha + \beta x_i + \varepsilon_i </math> regression model is a system of linear equations we can express the model using data matrix '''X''', ''target'' vector '''Y''' and parameter vector <math>\delta</math>. The ''i''th row of '''X''' and '''Y''' will contain the ''x'' and ''y'' value for the ''i''th data sample. Then the model can be written as
 
   
: <math> \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}= \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} </math>
+
The [[standard error]] of the ''j''-th parameter estimate <math>\hat\beta_j</math> (where ''j'' = 0, 1, ..., ''p'') is given by
which when using pure matrix notation becomes
 
: <math>Y = X \delta + \varepsilon \,</math>
 
   
where &epsilon; is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance &sigma;<sup>2</sup> ''I''<sub>''n''</sub>, where ''I<sub>n</sub>'' is the ''n''&times;''n'' identity matrix.
+
:<math>\hat\sigma_j=\sqrt{ \frac{S}{n-p-1}\left[(\mathbf X^T \mathbf X)^{-1}\right]_{jj}}.</math>
   
The [[least-squares estimation of linear regression coefficients|least-squares estimator]] for <math>\delta</math> is
+
The 100(1&nbsp;&minus;&nbsp;''α'')% [[confidence interval]] for the parameter, <math>\beta_j </math>, is computed as follows:
   
: <math>\widehat{\delta} = (X' X)^{-1}\; X' Y \,</math>
+
:<math>\hat \beta_j \pm t_{\frac{\alpha }{2},n - p - 1} \hat \sigma_j. </math>
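The following sketch puts the last few formulas together on simulated data (assumed coefficients and noise level); it computes <math>\hat\sigma^2</math>, the standard errors <math>\hat\sigma_j</math> and the 95% confidence intervals for the <math>\beta_j</math>, using SciPy only for the ''t'' critical value.

<source lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 2
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 1.0, n)   # assumed model
X = np.column_stack([np.ones(n), x1, x2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
S = resid @ resid                                   # sum of squared residuals
sigma2_hat = S / (n - p - 1)                        # unbiased estimate of sigma^2

XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))         # standard error of each beta_j

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p - 1)   # two-sided critical value
lower, upper = beta_hat - t_crit * se, beta_hat + t_crit * se
for j in range(p + 1):
    print(f"beta_{j}: {beta_hat[j]:.3f}  95% CI [{lower[j]:.3f}, {upper[j]:.3f}]")
</source>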
   
(where ''X''&prime; is the transpose of ''X'') and the sum of squares of residuals is
+
The residuals can be expressed as
   
: <math>Y' (I_n - X (X' X)^{-1} X')\, Y.</math>
+
:<math>\mathbf{\hat r = y-X \hat\boldsymbol\beta= y-X} ( \mathbf X^ T \mathbf X)^{-1} \mathbf X^T \mathbf y.\,</math>
   
One of the properties of least-squares is that the vector of fitted values <math>X\widehat{\delta}</math> is the orthogonal projection of ''Y'' onto the column space of ''X''.
+
The matrix <math>\mathbf X( \mathbf X^T \mathbf X)^{-1} \mathbf X^T</math> is known as the [[hat matrix]] and has the useful property that it is [[idempotent]]. Using this property it can be shown that, if the errors are normally distributed, the residuals will follow a normal distribution with covariance matrix <math>\sigma^2(\mathbf I - \mathbf X(\mathbf X^T \mathbf X)^{-1} \mathbf X^T) \,</math>.
  +
[[Studentized residuals]] are useful in testing for [[outliers]].
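A short sketch, on made-up data, of the hat matrix and the residuals it produces; it checks idempotency numerically and forms internally studentized residuals of the kind used for outlier checks.

<source lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, n)       # assumed simple model
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
print(np.allclose(H @ H, H))                    # True: H is idempotent

resid = (np.eye(n) - H) @ y                     # residuals r = (I - H) y
print(np.allclose(resid, y - X @ np.linalg.lstsq(X, y, rcond=None)[0]))  # True

# The residual covariance is sigma^2 (I - H); its diagonal enters the
# studentization used when screening for outliers.
sigma2_hat = resid @ resid / (n - 2)
student = resid / np.sqrt(sigma2_hat * (1 - np.diag(H)))   # internally studentized
print(student)
</source>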
   
The fact that the matrix ''X''(''X''&prime;''X'')<sup>&minus;1</sup>''X''&prime; is a [[symmetric matrix|symmetric]] [[idempotent]] matrix is relied on repeatedly in proofs of theorems. The linearity of <math>\widehat{\delta}</math> as a function of the vector ''Y'', expressed above by saying
+
The hat matrix is the matrix of the orthogonal [[projection (linear algebra)|projection]] onto the column space of the matrix ''X''.
   
:<math>\widehat{\delta} = (X'X)^{-1}X'Y,\,</math>
+
Given the values of the independent variables ''x''<sub>''d''1</sub>, ..., ''x''<sub>''dp''</sub> for an observation ''d'' (where ''d'' = 1, 2, ..., ''n''), the predicted response is calculated as
   
is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.
+
:<math>\hat y_d=\hat\beta_0 + \sum_{j=1}^{p}x_{dj}\hat\beta_j. </math>
   
The matrix ''I<sub>n</sub>''&nbsp;&minus;&nbsp;''X'' (''X''&prime; ''X'')<sup>&minus;1</sup> ''X''&prime; that appears above is a symmetric idempotent matrix of rank ''n''&nbsp;&minus;&nbsp;2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional [[spectral theorem]] of [[linear algebra]] says that any real symmetric matrix ''M'' can be diagonalized by an [[orthogonal matrix]] ''G'', i.e., the matrix ''G''&prime;''MG'' is a diagonal matrix. If the matrix ''M'' is also idempotent, then the diagonal entries in ''G''&prime;''MG'' must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. So ''I''<sub>''n''</sub>&nbsp;&minus;&nbsp;''X''(''X''&prime;''X'')<sup>&minus;1</sup>''X''&prime;, after diagonalization, has ''n''&nbsp;&minus;&nbsp;2 entries equal to 1 and two entries equal to 0 on the diagonal. That is most of the work in showing that the sum of squares of residuals has a chi-square distribution with ''n''&nbsp;&minus;&nbsp;2 degrees of freedom.
+
Writing <math>\mathbf z = (1, x_{d1}, \dots, x_{dp})^T</math> (the leading 1 corresponding to the intercept column of '''X'''), the 100(1&nbsp;&minus;&nbsp;''α'')% [[mean response]] confidence interval for the prediction is given, using [[error propagation]] theory, by:
   
Regression parameters can also be estimated by [[Bayesian]] methods. This has the advantages that
+
:<math>\mathbf z^T\hat\boldsymbol\beta \pm t_{ \frac{\alpha }{2} ,n-p-1} \hat \sigma \sqrt {\mathbf z^T( \mathbf X^T \mathbf X)^{-1} \mathbf z }.</math>
   
* [[confidence interval]]s can be produced for parameter estimates without the use of asymptotic approximations,
+
The 100(1&nbsp;&minus;&nbsp;''α'')% [[predicted response]] confidence intervals for the data are given by:
* prior information can be incorporated into the analysis.
 
   
Suppose that in the linear regression
+
:<math>\mathbf z^T \hat\boldsymbol\beta \pm t_{\frac{\alpha }{2},n-p-1} \hat \sigma \sqrt {1 + \mathbf z^T(\mathbf X^T \mathbf X)^{-1} \mathbf z}.</math>
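A sketch of both interval formulas on simulated data (the point '''z''' and all model values are assumptions); note the leading 1 in '''z''' corresponding to the intercept column of '''X'''.

<source lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 2
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 1.0, n)   # assumed model
X = np.column_stack([np.ones(n), x1, x2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p - 1))
XtX_inv = np.linalg.inv(X.T @ X)

z = np.array([1.0, 5.0, 2.5])          # hypothetical point (leading 1 for the intercept)
y_pred = z @ beta_hat
t_crit = stats.t.ppf(0.975, df=n - p - 1)

half_mean = t_crit * sigma_hat * np.sqrt(z @ XtX_inv @ z)        # mean-response interval
half_new = t_crit * sigma_hat * np.sqrt(1 + z @ XtX_inv @ z)     # predicted-response interval
print(y_pred - half_mean, y_pred + half_mean)
print(y_pred - half_new, y_pred + half_new)
</source>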
   
:<math> y = \alpha + \beta x + \varepsilon \, </math>
+
==== Univariate linear case ====
   
we know from domain knowledge that alpha can only take one of the values {&minus;1,&nbsp;+1} but we do not know which. We can build this information into the analysis by choosing a prior for alpha which is a discrete distribution with a probability of 0.5 on &minus;1 and 0.5 on +1. The posterior for alpha will also be a discrete distribution on {&minus;1,&nbsp;+1}, but the probability weights will change to reflect the evidence from the data.
+
We consider here the case of the simplest regression model, <math> Y = \alpha + \beta X + \varepsilon </math>. In order to estimate <math> \alpha </math> and <math> \beta </math>, we have a sample <math> (y_i, x_i), \, i = 1, \ldots, n </math> of observations which are, here, not seen as random variables and denoted by lower case letters. As stated in the introduction, however, we might want to interpret the sample in terms of random variables in some other contexts than least squares estimation.
   
In modern computer applications, the actual value of <math> \beta</math> is calculated using the [[QR decomposition]], or by more elaborate methods when <math>X'X</math> is nearly singular. The code for the MATLAB backslash (\) operator is an excellent example of a robust method.
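For illustration, a NumPy sketch of the QR route on made-up data; NumPy's own least-squares solver (which is SVD-based) is shown alongside as a cross-check. This is only a sketch of the idea, not the MATLAB implementation referred to above.

<source lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)      # assumed data
X = np.column_stack([np.ones(n), x])

# QR route: X = QR, so X beta = y becomes R beta = Q^T y (solved by back-substitution)
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Library least-squares solver gives the same answer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_qr, beta_lstsq))        # True
</source>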
+
The idea of least squares estimation is to minimize the following ''unknown'' quantity, the sum of squared errors:
   
===Robust regression===
+
:<math> \sum_{i = 1}^n \varepsilon_i^2 = \sum_{i = 1}^n (y_i - \alpha - \beta x_i)^2 </math>
   
A host of alternative approaches to the computation of regression parameters are included in the category known as [[robust regression]]. One technique minimizes the mean [[approximation error|absolute error]], or some other function of the residuals, instead of the mean squared error as in linear regression. Robust regression is much more computationally intensive than linear regression and is somewhat more difficult to implement as well. While least squares estimates are not very sensitive to violations of the normality assumption for the errors, this is not true when the variance or mean of the error distribution is not bounded, or when no analyst is available to identify outliers.
+
Taking the derivative of the preceding expression with respect to <math> \alpha </math> and <math> \beta </math> yields the ''normal equations'':
   
In the [[Stata]] culture, ''robust regression'' means linear regression with Huber-White standard error estimates. This relaxes the assumption of [[homoskedasticity]] for the variance estimates only; the coefficients are still ordinary least squares (OLS) estimates.
+
:<math>\begin{array}{lcl}
n\ \alpha + \displaystyle\sum_{i = 1}^n x_i\ \beta = \displaystyle\sum_{i = 1}^n y_i \\
\displaystyle\sum_{i = 1}^n x_i\ \alpha + \displaystyle\sum_{i = 1}^n x_i^2\ \beta = \displaystyle\sum_{i = 1}^n x_i y_i
\end{array}</math>
   
=== Summarizing the data ===
+
This is a linear system of equations which can be solved using [[Cramer's rule]]:
   
We sum the observations, the squares of the ''Y''s and ''X''s and the products ''XY'' to obtain the following quantities.
+
:<math>\hat\beta = \frac {n \displaystyle\sum_{i = 1}^n x_i y_i - \displaystyle\sum_{i = 1}^n x_i \displaystyle\sum_{i = 1}^n y_i} {n \displaystyle\sum_{i = 1}^n x_i^2 - \left(\displaystyle\sum_{i = 1}^n x_i\right)^2} =\frac{\displaystyle\sum_{i = 1}^n(x_i-\bar{x})(y_i-\bar{y})}{\displaystyle\sum_{i = 1}^n(x_i-\bar{x})^2} \,</math>
   
: <math>S_X = x_1 + x_2 + \cdots + x_n \,</math>
+
:<math>\hat\alpha = \frac {\displaystyle\sum_{i = 1}^n x_i^2 \displaystyle\sum_{i = 1}^n y_i - \displaystyle\sum_{i = 1}^n x_i \displaystyle\sum_{i = 1}^n x_iy_i} {n \displaystyle\sum_{i = 1}^n x_i^2 - \left(\displaystyle\sum_{i = 1}^n x_i\right)^2}= \bar y-\bar x \hat\beta </math>
   
and <math>S_Y</math> similarly.
+
:<math>S = \sum_{i = 1}^n (y_i - \hat{y}_i)^2 = \sum_{i = 1}^n y_i^2 - \frac {n \left(\displaystyle\sum_{i = 1}^n x_i y_i \right)^2 + \left(\displaystyle\sum_{i = 1}^n y_i \right)^2 \displaystyle\sum_{i = 1}^n x_i^2 - 2 \displaystyle\sum_{i = 1}^n x_i \displaystyle\sum_{i = 1}^n y_i \displaystyle\sum_{i = 1}^n x_i y_i } {n \displaystyle\sum_{i = 1}^n x_i^2 - \left(\displaystyle\sum_{i = 1}^n x_i\right)^2}</math>
   
: <math>S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2 \,</math>
+
:<math>\hat \sigma^2 = \frac {S} {n-2}. </math>
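The closed-form expressions above translate directly into a few lines of NumPy; the data below are simulated with assumed values of <math>\alpha</math>, <math>\beta</math> and the error standard deviation.

<source lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0, 10, n)
y = 4.0 + 1.2 * x + rng.normal(0, 0.8, n)      # assumed alpha = 4.0, beta = 1.2

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
alpha_hat = ybar - xbar * beta_hat

resid = y - alpha_hat - beta_hat * x
S = np.sum(resid ** 2)                         # sum of squared residuals
sigma2_hat = S / (n - 2)
print(alpha_hat, beta_hat, sigma2_hat)
</source>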
   
and <var>S<sub>YY</sub></var> similarly.
+
The covariance matrix of the estimates <math>(\hat\alpha, \hat\beta)</math> is
   
: <math>S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \,</math>
+
:<math>\frac{\sigma^2}{n \displaystyle\sum_{i = 1}^n x_i^2 - \left(\displaystyle\sum_{i = 1}^n x_i\right)^2}\begin{pmatrix}
\sum x_i^2 & -\sum x_i \\
-\sum x_i & n
\end{pmatrix}</math>
   
=== Estimating beta (the slope) ===
+
The [[mean response]] confidence interval is given by
  +
:<math>\hat y_d = (\hat\alpha+\hat\beta x_d) \pm t_{ \frac{\alpha }{2} ,n-2} \hat \sigma \sqrt {\frac{1}{n} + \frac{(x_d - \bar{x})^2}{\sum (x_i - \bar{x})^2}}</math>
   
We use the summary statistics above to calculate <math>\widehat\beta</math>, the estimate of &beta;.
+
The [[predicted response]] confidence interval is given by
  +
:<math>\hat y_d = (\hat\alpha+\hat\beta x_d) \pm t_{ \frac{\alpha }{2} ,n-2} \hat \sigma \sqrt {1+\frac{1}{n} + \frac{(x_d - \bar{x})^2}{\sum (x_i - \bar{x})^2}}</math>
   
: <math>\widehat\beta = {n S_{XY} - S_X S_Y \over n S_{XX} - S_X S_X}. \,</math>
+
The term <math>t_{ \frac{\alpha }{2} ,n-2}</math> is the appropriate critical value of the [[Student's t-distribution]] (here ''α'' denotes the significance level, not the intercept), and <math>\hat \sigma</math> is the estimated standard deviation of the error term.
   
=== Estimating alpha (the intercept) ===
+
=== Analysis of variance ===
  +
In [[analysis of variance]] (ANOVA), the total sum of squares is split into two or more components.
   
We use the estimate of &beta; and the other statistics to estimate &alpha; by:
+
The "total (corrected) sum of squares" is
   
: <math>\widehat\alpha = {S_Y - \widehat\beta S_X \over n}. \,</math>
+
: <math> \text{SST} = \sum_{i=1}^n (y_i - \bar y)^2, </math>
   
A consequence of this estimate is that the regression line will always pass through the "center" <math>(\bar{x},\bar{y}) = (S_X/n, S_Y/n)</math>.
+
where
   
=== Displaying the residuals ===
+
: <math> \bar y = \frac{1}{n} \sum_{i=1}^n y_i</math>
   
The first method of displaying the residuals uses the [[histogram]] or [[Cumulative distribution function|cumulative distribution]] to depict the similarity (or lack thereof) to a [[normal distribution]]. Non-normality suggests that the model may not be a good summary description of the data.
+
is the average value of observed ''y''<sub>''i''</sub>. (Here "corrected" means <math>\scriptstyle\bar y\,</math> has been subtracted from each ''y''-value.) Equivalently
   
We plot the residuals,
+
: <math> \text{SST} = \sum_{i=1}^n y_i^2 - \frac{1}{n}\left(\sum_{i=1}^n y_i\right)^2 </math>
   
:<math>\widehat{\varepsilon}_i=y_i - \widehat{\alpha} - \widehat{\beta}x_i \,</math>
+
The total sum of squares is partitioned as the sum of the "regression sum of squares" SSReg (or RSS, also called the "explained sum of squares") and the "error sum of squares" SSE, which is the sum of squares of residuals.
   
against the explanatory variable, ''x''. There should be no discernible trend or pattern if the model is satisfactory for these data. Some of the possible problems are:
+
The regression sum of squares is
*Residuals increase (or decrease) as the explanatory variable increases &mdash;indicates mistakes in the calculations. Find the mistakes and correct them.
 
*Residuals first rise and then fall (or first fall and then rise) &mdash;indicates that the appropriate model is (at least) quadratic. Adding a quadratic term (and then possibly higher) to the model may be appropriate. See [[Non-linear regression|nonlinear regression]] and [[Multiple regression|multiple linear regression]].
 
*One residual is much larger than the others &mdash;suggests that there is one unusual observation which is distorting the fit.
 
**Verify its value before publishing ''or''
 
**Eliminate it, document your decision to do so, and recalculate the statistics.
 
:[[Studentized residual]]s can be used in [[outlier]] detection.
 
*The vertical spread of the residuals increases as the explanatory variable increases (funnel-shaped plot) &mdash;indicates that the [[homoskedasticity]] assumption is violated (i.e. there is [[heteroskedasticity]]: the variability of the response depends on the value of ''x'').
 
**Transform the data: for example, the [[logarithm]] or [[logit]] transformations are often useful.
 
**Use a more general modeling approach that can account for non-constant variance, for example a [[general linear model]] or a [[generalized linear model]].
 
   
=== Ancillary statistics ===
+
:<math>\text{SSReg} = \sum_{i=1}^n \left( \hat y_i - \bar y \right)^2 = \hat\boldsymbol\beta ^T \mathbf X^T \mathbf y - \frac{1}{n}\left( \mathbf y^T \mathbf u \mathbf u^T \mathbf y \right), </math>
   
The sum of squared deviations can be partitioned as in [[analysis of variance|ANOVA]] to indicate what part of the dispersion of the response variable is explained by the explanatory variable.
+
where '''u''' is an ''n''-by-1 vector in which each element is 1. Note that
   
The [[correlation|correlation coefficient]], ''r'', can be calculated by
+
: <math>\mathbf y^T \mathbf u = \mathbf u^T \mathbf y = \sum_{i=1}^n y_i,</math>
: <math>r = {n S_{XY} - S_X S_Y \over \sqrt{(n S_{XX} - S_X^2) (n S_{YY} - S_Y^2)}}. \,</math>
 
   
This statistic is a measure of how well a straight line describes the data. Values near zero suggest that the model is ineffective. ''r''<sup>2</sup> is frequently interpreted as the fraction of the variability explained by the explanatory variable, ''X''.
+
and
  +
  +
: <math>\frac{1}{n} \mathbf y^T \mathbf u \mathbf u^T \mathbf y = \frac{1}{n}\left(\sum_{i=1}^n y_i\right)^2.\,</math>
  +
  +
The error (or "unexplained") sum of squares SSE, which is the sum of square of residuals, is given by
  +
  +
:<math>\text{SSE} = \sum_{i=1}^n {\left( {y_i - \hat y_i} \right)^2 } = \mathbf y^T \mathbf y - \hat\boldsymbol\beta^T \mathbf X^T \mathbf y.</math>
  +
  +
The total sum of squares SST is
  +
  +
:<math>\text{SST} = \sum_{i=1}^n \left( y_i-\bar y \right)^2 = \mathbf{ y^T y}-\frac{1}{n}\left( \mathbf y^T \mathbf u \mathbf u^T \mathbf y\right)=\text{SSReg}+ \text{SSE}.</math>
  +
  +
The [[coefficient of determination]], ''R''<sup>2</sup>, is then given as
  +
  +
:<math>R^2 = \frac{\text{SSReg}}{{\text{SST}}} = 1 - \frac{\text{SSE}}{\text{SST}}.</math>
  +
  +
If the errors are independent and normally distributed with expected value 0 and they all have the same variance, then under the [[null hypothesis]] that all of the elements of ''β'' except the constant term β<sub>0</sub> are equal to 0, the statistic
  +
  +
: <math> \frac{ R^2 / p }{ ( 1 - R^2 )/(n-p-1)} </math>
  +
  +
follows an [[F-distribution]] with ''p'' and ''n''&nbsp;&minus;&nbsp;''p''&nbsp;&minus;&nbsp;1 degrees of freedom (''n'' is the number of observations and ''p''&nbsp;+&nbsp;1 is the number of unknown parameters β<sub>0</sub>, β<sub>1</sub>, ..., β<sub>''p''</sub>). If that statistic is too large, then one rejects the null hypothesis. How large is too large depends on the level of the test, which is the tolerated probability of [[type I error]]; see [[statistical significance]].
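A sketch, on simulated data with assumed coefficients, of the sums of squares, <math>R^2</math> and the overall ''F'' statistic just described; SciPy is used only for the upper-tail ''F'' probability.

<source lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 2
x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 1.0, n)   # assumed model
X = np.column_stack([np.ones(n), x1, x2])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)
SSReg = SST - SSE
R2 = SSReg / SST                               # = 1 - SSE/SST

F = (R2 / p) / ((1 - R2) / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)          # upper-tail probability
print(R2, F, p_value)
</source>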
  +
  +
===Goodness of fit===
  +
  +
The question of goodness of fit arises in the following situation: suppose that two statistical models have been proposed for the data:
  +
  +
:<math>\mathbf Y = \mathbf X_{(1)} \boldsymbol\beta_{(1)} + \boldsymbol\varepsilon \,</math>
  +
  +
and
  +
  +
:<math>\mathbf Y = \mathbf X_{(2)} \boldsymbol\beta_{(2)} + \boldsymbol\varepsilon \,</math>
  +
  +
Which one is better?
  +
  +
Let
  +
  +
:<math>\text{SSE}_{(j)} = \sum_{i=1}^n {\left( {y_i - \hat y_{(j)i}} \right)^2 },\qquad j = 1, 2 </math>
  +
  +
be the sum of squared residuals of the ''j''-th model (''j'' = 1 or 2). Then the ratio
  +
  +
:<math>Q = \frac {\text{SSE}_{(1)} / (n - p_{(1)} - 1)} {\text{SSE}_{(2)} / (n - p_{(2)} - 1)}</math>
  +
  +
has an [[F-distribution]] with ''n''&nbsp;&minus;&nbsp;''p''<sub>(1)</sub>&nbsp;&minus;&nbsp;1 and ''n''&nbsp;&minus;&nbsp;''p''<sub>(2)</sub>&nbsp;&minus;&nbsp;1 degrees of freedom.
  +
  +
A similar but more restricted test is presented in [[F-test#Regression problems|F-test]].
  +
See also [[Lack-of-fit sum of squares]].
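A sketch of this model-comparison ratio on made-up data that is deliberately curved, so that a quadratic model (model 2) fits better than a straight line (model 1); all data values below are assumptions.

<source lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 40
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + 0.3 * x ** 2 + rng.normal(0, 1.0, n)   # assumed curved data

def sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X1 = np.column_stack([np.ones(n), x])             # model 1: linear, p1 = 1
X2 = np.column_stack([np.ones(n), x, x ** 2])     # model 2: quadratic, p2 = 2
SSE1, SSE2 = sse(X1, y), sse(X2, y)

p1, p2 = 1, 2
Q = (SSE1 / (n - p1 - 1)) / (SSE2 / (n - p2 - 1))
crit = stats.f.ppf(0.95, n - p1 - 1, n - p2 - 1)  # 95th percentile of F
print(Q, crit, Q > crit)                          # large Q favours the richer model
</source>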
  +
  +
==Example==
  +
  +
===Regression analysis===
  +
To illustrate the various goals of regression, we give an example. The following data set gives the average heights and weights for American women aged 30–39 (source: ''The World Almanac and Book of Facts, 1975'').
  +
:{|
  +
|Height (m) ||1.47|| 1.5|| 1.52|| 1.55|| 1.57|| 1.60|| 1.63|| 1.65|| 1.68|| 1.7|| 1.73|| 1.75|| 1.78|| 1.8|| 1.83
  +
|-
  +
|Weight (kg) ||52.21|| 53.12 ||54.48 ||55.84 ||57.2 ||58.57 ||59.93|| 61.29|| 63.11 ||64.47 ||66.28|| 68.1|| 69.92|| 72.19|| 74.46
  +
|}
  +
A plot of weight against height (see below) shows that it cannot be modeled by a straight line, so a regression is performed by modeling the data by a parabola.
  +
:<math> Y_i = \beta_0 + \beta_1 X_i + \beta_2 X^2_i +\varepsilon_i \! </math>
  +
where the dependent variable <math> Y_i </math> is weight and the independent variable <math> X_i </math> is height.
  +
  +
Place the observations <math> x_i, \ x_i^2, \, i = 1, \ldots, n </math>, in the matrix '''X'''.
  +
:<math> \mathbf{X} =
\begin{pmatrix}
1&1.47&2.16\\
1&1.50&2.25\\
1&1.52&2.31\\
1&1.55&2.40\\
1&1.57&2.46\\
1&1.60&2.56\\
1&1.63&2.66\\
1&1.65&2.72\\
1&1.68&2.82\\
1&1.70&2.89\\
1&1.73&2.99\\
1&1.75&3.06\\
1&1.78&3.17\\
1&1.80&3.24\\
1&1.83&3.35\\
\end{pmatrix}</math>
  +
[[Image:BMIregression.png|right|388px]]
[[Image:BMIresiduals.png|right|386px]]
  +
  +
The values of the parameters are found by solving the normal equations
   
==Multiple linear regression==
+
:<math>(\mathbf X^T \mathbf X)\hat{\boldsymbol\beta} = \mathbf X^T \mathbf y</math>
   
Linear regression can be extended to functions of two or more variables, for example
+
Element ''ij'' of the normal equation matrix, <math>\mathbf X^T \mathbf X</math> is formed by summing the products of column ''i'' and column ''j'' of '''X'''.
   
: <math>Z = \alpha X + \beta Y + \gamma +\varepsilon. \,</math>
+
:<math>\left(\mathbf X^T \mathbf X\right)_{ij} = \sum_{k = 1}^{15} x_{ki} x_{kj}</math>
   
Here ''X'' and ''Y'' are explanatory variables.
+
Element ''i'' of the right-hand side vector <math> \mathbf X^T \mathbf y </math> is formed by summing the products of column ''i'' of '''X''' with the column of dependent variable values.
The values of the parameters <math>\alpha,\beta</math> and <math>\gamma</math> are estimated by the [[least squares|'''method of least squares''']], which minimizes the sum of squares of the residuals
 
   
: <math>S_r = \sum_{i=1}^n (\widehat{\alpha} x_i + \widehat{\beta} y_i + \widehat{\gamma} - z_i)^2. \,</math>
+
:<math>\left(\mathbf X^T \mathbf y\right)_i= \sum_{k = 1}^{15} x_{ki} y_k</math>
   
Take derivatives with respect to <math>\widehat{\alpha}, \widehat{\beta}, \widehat{\gamma}</math> in <math>S_r</math> and set them equal to zero. This leads to a system of three linear equations, the '''normal equations''', for the estimates of the parameters <math>\alpha,\beta</math> and <math>\gamma</math>:

:<math>
\begin{bmatrix}
\sum_{i=1}^n x_i^2 & \sum_{i=1}^n x_i y_i & \sum_{i=1}^n x_i \\
\sum_{i=1}^n x_i y_i & \sum_{i=1}^n y_i^2 & \sum_{i=1}^n y_i \\
\sum_{i=1}^n x_i & \sum_{i=1}^n y_i & n
\end{bmatrix}
\begin{bmatrix}
\widehat{\alpha} \\
\widehat{\beta} \\
\widehat{\gamma}
\end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^n x_i z_i \\
\sum_{i=1}^n y_i z_i \\
\sum_{i=1}^n z_i
\end{bmatrix}
</math>
+
Thus, the normal equations for the example are

:<math>
\begin{pmatrix}
15&24.76&41.05\\
24.76&41.05&68.37\\
41.05&68.37&114.35\\
\end{pmatrix}
\begin{pmatrix}
\hat\beta_0\\
\hat\beta_1\\
\hat\beta_2\\
\end{pmatrix}
=
\begin{pmatrix}
931\\
1548\\
2586\\
\end{pmatrix}
</math>
   
===Polynomial fitting===
+
:<math>\hat\beta_0=129 \pm 16</math> (value <math>\pm</math> standard deviation)
  +
:<math>\hat\beta_1=-143 \pm 20</math>
  +
:<math>\hat\beta_2=62 \pm 6</math>
   
A polynomial fit is a specific type of multiple regression. The simple regression model (a first-order polynomial) can be trivially extended to higher orders. The regression model <math>y_i = \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 \dots + \alpha_m x_i^m + \varepsilon_i </math> is a system of polynomial equations of order ''m'' with polynomial coefficients <math>\{ \alpha_0 \dots \alpha_m \}</math>. As before, we can express the model using data matrix '''X''', ''target'' vector '''Y''' and parameter vector <math>\delta</math>. The ''i''th row of '''X''' and '''Y''' will contain the ''x'' and ''y'' value for the ''i''th data sample. Then the model can be written as a system of linear equations:
+
The calculated values are given by
  +
:<math> \hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i+ \hat\beta_2 x^2_i</math>
  +
The observed and calculated data are plotted together and the residuals, <math>y_i - \hat{y}_i</math>, are calculated and plotted. [[Standard deviation]]s are calculated using the sum of squares, ''S'' = 0.76.
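For reference, a short NumPy sketch that refits the example from the tabulated heights and weights; the printed estimates and residual sum of squares should agree with the values quoted above up to rounding.

<source lang="python">
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Design matrix with columns 1, x, x^2 as in the example
X = np.column_stack([np.ones(height.size), height, height ** 2])

beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
resid = weight - X @ beta_hat
S = resid @ resid
print(beta_hat)   # approximately (129, -143, 62)
print(S)          # approximately 0.76
</source>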
   
:<math> \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}= \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^m \\ 1 & x_2 & x_2^2 & \dots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^m \end{bmatrix} \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_m \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} </math>
+
The confidence intervals are computed using:
   
which when using pure matrix notation remains, as before:
+
:<math>\left[\hat\beta_j - \hat\sigma_j\, t_{\frac{\alpha}{2},n-p-1};\ \hat\beta_j + \hat\sigma_j\, t_{\frac{\alpha}{2},n-p-1}\right]</math>
   
: <math>Y = X \delta + \varepsilon \,</math>
+
with <math>\alpha</math> = 5% and ''n''&nbsp;&minus;&nbsp;''p''&nbsp;&minus;&nbsp;1 = 12 degrees of freedom, <math>t_{\frac{\alpha}{2},n-p-1}</math> = 2.2. Therefore, we can say that the 95% [[confidence interval]]s are:
   
and the vector of polynomial coefficients is
+
:<math>\hat\beta_0\in[92.9,164.7]</math>
   
: <math>\widehat{\delta} = (X' X)^{-1}\; X' Y \,</math>
+
:<math>\hat\beta_1\in[-186.8,-99.5]</math>
   
The simple regression model is a special case of a polynomial fit, having the polynomial order ''m''&nbsp;=&nbsp;1.
+
:<math>\hat\beta_2\in[48.7,75.2]</math>
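The general polynomial fit described above (design-matrix columns 1, ''x'', ''x''<sup>2</sup>, ..., ''x''<sup>''m''</sup>) can be sketched as follows on made-up data; the order ''m'' and all numbers are assumptions, and <code>numpy.polyfit</code> is shown only as a cross-check.

<source lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3                                   # m: assumed polynomial order
x = rng.uniform(-2, 2, n)
y = 1.0 - 0.5 * x + 0.25 * x ** 3 + rng.normal(0, 0.2, n)   # assumed cubic data

# Columns 1, x, x^2, ..., x^m (increasing powers): the polynomial design matrix
X = np.vander(x, N=m + 1, increasing=True)

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # alpha_0, ..., alpha_m
print(coeffs)

# np.polyfit gives the same coefficients, reported highest power first
print(np.polyfit(x, y, deg=m)[::-1])
</source>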
   
===Coefficient of determination===
+
===Goodness of fit===
{{main|Coefficient of determination}}
+
Assume that the following two models have been proposed for the data in this example:
   
The [[coefficient of determination]], ''R''<sup>2</sup>, is a measure of the proportion of variability explained by, or due to the regression (linear relationship), in a sample of paired data. The value of the coefficient of determination can range between zero and one, where a value close to zero suggests a poor model.
+
# A linear model with &beta;<sub>(1)0</sub> and &beta;<sub>(1)1</sub> as unknown parameters.
  +
# A quadratic model solved above.
   
==Examples==
+
Which of these two is better?
Linear regression is widely used in biological, behavioural and social sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines.
 
   
===Medicine===
+
The [[null hypothesis]] is that there is no difference between these two models. The residual sums of squares are SSE<sub>(1)</sub> = 7.49 and SSE<sub>(2)</sub> = 0.76.
As one example, early evidence relating [[tobacco smoking]] to mortality and [[morbidity]] came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce [[spurious correlation]]s. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, [[randomized controlled trial]]s are considered to be more trustworthy than a regression analysis.
 
   
===Finance===
+
:<math>Q = \frac {\text{SSE}_{(1)} / (n - p_{(1)} - 1)} {\text{SSE}_{(2)} / (n - p_{(2)} - 1)} = \frac {7.49 / (15 - 1 - 1)} {0.76 / (15 - 2 - 1)} = 9.1</math>

Linear regression underlies the [[capital asset pricing model]], and the concept of using [[Beta coefficient|Beta]] for analyzing and quantifying the systematic risk of an investment. This comes directly from the [[Beta coefficient]] of the linear regression model that relates the return on the investment to the return on all risky assets.
   
==References==
+
The 95th percentile of the [[F-distribution]] with 13 and 12 degrees of freedom is ''F''<sub>13,12</sub>(0.95) = 2.66. Since ''Q'' = 9.1 > 2.66, the null hypothesis can be rejected at the 5% significance level.

<references />
   
===Additional sources===
+
==Examining results of regression models==
  +
===Checking model assumptions===
  +
<!-- [[Image:Ht_wt_residuals.png|300 px|thumb|right|residuals from [[regression analysis#example|regression analysis (example)]]]] -->
  +
Some of the model assumptions can be evaluated by calculating the residuals and plotting or otherwise analyzing them.
  +
The following plots can be constructed to test the validity of the assumptions:
  +
#Residuals against the explanatory variables in the model, as illustrated above. The residuals should have no relation to these variables (look for possible non-linear relations) and the spread of the residuals should be the same over the whole range.
  +
#Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
  +
#Residuals against the fitted values, <math>\hat\mathbf y\,</math>.
  +
#A [[time series]] plot of the residuals, that is, plotting the residuals as a function of time.
  +
#Residuals against the preceding residual.
  +
#A [[normal probability plot]] of the residuals to test normality. The points should lie along a straight line.
  +
  +
Apart from the last (normal probability) plot, there should be no noticeable pattern in any of these displays; a minimal plotting sketch is given below.
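A sketch of three of these diagnostics (residuals against the explanatory variable, residuals against fitted values, and a hand-built normal probability plot), using simulated data with assumed values:

<source lang="python">
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(0, 10, n)
y = 2.0 + 0.8 * x + rng.normal(0, 1.0, n)      # assumed simple model
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(x, resid); axes[0].axhline(0, color="grey")
axes[0].set_title("residuals vs explanatory variable")
axes[1].scatter(fitted, resid); axes[1].axhline(0, color="grey")
axes[1].set_title("residuals vs fitted values")
# Normal probability (Q-Q) plot: sorted residuals against standard normal
# quantiles at the plotting positions (i - 0.5)/n
q = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
axes[2].scatter(q, np.sort(resid))
axes[2].set_title("normal probability plot")
plt.tight_layout()
plt.show()
</source>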
  +
  +
===Assessing goodness of fit===
  +
  +
#The [[coefficient of determination]] gives what fraction of the observed variance of the response variable can be explained by the given variables.
  +
#Examine the observational and prediction confidence intervals. In most contexts, the smaller they are the better.
  +
  +
==Other procedures==
  +
=== Generalized least squares ===
  +
[[Generalized least squares]], which includes [[weighted least squares]] as a special case, can be used when the observational errors have unequal variance or serial correlation.
  +
  +
=== Errors-in-variables model ===
  +
An [[errors-in-variables model]] or [[total least squares]] approach can be used when the independent variables are themselves subject to error.
  +
  +
=== Generalized linear model ===
  +
  +
A [[generalized linear model]] is used when the distribution function of the errors is not a normal distribution. Examples include the [[exponential distribution]], [[gamma distribution]], [[inverse Gaussian distribution]], [[Poisson distribution]], [[binomial distribution]], and [[multinomial distribution]].
  +
  +
=== Robust regression ===
  +
{{main |robust regression}}
  +
  +
A host of alternative approaches to the computation of regression parameters are included in the category known as [[robust regression]]. One technique minimizes the mean [[approximation error|absolute error]], or some other function of the residuals, instead of the mean squared error as in linear regression. Robust regression is much more computationally intensive than linear regression and is somewhat more difficult to implement as well. While least squares estimates are not very sensitive to violations of the normality assumption for the errors, this is not true when the variance or mean of the error distribution is not bounded, or when no analyst is available to identify outliers.
  +
  +
Among [[Stata]] users, ''robust regression'' is frequently taken to mean linear regression with Huber-White standard error estimates, due to the naming conventions for regression commands. This procedure relaxes the assumption of [[homoscedasticity]] for variance estimates only; the coefficient estimates are still ordinary least squares (OLS) estimates. This occasionally leads to confusion; [[Stata]] users sometimes believe that linear regression is a robust method when this option is used, although it is actually not robust in the sense of outlier-resistance.
  +
  +
=== Instrumental variables and related methods===
  +
The assumption that the error term in the linear model can be treated as uncorrelated with the independent variables will frequently be untenable, as [[omitted-variables bias]], "reverse" causation, and errors-in-variables problems can generate such a correlation. [[Instrumental variable]] and other methods can be used in such cases.
  +
  +
== Applications of linear regression==
  +
Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.
  +
===Trend line ===
  +
:''For trend lines as used in [[technical analysis]], see [[Trend lines (technical analysis)]]''
  +
  +
A '''trend line''' represents a [[trend (statistics)|trend]], the long-term movement in [[time series]] data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over a period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher-degree polynomials depending on the degree of curvature desired in the line.
  +
  +
Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.
  +
  +
=== Epidemiology ===
  +
As one example, early evidence relating [[tobacco smoking]] to mortality and [[morbidity]] came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce [[spurious correlation]]s. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, [[randomized controlled trial]]s are often able to generate more compelling evidence of causal relationships than correlational analysis using linear regression. When controlled experiments are not feasible, variants of regression analysis such as [[instrumental variables]] and other methods may be used to attempt to estimate causal relationships from observational data.

==See also==

<div class="references-small" style="-moz-column-count:3; column-count:3;">
* [[Anscombe's quartet]]
* [[Censored regression model]]
* [[Cross-sectional regression]]
* [[Econometrics]]
* [[Empirical Bayes methods]]
* [[Hierarchical linear modeling]]
* [[Instrumental variable]]
* [[Lack-of-fit sum of squares]]
* [[Least squares]]
* [[Least-squares estimation of linear regression coefficients]]
* [[M-estimator]]
* [[Median-median line]]
* [[Multiple regression]]
* [[Nonlinear regression]]
* [[Nonparametric regression]]
* [[Regression analysis]]
* [[Ridge regression]]
* [[Robust regression]]
* [[Segmented regression]]
* [[Truncated regression model]]
</div>

==Notes==
{{reflist}}

==References==

* Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). ''Applied multiple regression/correlation analysis for the behavioral sciences'' (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
* [[Charles Darwin]]. ''The Variation of Animals and Plants under Domestication''. (1869) ''(Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)''
* Draper, N.R., & Smith, H. (1998). ''Applied Regression Analysis''. Wiley Series in Probability and Statistics.
* Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," ''Journal of the Anthropological Institute'', 15:246-263 (1886). ''(Facsimile at: [http://www.mugu.com/galton/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf])''
* Robert S. Pindyck and Daniel L. Rubinfeld (1998, 4th ed.). ''Econometric Models and Economic Forecasts'', ch. 1 (introduction, including appendices on Σ operators and the derivation of parameter estimates) and Appendix 4.3 (multiple regression in matrix form).
* {{citation | last1=Kaw | first1=Autar | last2=Kalu | first2=Egwu | year=2008 | title=Numerical Methods with Applications | edition=1st | publisher=[http://www.autarkaw.com] }} Chapter 6 deals with linear and non-linear regression.
 
==External links==

* http://homepage.mac.com/nshoffner/nsh/CalcBookAll/Chapter%201/1functions.html
* [http://ssrn.com/abstract=1076742 Investment Volatility: A Critique of Standard Beta Estimation and a Simple Way Forward, C. Tofallis]. Downloadable version of the paper, subsequently published in the ''European Journal of Operational Research'', 2008.
* [http://www.cs.tut.fi/~lasip Scale-adaptive nonparametric regression] (with Matlab software).
* [http://www.dangoldstein.com/regression.html Visual Least Squares]: an interactive, visual Flash demonstration of how linear regression works.
* [http://www.hedengren.net/research/ In Situ Adaptive Tabulation]: combining many linear regressions to approximate any nonlinear function.
* [http://jeff560.tripod.com/mathword.html Earliest Known Uses of Some of the Words of Mathematics]. See: [http://jeff560.tripod.com/e.html] for "error", [http://jeff560.tripod.com/g.html] for "Gauss-Markov theorem", [http://jeff560.tripod.com/m.html] for "method of least squares", and [http://jeff560.tripod.com/r.html] for "regression".
* [http://www.wessa.net/esteq.wasp Online linear regression calculator].
* [http://www.mathpages.com/home/kmath110.htm Perpendicular Regression of a Line] at MathPages.
* [http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/ Online regression by eye] (simulation).
* [http://www.vias.org/simulations/simusoft_leverage.html Leverage Effect]: interactive simulation showing the effect of outliers on regression results.
* [http://www.vias.org/simulations/simusoft_linregr.html Linear regression as an optimisation problem].
* [http://www.visualstatistics.net/Visual%20Statistics%20Multimedia/regression_analysis.htm Visual Statistics with Multimedia].
* [http://www.egwald.ca/statistics/statistics.php3 Multiple Regression] by Elmer G. Wiens: online multiple and restricted multiple regression package.
* [http://zunzun.com/ ZunZun.com]: online curve and surface fitting.
* [http://www.causeweb.org CAUSEweb.org]: many resources for teaching statistics, including linear regression.
* [http://www.angelfire.com/tx4/cus/regress/index.html Multivariate Regression]: Python, Smalltalk and Java implementations of linear regression calculation.
* [http://www.neas-seminars.com/discussions/messages.aspx?ForumID=194 "Mahler's Guide to Regression"]
* [http://numericalmethods.eng.usf.edu/topics/linear_regression.html Linear Regression]: notes, PPT, videos, Mathcad, Matlab, Mathematica and Maple resources at [http://numericalmethods.eng.usf.edu Numerical Methods for STEM undergraduates].
* [http://www.dss.uniud.it/utenti/rizzi/econometrics_part1_file/restricted-reg.pdf Restricted regression]: lecture notes from the Department of Statistics, University of Udine.

{{Statistics}}
{{Least Squares and Regression Analysis}}

[[Category:Regression analysis]]
[[Category:Estimation theory]]
[[Category:Parametric statistics]]
[[Category:Statistical correlation]]
[[Category:Statistical regression]]

<!--
[[ar:انحدار خطي]]
[[cs:Lineární regrese]]
[[de:Regressionsanalyse]]
[[es:Regresión lineal]]
[[eu:Karratu txikienen erregresio zuzen]]
[[fa:رگرسیون خطی]]
[[fr:Régression linéaire multiple]]
[[ko:선형 회귀]]
[[it:Regressione lineare]]
[[he:רגרסיה לינארית]]
[[nl:Lineaire regressie]]
[[ja:線形回帰]]
[[no:Lineær regresjon]]
[[pl:Regresja liniowa]]
[[pt:Regressão linear]]
[[sv:Multipel linjär regression]]
[[zh:線性回歸]]
-->

{{enWP|Linear regression}}





