# Linear regression

In statistics, **linear regression** is a regression method of modeling the conditional expected value of one variable *y* given the values of some other variable or variables *x*.

Linear regression is called "linear" because the relation of the response to the explanatory variables is assumed to be a linear function of some parameters. It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of $y = \alpha + \beta x$ is a line. But in fact, if the model is (for example)

$$y = \alpha + \beta x + \gamma x^2$$

(in which case we have put the vector $(\beta, \gamma)$ in the role formerly played by $\beta$ and the vector $(x, x^2)$ in the role formerly played by $x$), then the problem is still one of **linear** regression, even though the graph is not a straight line.

Regression models which are not a linear function of the parameters are called nonlinear regression models (for example, a multi-layer artificial neural network).

Still more generally, regression may be viewed as a special case of density estimation. The joint distribution of the response and explanatory variables can be constructed from the conditional distribution of the response variable and the marginal distribution of the explanatory variables. In some problems, it is convenient to work in the other direction: from the joint distribution, the conditional distribution of the response variable can be derived.

A fitted regression line can be extrapolated, that is, extended to values of the explanatory variables outside the range of the original data. However, extrapolation may be very inaccurate and can be relied on only in certain instances.

## Historical remarks

The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805,^{[1]} and by Gauss in 1809.^{[2]} The term "least squares" is from Legendre's term, *moindres carrés*. However, Gauss claimed that he had known the method since 1795.

Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the sun. Euler had worked on the same problem (1748) without success. Gauss published a further development of the theory of least squares in 1821,^{[3]} including a version of the Gauss-Markov theorem.

The term "reversion" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant ancestors. Francis Galton studied this phenomenon, and applied the slightly misleading term "regression towards mediocrity" to it (parents of exceptional individuals also tend on average to be less exceptional than their children). For Galton, regression had only this biological meaning, but his work (1877, 1885)^{[4]}^{[5]} was extended by Udny Yule and Karl Pearson to a more general statistical context (1897, 1903).^{[6]}^{[7]} In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925.^{[8]}^{[9]} Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.

## Naming conventions

The measured variable, *y*, is conventionally called the "response variable". The terms "endogenous variable," "output variable," "criterion variable," and "dependent variable" are also used. The controlled or manipulated variables, *x*, are called the "explanatory variables". The terms "exogenous variables," "input variables," "predictor variables" and "independent variables" are also used.

The terms *independent* or *explanatory* variable suggest that the variables are statistically independent, which is not what the terms describe. These terms may also convey that the value of the independent/explanatory variable can be chosen at will. The *response* variable is then seen as an effect, that is, causally dependent on the independent/explanatory variable, as in a stimulus-response model. Although many linear regression models are formulated as models of cause and effect, the direction of causation may just as well go the other way, or indeed there need not be any causal relation at all. For that reason, one may prefer the terms "predictor / response" or "endogenous / exogenous," which do not imply causality. (See correlation does not imply causation for more on the topic of cause and effect relationships in correlative designs.)

The explanatory and response variables may be scalars or vectors. Multiple linear regression includes cases with more than one explanatory variable.

## Statement of the simple linear regression model

The simple linear regression model is typically stated in the form

$$y = \alpha + \beta x + \varepsilon.$$

The right-hand side may take more general forms, but generally comprises a linear combination of the parameters, here denoted α and β. The term ε represents the unpredicted or unexplained variation in the response variable; it is conventionally called the "error" whether it is really a measurement error or not, and is assumed to be independent of *x*. The error term is conventionally assumed to have expected value equal to zero (a nonzero expected value could be absorbed into α). See also errors and residuals in statistics; the difference between an error and a residual is also dealt with below.

An equivalent formulation which explicitly shows the linear regression as a model of conditional expectation is

$$\operatorname{E}(y \mid x) = \alpha + \beta x,$$

with the conditional distribution of *y* given *x* essentially the same as the distribution of the error term.

A linear regression model need not be affine, let alone linear, in the explanatory variables *x*. For example,

$$y = \alpha + \beta x + \gamma x^2 + \varepsilon$$

is a linear regression model, for the right-hand side is a linear combination of the parameters α, β, and γ. In this case it is useful to think of *x*^{2} as a new explanatory variable, formed by modifying the original variable *x*. Indeed, any linear combination of functions *f*(*x*), *g*(*x*), *h*(*x*), ..., is a linear regression model, so long as these functions do not have any free parameters (otherwise the model is generally a nonlinear regression model). The least-squares estimates of α, β, and γ are linear in the response variable *y*, and nonlinear in *x* (they are nonlinear in *x* even if the γ and α terms are absent; if only β were present then doubling all observed *x* values would multiply the least-squares estimate of β by 1/2).
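The claim about linearity in *y* and nonlinearity in *x* can be checked numerically. The sketch below assumes the one-parameter model *y* = β*x* + ε, for which the least-squares estimate has the standard closed form Σ*x*_{i}*y*_{i} / Σ*x*_{i}^{2} (not spelled out in this article); the function name and the data are made up for illustration.

```python
def slope_through_origin(xs, ys):
    """Least-squares estimate of beta in the model y = beta * x + error."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]    # made-up response values
zs = [0.5, -1.0, 2.0, 1.5]   # a second made-up response vector

b_y = slope_through_origin(xs, ys)
b_z = slope_through_origin(xs, zs)

# Linear in the response: the estimate for y + z is the sum of the estimates.
b_sum = slope_through_origin(xs, [y + z for y, z in zip(ys, zs)])

# Nonlinear in x: doubling every x halves the estimate rather than doubling it.
b_doubled_x = slope_through_origin([2 * x for x in xs], ys)

print(b_sum - (b_y + b_z))   # close to 0
print(b_doubled_x / b_y)     # close to 0.5
```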

Often in linear regression problems statisticians rely on the Gauss-Markov assumptions:

- The random errors ε_{i} have expected value 0.
- The random errors ε_{i} are uncorrelated (this is weaker than an assumption of probabilistic independence).
- The random errors ε_{i} are homoscedastic, i.e., they all have the same variance.

(See also Gauss-Markov theorem. That result says that under the assumptions above, least-squares estimators are in a certain sense optimal.)

Sometimes a stronger set of assumptions is relied on:

- The random errors ε_{i} have expected value 0.
- They are independent.
- They are normally distributed.
- They are homoscedastic.

If *x*_{i} is a vector, we can take the product β*x*_{i} to be a scalar product (see "dot product").

A statistician will usually **estimate** the unobservable values of the parameters α and β by the **method of least squares**, which consists of finding the values of *a* and *b* that minimize the sum of squares of the **residuals**

$$\hat\varepsilon_i = y_i - a - b x_i.$$

Those values of *a* and *b* are the "least-squares estimates" of α and β respectively. The residuals may be regarded as estimates of the errors; see also errors and residuals in statistics.

Notice that, whereas the errors are independent, the residuals cannot be independent, because the use of least-squares estimates implies that the sum of the residuals must be 0 and the scalar product of the vector of residuals with the vector of *x*-values must be 0. Writing $\hat\varepsilon_i$ for the *i*th residual, we must have

$$\hat\varepsilon_1 + \cdots + \hat\varepsilon_n = 0$$

and

$$\hat\varepsilon_1 x_1 + \cdots + \hat\varepsilon_n x_n = 0.$$

These two linear constraints imply that the vector of residuals must lie within a certain (*n* − 2)-dimensional subspace of R^{n}; hence we say that there are "*n* − 2 degrees of freedom for error". If one assumes the errors are normally distributed and independent, then it can be shown to follow that

- the sum of squares of residuals $\hat\varepsilon_1^2 + \cdots + \hat\varepsilon_n^2$ is distributed as $\sigma^2 \chi^2_{n-2}$; that is, the sum of squares divided by the error variance σ^{2} has a chi-square distribution with *n* − 2 degrees of freedom, and
- the sum of squares of residuals is probabilistically independent of the estimates of the parameters α and β.

These facts make it possible to use Student's t-distribution with *n* − 2 degrees of freedom (so named in honor of the pseudonymous "Student") to find confidence intervals for α and β.
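As a rough illustration, the following sketch fits a simple regression to made-up data, checks the two linear constraints on the residuals, and computes the residual variance estimate on *n* − 2 degrees of freedom. The standard-error formulas are the usual textbook ones and are an assumption here, since the article does not state them; a confidence interval would additionally need a quantile of Student's t-distribution with *n* − 2 degrees of freedom, which is left to a table or a statistics library.

```python
import math

def fit_simple(xs, ys):
    """Return the least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.8, 4.1, 5.9, 8.2, 9.9, 12.3]   # made-up data

n = len(xs)
a, b = fit_simple(xs, ys)
residuals = [y - a - b * x for x, y in zip(xs, ys)]

# The two constraints: residuals sum to zero and are orthogonal to x.
sum_resid = sum(residuals)
dot_resid_x = sum(e * x for e, x in zip(residuals, xs))

# Residual variance estimate with n - 2 degrees of freedom.
s2 = sum(e * e for e in residuals) / (n - 2)

# Textbook standard errors of the slope and intercept (assumed formulas).
x_bar = sum(xs) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
se_b = math.sqrt(s2 / sxx)
se_a = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))

print(sum_resid, dot_resid_x)   # both close to 0
print(a, b, se_a, se_b)
```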

## Parameter estimation using least-squares

By recognizing that the regression model is a system of linear equations, we can express the model using a data matrix **X**, a *target* vector **Y** and a parameter vector $\delta = (\alpha, \beta)'$. The *i*th rows of **X** and **Y** contain the *x* and *y* values for the *i*th data sample. Then the model can be written as

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},$$

which when using pure matrix notation becomes

$$Y = X\delta + \varepsilon,$$

where ε is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance σ^{2}*I*_{n}, where *I*_{n} is the *n* × *n* identity matrix.

The least-squares estimator for $\delta$ is

$$\hat\delta = (X'X)^{-1}X'Y$$

(where *X*′ is the transpose of *X*) and the sum of squares of residuals is

$$Y'\left(I_n - X(X'X)^{-1}X'\right)Y.$$

One of the properties of least squares is that the vector $X\hat\delta$ is the orthogonal projection of *Y* onto the column space of *X*.

The fact that the matrix *X*(*X*′*X*)^{−1}*X*′ is a symmetric idempotent matrix is incessantly relied on in proofs of theorems. The linearity of $\hat\delta$ as a function of the vector *Y*, expressed above by saying

$$\hat\delta = (X'X)^{-1}X'Y,$$

is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.

The matrix *I*_{n} − *X*(*X*′*X*)^{−1}*X*′ that appears above is a symmetric idempotent matrix of rank *n* − 2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix *M* can be diagonalized by an orthogonal matrix *G*, i.e., the matrix *G*′*MG* is a diagonal matrix. If the matrix *M* is also idempotent, then the diagonal entries in *G*′*MG* must be idempotent numbers. Only two real numbers are idempotent: 0 and 1. So *I*_{n} − *X*(*X*′*X*)^{−1}*X*′, after diagonalization, has *n* − 2 1s and two 0s on the diagonal (the number of 1s equals the rank). That is most of the work in showing that the sum of squares of residuals has a chi-square distribution with *n* − 2 degrees of freedom.
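The symmetry, idempotence and rank claims can be verified numerically for a small example. This is a minimal sketch for the simple-regression design matrix with rows (1, *x*_{i}), using the closed-form inverse of the 2 × 2 matrix *X*′*X*; the function name and data are illustrative assumptions. It relies on the fact that the trace of an idempotent matrix equals its rank.

```python
def hat_matrix(xs):
    """H = X (X'X)^{-1} X' for the design matrix X with rows (1, x_i)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx   # determinant of X'X = [[n, sx], [sx, sxx]]
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    rows = [(1.0, x) for x in xs]
    return [[sum(ri[p] * inv[p][q] * rj[q] for p in range(2) for q in range(2))
             for rj in rows] for ri in rows]

xs = [0.0, 1.0, 2.0, 4.0, 7.0]   # made-up x-values
n = len(xs)
H = hat_matrix(xs)

# M = I_n - H is the matrix that maps Y to the vector of residuals.
M = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]
MM = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

max_symmetry_error = max(abs(M[i][j] - M[j][i])
                         for i in range(n) for j in range(n))
max_idempotent_error = max(abs(MM[i][j] - M[i][j])
                           for i in range(n) for j in range(n))
trace_M = sum(M[i][i] for i in range(n))   # equals the rank, n - 2

print(max_symmetry_error, max_idempotent_error, trace_M)
```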

Regression parameters can also be estimated by Bayesian methods. This has the advantages that

- confidence intervals can be produced for parameter estimates without the use of asymptotic approximations,
- prior information can be incorporated into the analysis.

Suppose that in the linear regression

$$y = \alpha + \beta x + \varepsilon$$

we know from domain knowledge that α can only take one of the values {−1, +1} but we do not know which. We can build this information into the analysis by choosing a prior for α which is a discrete distribution with a probability of 0.5 on −1 and 0.5 on +1. The posterior for α will also be a discrete distribution on {−1, +1}, but the probability weights will change to reflect the evidence from the data.
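The update of the prior weights can be sketched in a few lines. To keep the example short it makes simplifying assumptions that are not in the article: the slope β and the error standard deviation σ are treated as known, and the errors are taken to be independent and normal. A full Bayesian analysis would also place priors on β and σ. The function name and data are hypothetical.

```python
import math

def posterior_alpha(xs, ys, beta, sigma, prior=None):
    """Posterior weights for alpha in {-1, +1}, with beta and sigma assumed known."""
    if prior is None:
        prior = {-1.0: 0.5, 1.0: 0.5}
    log_post = {}
    for alpha, p in prior.items():
        sse = sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))
        # Log prior plus Gaussian log-likelihood, dropping terms common to both values.
        log_post[alpha] = math.log(p) - sse / (2.0 * sigma ** 2)
    largest = max(log_post.values())
    weights = {a: math.exp(lp - largest) for a, lp in log_post.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.2, 2.9, 5.1, 6.8]   # made-up data, roughly 1 + 2x

post = posterior_alpha(xs, ys, beta=2.0, sigma=1.0)
print(post)   # most of the weight moves to alpha = +1
```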

In modern computer applications, the actual value of the least-squares estimate is calculated using the QR decomposition, or slightly more sophisticated methods when *X*′*X* is nearly singular. The code for the MATLAB `\` function is an excellent example of a robust method.
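The idea behind the QR approach can be sketched as follows: factor *X* = *QR* with orthonormal columns in *Q* and upper-triangular *R*, then solve *R*δ = *Q*′*Y* by back substitution, so that *X*′*X* is never formed or inverted. The sketch uses modified Gram-Schmidt for brevity; that is an assumption made for illustration, and production libraries typically use Householder reflections with pivoting instead. The function name is hypothetical.

```python
import math

def qr_least_squares(X, y):
    """Minimise ||y - X d|| using a thin QR factorisation (modified Gram-Schmidt)."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    Q = []                                  # orthonormal columns, built one by one
    R = [[0.0] * p for _ in range(p)]       # upper-triangular factor
    for j in range(p):
        v = cols[j][:]
        for k in range(len(Q)):
            R[k][j] = sum(Q[k][i] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(n)]
        R[j][j] = math.sqrt(sum(t * t for t in v))
        if R[j][j] == 0.0:
            raise ValueError("columns of X are linearly dependent")
        Q.append([t / R[j][j] for t in v])
    # Solve R d = Q'y by back substitution.
    qty = [sum(Q[k][i] * y[i] for i in range(n)) for k in range(p)]
    d = [0.0] * p
    for j in range(p - 1, -1, -1):
        s = qty[j] - sum(R[j][k] * d[k] for k in range(j + 1, p))
        d[j] = s / R[j][j]
    return d

X = [[1.0, x] for x in [0.0, 1.0, 2.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]    # exactly 1 + 2x, so the fit should recover (1, 2)
alpha_hat, beta_hat = qr_least_squares(X, y)
print(alpha_hat, beta_hat)
```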

### Robust regression

A host of alternative approaches to the computation of regression parameters are included in the category known as robust regression. One technique minimizes the mean absolute error, or some other function of the residuals, instead of the mean squared error used in ordinary linear regression. Robust regression is much more computationally intensive than ordinary least squares and is somewhat more difficult to implement as well. While least-squares estimates are not very sensitive to departures from the assumption of normally distributed errors, this is no longer true when the variance or mean of the error distribution is not bounded, or when no analyst who can identify outliers is available.

In the Stata culture, *robust regression* means linear regression with Huber-White standard error estimates. This relaxes the assumption of homoskedasticity for the variance estimates only; the coefficient estimates are still ordinary least squares (OLS) estimates.
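To give a feel for the absolute-error approach, the sketch below approximates a least-absolute-deviations fit by iteratively reweighted least squares, where each observation is weighted by the reciprocal of its current absolute residual. This particular algorithm is chosen here only for illustration and is not taken from the article; the function names, the iteration count and the small constant `eps` are assumptions. The data lie exactly on *y* = 1 + 2*x* except for one gross outlier.

```python
def weighted_fit(xs, ys, ws):
    """Weighted least-squares intercept and slope for y = a + b*x."""
    sw = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / sw
    y_bar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def lad_fit(xs, ys, iterations=100, eps=1e-8):
    """Approximate least-absolute-deviations fit by reweighting."""
    a, b = weighted_fit(xs, ys, [1.0] * len(xs))   # start from ordinary least squares
    for _ in range(iterations):
        # Large residuals get small weights; eps guards against division by zero.
        ws = [1.0 / max(abs(y - a - b * x), eps) for x, y in zip(xs, ys)]
        a, b = weighted_fit(xs, ys, ws)
    return a, b

def total_abs_error(xs, ys, a, b):
    return sum(abs(y - a - b * x) for x, y in zip(xs, ys))

xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[-1] = 100.0                                      # one gross outlier

a_ols, b_ols = weighted_fit(xs, ys, [1.0] * len(xs))
a_lad, b_lad = lad_fit(xs, ys)
print(b_ols, b_lad)   # the reweighted slope stays much closer to 2
```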

### Summarizing the data

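We sum the observations, the squares of the *Y*s and *X*s and the products *XY* to obtain the following quantities.

$$S_X = x_1 + x_2 + \cdots + x_n$$

and $S_Y$ similarly.

$$S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2$$

and $S_{YY}$ similarly.

$$S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n.$$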

### Estimating beta (the slope)

We use the summary statistics above to calculate $\hat\beta$, the estimate of β:

$$\hat\beta = \frac{n S_{XY} - S_X S_Y}{n S_{XX} - S_X^2}.$$

### Estimating alpha (the intercept)

We use the estimate of β and the other statistics to estimate α by:

$$\hat\alpha = \frac{S_Y - \hat\beta S_X}{n}.$$

A consequence of this estimate is that the regression line will always pass through the "center" $(\bar x, \bar y) = (S_X / n, S_Y / n)$.
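The slope and intercept calculations translate directly into code. The following is a minimal sketch using made-up data; the variable names mirror the sums of the observations, their squares and their products, and the function name is hypothetical.

```python
def fit_from_sums(xs, ys):
    """Estimate alpha and beta from the sums of x, y, x*x and x*y."""
    n = len(xs)
    s_x = sum(xs)
    s_y = sum(ys)
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    beta_hat = (n * s_xy - s_x * s_y) / (n * s_xx - s_x ** 2)
    alpha_hat = (s_y - beta_hat * s_x) / n
    return alpha_hat, beta_hat

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]   # made-up data

alpha_hat, beta_hat = fit_from_sums(xs, ys)

# The fitted line passes through the point of means (x_bar, y_bar).
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
value_at_center = alpha_hat + beta_hat * x_bar

print(alpha_hat, beta_hat, value_at_center, y_bar)
```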

### Displaying the residuals

The first method of displaying the residuals uses a histogram or cumulative distribution to depict their similarity (or lack thereof) to a normal distribution. Non-normality suggests that the model may not be a good summary description of the data.

We plot the residuals,

$$\hat\varepsilon_i = y_i - \hat\alpha - \hat\beta x_i,$$

against the explanatory variable, *x*. There should be no discernible trend or pattern if the model is satisfactory for these data. Some of the possible problems are:

- Residuals increase (or decrease) as the explanatory variable increases: this indicates mistakes in the calculations. Find the mistakes and correct them.
- Residuals first rise and then fall (or first fall and then rise): this indicates that the appropriate model is (at least) quadratic. Adding a quadratic term (and then possibly higher) to the model may be appropriate. See nonlinear regression and multiple linear regression.
- One residual is much larger than the others: this suggests that there is one unusual observation which is distorting the fit.
  - Verify its value before publishing, *or*
  - Eliminate it, document your decision to do so, and recalculate the statistics.
  - Studentized residuals can be used in outlier detection.
- The vertical spread of the residuals increases as the explanatory variable increases (funnel-shaped plot): this indicates that the homoskedasticity assumption is violated (i.e. there is heteroskedasticity: the variability of the response depends on the value of *x*).
  - Transform the data: for example, the logarithm or logit transformations are often useful.
  - Use a more general modeling approach that can account for non-constant variance, for example a general linear model or a generalized linear model.
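As an illustration of the outlier check mentioned in the list, the sketch below computes internally studentized residuals for a simple regression. It assumes the usual textbook definitions, which the article does not give: each residual is divided by *s*·sqrt(1 − *h*_{i}), where *s*^{2} is the residual variance on *n* − 2 degrees of freedom and *h*_{i} = 1/*n* + (*x*_{i} − x̄)^{2} / Σ(*x*_{j} − x̄)^{2} is the leverage. The function name and the data (with one planted outlier) are made up.

```python
import math

def studentized_residuals(xs, ys):
    """Internally studentized residuals for the fit y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    resid = [y - a - b * x for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e * e for e in resid) / (n - 2))
    result = []
    for x, e in zip(xs, resid):
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx
        result.append(e / (s * math.sqrt(1.0 - leverage)))
    return result

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.9, 7.2, 8.8, 18.0, 13.1, 14.9, 17.2]   # index 4 is a planted outlier

t = studentized_residuals(xs, ys)
suspect = max(range(len(t)), key=lambda i: abs(t[i]))
print(suspect, t[suspect])
```

The observation flagged this way would still need to be verified or documented as described in the list; the computation only points to where to look.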

### Ancillary statistics

The sum of squared deviations can be partitioned as in ANOVA to indicate what part of the dispersion of the response variable is explained by the explanatory variable.

The correlation coefficient, *r*, can be calculated by

$$r = \frac{n S_{XY} - S_X S_Y}{\sqrt{(n S_{XX} - S_X^2)(n S_{YY} - S_Y^2)}}.$$

This statistic is a measure of how well a straight line describes the data. Values near zero suggest that the model is ineffective. *r*^{2} is frequently interpreted as the fraction of the variability explained by the explanatory variable, *X*.
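The partition of the sum of squared deviations can be written out explicitly. Writing $\hat y_i$ for the fitted value at $x_i$ and $\bar y$ for the mean response (notation assumed here, not taken from the article), the standard identity for a least-squares line with an intercept is

$$\sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 + \sum_{i=1}^{n} (y_i - \hat y_i)^2.$$

The cross term vanishes because the residuals sum to zero and are orthogonal to the *x*-values. For simple linear regression the explained fraction coincides with the squared correlation coefficient:

$$r^2 = \frac{\sum_{i=1}^{n} (\hat y_i - \bar y)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2}.$$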

## Multiple linear regression

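Linear regression can be extended to functions of two or more variables, for example

$$Z = aX + bY + c.$$

Here *X* and *Y* are explanatory variables. The values of the parameters *a*, *b* and *c* are estimated by the **method of least squares**, which minimizes the sum of squares of the residuals

$$S_r = \sum_{i=1}^{n} (a x_i + b y_i + c - z_i)^2.$$

Taking derivatives of $S_r$ with respect to *a*, *b* and *c* and setting them equal to zero leads to a system of three linear equations, the **normal equations**, for the estimates of the parameters *a*, *b* and *c*:

$$\begin{bmatrix} \sum x_i^2 & \sum x_i y_i & \sum x_i \\ \sum x_i y_i & \sum y_i^2 & \sum y_i \\ \sum x_i & \sum y_i & n \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} \sum x_i z_i \\ \sum y_i z_i \\ \sum z_i \end{bmatrix}.$$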
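A small sketch of solving the three normal equations follows. It assumes a model of the form *z* = *a*·*x* + *b*·*y* + *c* with two explanatory variables *x* and *y* (the symbol names are chosen for this example), builds the 3 × 3 system from sums over the data, and solves it with Cramer's rule, which is adequate for a system this small. The function names and data are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane(xs, ys, zs):
    """Least-squares a, b, c in z = a*x + b*y + c via the normal equations."""
    n = len(xs)
    sx, sy, sz = sum(xs), sum(ys), sum(zs)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxz = sum(x * z for x, z in zip(xs, zs))
    syz = sum(y * z for y, z in zip(ys, zs))
    A = [[sxx, sxy, sx],
         [sxy, syy, sy],
         [sx, sy, float(n)]]
    rhs = [sxz, syz, sz]
    d = det3(A)
    if d == 0:
        raise ValueError("normal equations are singular")
    solution = []
    for col in range(3):
        # Cramer's rule: replace one column by the right-hand side.
        Ac = [row[:] for row in A]
        for r in range(3):
            Ac[r][col] = rhs[r]
        solution.append(det3(Ac) / d)
    return tuple(solution)

xs = [0.0, 1.0, 0.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0]
zs = [2.0 * x - 3.0 * y + 5.0 for x, y in zip(xs, ys)]   # exact plane, no noise

a, b, c = fit_plane(xs, ys, zs)
print(a, b, c)
```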

### Polynomial fitting

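A polynomial fit is a specific type of multiple regression. The simple regression model (a first-order polynomial) can be trivially extended to higher orders. The regression model $y_i = a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_m x_i^m + \varepsilon_i$ is a system of polynomial equations of order *m* with polynomial coefficients $\{a_0, \ldots, a_m\}$. As before, we can express the model using data matrix **X**, *target* vector **Y** and parameter vector $\mathbf{a}$. The *i*th rows of **X** and **Y** contain the *x* and *y* values for the *i*th data sample. Then the model can be written as a system of linear equations:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_m \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},$$

which when using pure matrix notation remains, as before,

$$Y = X\mathbf{a} + \varepsilon,$$

and the vector of polynomial coefficients is

$$\hat{\mathbf{a}} = (X'X)^{-1}X'Y.$$

The simple regression model is a special case of a polynomial fit, having the polynomial order *m* = 1.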
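A polynomial fit of order *m* can be sketched by building the data matrix with columns 1, *x*, *x*^{2}, ..., *x*^{m} and solving the resulting normal equations. The example below solves them by Gaussian elimination with partial pivoting; this is an illustrative choice, the function names are hypothetical, and for high orders or badly scaled data a QR-based solver would be numerically safer.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def polyfit(xs, ys, m):
    """Least-squares coefficients [a_0, ..., a_m] of an order-m polynomial."""
    X = [[x ** k for k in range(m + 1)] for x in xs]   # data matrix
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(m + 1)]
           for i in range(m + 1)]
    XtY = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(m + 1)]
    return solve(XtX, XtY)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]   # exact quadratic, no noise

coeffs = polyfit(xs, ys, 2)
print(coeffs)   # close to [1, 2, 3]
```

Calling `polyfit(xs, ys, 1)` on the same data gives the simple regression line, in keeping with the remark that order 1 is the special case.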

### Coefficient of determination

*Main article: Coefficient of determination*

The coefficient of determination, *R*^{2}, is a measure of the proportion of variability explained by, or due to the regression (linear relationship), in a sample of paired data. The value of the coefficient of determination can range between zero and one, where a value close to zero suggests a poor model.

## Examples

Linear regression is widely used in biological, behavioural and social sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines.

### Medicine

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are considered to be more trustworthy than a regression analysis.

### Finance

Linear regression underlies the capital asset pricing model, and the concept of using Beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the Beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.

## References

1. A.M. Legendre. *Nouvelles méthodes pour la détermination des orbites des comètes* (1805). "Sur la Méthode des moindres quarrés" appears as an appendix.
2. C.F. Gauss. *Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum* (1809).
3. C.F. Gauss. *Theoria combinationis observationum erroribus minimis obnoxiae* (1821/1823).
4. Francis Galton. "Typical laws of heredity", Nature 15 (1877), 492-495, 512-514, 532-533. *(Galton uses the term "reversion" in this paper, which discusses the size of peas.)*
5. Francis Galton. Presidential address, Section H, Anthropology (1885). *(Galton uses the term "regression" in this paper, which discusses the height of humans.)*
6. G. Udny Yule. "On the Theory of Correlation", J. Royal Statist. Soc., 1897, p. 812-54.
7. Karl Pearson, G. U. Yule, Norman Blanchard, and Alice Lee. "The Law of Ancestral Heredity", *Biometrika* (1903).
8. R.A. Fisher. "The goodness of fit of regression formulae, and the distribution of regression coefficients", J. Royal Statist. Soc., 85, 597-612 (1922).
9. R.A. Fisher. *Statistical Methods for Research Workers* (1925).

### Additional sources

- Cohen, J., Cohen P., West, S.G., & Aiken, L.S. (2003). *Applied multiple regression/correlation analysis for the behavioral sciences* (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
- Charles Darwin. *The Variation of Animals and Plants under Domestication* (1869). *(Chapter XIII describes what was known about reversion in Galton's time. Darwin uses the term "reversion".)*
- Draper, N.R. and Smith, H. *Applied Regression Analysis*. Wiley Series in Probability and Statistics (1998).
- Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," *Journal of the Anthropological Institute*, 15:246-263 (1886).

### See also

- Econometrics
- Regression analysis
- Robust regression
- Least squares
- Median-median line
- Instrumental variable
- Hierarchical linear modeling
- Empirical Bayes methods

## External links

- Visual Least Squares: An interactive, visual flash demonstration of how linear regression works.
- In Situ Adaptive Tabulation: Combining many linear regressions to approximate any nonlinear function.
- Earliest Known Uses of Some of the Words of Mathematics. See the entries for "error", "Gauss-Markov theorem", "method of least squares", and "regression".
- Online linear regression calculator.
- Online regression by eye (simulation).
- Leverage Effect Interactive simulation to show the effect of outliers on the regression results
- Linear regression as an optimisation problem
- Visual Statistics with Multimedia
- Multiple Regression by Elmer G. Wiens. Online multiple and restricted multiple regression package.
- ZunZun.com Online curve and surface fitting.


This page uses Creative Commons licensed content from Wikipedia.