# Maximum likelihood

34,142pages on
this wiki

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Maximum likelihood estimation (MLE) is a popular statistical method used for fitting a mathematical model to some data. The modeling of real world data using estimation by maximum likelihood offers a way of tuning the free parameters of the model to provide a good fit.

The method was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922. It has widespread applications in various fields, including:

The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, suppose you are interested in the heights of Americans. You have a sample of some number of Americans, but not the entire population, and record their heights. Further, you are willing to assume that heights are normally distributed with some unknown mean and variance. The sample mean is then the maximum likelihood estimator of the population mean, and the sample variance is a close approximation to the maximum likelihood estimator of the population variance (see examples below).

For a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them. Maximum likelihood estimation gives a unique and easy way to determine solution in the case of the normal distribution and many other problems, although in very complex problems this may not be the case. If a uniform prior distribution is assumed over the parameters, the maximum likelihood estimate coincides with the most probable values thereof.

## PrerequisitesEdit

The following discussion assumes that readers are familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes they are familiar with standard basic techniques of maximizing continuous real-valued functions, such as using differentiation to find a function's maxima.

## Principles Edit

Consider a family $D_\theta$ of probability distributions parameterized by an unknown parameter $\theta$ (which could be vector-valued), associated with either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted as $f_\theta$. We draw a sample $x_1,x_2,\dots,x_n$ of n values from this distribution, and then using $f_\theta$ we compute the (multivariate) probability density associated with our observed data, $f_\theta(x_1,\dots,x_n).\,\!$

As a function of θ with x1, ..., xn fixed, this is the likelihood function

$\mathcal{L}(\theta) = f_{\theta}(x_1,\dots,x_n).\,\!$

The method of maximum likelihood estimates θ by finding the value of θ that maximizes $\mathcal{L}(\theta)$. This is the maximum likelihood estimator (MLE) of θ:

$\widehat{\theta} = \underset{\theta}{\operatorname{arg\ max}}\ \mathcal{L}(\theta).$

From a simple point of view, the outcome of a maximum likelihood analysis is the maximum likelihood estimate. This can be supplemented by an approximation for the covariance matrix of the MLE, where this approximation is derived from the likelihood function. A more complete outcome from a maximum likelihood analysis would be the likelihood function itself, which can be used to construct improved versions of confidence intervals compared to those obtained from the approximate variance matrix. See also Likelihood Ratio Test

Commonly, one assumes that the data drawn from a particular distribution are independent, identically distributed (iid) with unknown parameters. This considerably simplifies the problem because the likelihood can then be written as a product of n univariate probability densities:

$\mathcal{L}(\theta) = \prod_{i=1}^n f_{\theta}(x_i)$

and since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum:

$\mathcal{L}^*(\theta) = \sum_{i=1}^n \log f_{\theta}(x_i).$

The maximum of this expression can then be found numerically using various optimization algorithms.

This contrasts with seeking an unbiased estimator of θ, which may not necessarily yield the MLE but which will yield a value that (on average) will neither tend to over-estimate nor under-estimate the true value of θ.

Note that the maximum likelihood estimator may not be unique, or indeed may not even exist.

## PropertiesEdit

### Functional invariance Edit

The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators, as the corresponding component of the MLE of the complete parameter. Consistent with this, if $\widehat{\theta}$ is the MLE for θ, and if g is any function of θ, then the MLE for α = g(θ) is by definition

$\widehat{\alpha} = g(\widehat{\theta}).\,\!$

It maximizes the so-called profile likelihood:

$\bar{L}(\alpha) = \sup_{\theta: \alpha = g(\theta)} L(\theta).$

### BiasEdit

For small numbers of samples, the bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution); thus, the sample size is 1. If n is unknown, then the maximum-likelihood estimator $\hat{n}$ of n is the number m on the drawn ticket. (The likelihood is 0 for n < m, 1/n for nm, and this is greatest when n = m. Note that the maximum likelihood estimate of n occurs at the lower extreme of possible values {m, m+1, ...}, rather than somewhere in the "middle" of the range of possible values, which would result in less bias.) The Expected value of the number m on the drawn ticket, and therefore the expected value of $\hat{n}$ , is (n+1)/2. As a result, the maximum likelihood estimator for n will systematically underestimate n by (n-1)/2 with a sample size of 1.

### Asymptotics Edit

In many cases, estimation is performed using a set of independent identically distributed measurements. These may correspond to distinct elements from a random sample, repeated observations, etc. In such cases, it is of interest to determine the behavior of a given estimator as the number of measurements increases to infinity, referred to as asymptotic behaviour.

Under certain (fairly weak) regularity conditions, which are listed below, the MLE exhibits several characteristics which can be interpreted to mean that it is "asymptotically optimal". These characteristics include:

• The MLE is asymptotically unbiased, i.e., its bias tends to zero as the sample size increases to infinity.
• The MLE is asymptotically efficient, i.e., it achieves the Cramér-Rao lower bound when the sample size tends to infinity. This means that no asymptotically unbiased estimator has lower asymptotic mean squared error than the MLE.
• The MLE is asymptotically normal. As the sample size increases, the distribution of the MLE tends to the Gaussian distribution with mean $\theta$ and covariance matrix equal to the inverse of the Fisher information matrix.

Since the Cramér-Rao bound only speaks of unbiased estimators while the maximum likelihood estimator is usually biased, asymptotic efficiency as defined here does not mean anything: perhaps there are other nearly unbiased estimators with much smaller variance. However, it can be shown that among all regular estimators, which are estimators which have an asymptotic distribution which is not dramatically disturbed by small changes in the parameters, the asymptotic distribution of the maximum likelihood estimator is the best possible, i.e., most concentrated. [1]

Some regularity conditions which ensure this behavior are:

1. The first and second derivatives of the log-likelihood function must be defined.
2. The Fisher information matrix must not be zero, and must be continuous as a function of the parameter.
3. The maximum likelihood estimator is consistent.

By the mathematical meaning of the word asymptotic, asymptotic properties are properties which only approached in the limit of larger and larger samples: they are approximately true when the sample size is large enough. The theory does not tell us how large the sample needs to be in order to obtain a good enough degree of approximation. Fortunately, in practice they often appear to be approximately true, when the sample size is moderately large. So in practice, inference about the estimated parameters is often based on the asymptotic Gaussian distribution of the MLE. When we do this, the Fisher information matrix is usefully estimated by the observed information matrix.

Some cases where the asymptotic behaviour described above does not hold are outlined next.

Estimate on boundary. Sometimes the maximum likelihood estimate lies on the boundary of the set of possible parameters, or (if the boundary is not, strictly speaking, allowed) the likelihood gets larger and larger as the parameter approaches the boundary. Standard asymptotic theory needs the assumption that the true parameter value lies away from the boundary. If we have enough data, the maximum likelihood estimate will keep away from the boundary too. But with smaller samples, the estimate can lie on the boundary. In such cases, the asymptotic theory clearly does not give a practically useful approximation. Examples here would be variance-component models, where each component of variance, σ2, must satisfy the constraint σ2 ≥0.

Data boundary parameter-dependent. For the theory to apply in a simple way, the set of data values which has positive probability (or positive probability density) should not depend on the unknown parameter. A simple example where such parameter-dependence does hold is the case of estimating θ from a set of independent identically distributed when the common distribution is uniform on the range (0,θ). For estimation purposes the relevant range of θ is such that θ cannot be less than the largest observation. In this instance the maximum likelihood estimate exists and has some good behaviour, but the asymptotics are not as outlined above.

Nuisance parameters. For maximum likelihood estimations, a model may have a number of nuisance parameters. For the asymptotic behaviour outlined to hold, the number of nuisance parameters should not increase with the number of observations (the sample size). A well-known example of this case is where observations occur as pairs, where the observations in each pair have a different (unknown) mean but otherwise the observations are independent and Normally distributed with a common variance. Here for 2N observations, there are N+1 parameters. It is well-known that the maximum likelihood estimate for the variance does not converge to the true value of the variance.

Increasing information. For the asymptotics to hold in cases where the assumption of independent identically distributed observations does not hold, a basic requirement is that the amount of information in the data increases indefinitely as the sample size increases. Such a requirement may not be met if either there is too much dependence in the data (for example, if new observations are essentially identical to existing observations), or if new independent observations are subject to an increasing observation error.

## ExamplesEdit

### Discrete distribution, finite parameter spaceEdit

Consider tossing an unfair coin 80 times (i.e., we sample something like x1=H, x2=T, ..., x80=T, and count the number of HEADS "H" observed). Call the probability of tossing a HEAD p, and the probability of tossing TAILS 1-p (so here p is θ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p=1/3, one which gives HEADS with probability p=1/2 and another which gives HEADS with probability p=2/3. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can calculate which coin has the largest likelihood, given the data that we observed. The likelihood function (defined below) takes one of three values:

$\begin{matrix} \Pr(\mathrm{H} = 49 \mid p=1/3) & = & \binom{80}{49}(1/3)^{49}(1-1/3)^{31} \approx 0.000 \\ &&\\ \Pr(\mathrm{H} = 49 \mid p=1/2) & = & \binom{80}{49}(1/2)^{49}(1-1/2)^{31} \approx 0.012 \\ &&\\ \Pr(\mathrm{H} = 49 \mid p=2/3) & = & \binom{80}{49}(2/3)^{49}(1-2/3)^{31} \approx 0.054 \\ \end{matrix}$

We see that the likelihood is maximized when p=2/3, and so this is our maximum likelihood estimate for p.

### Discrete distribution, continuous parameter spaceEdit

Now suppose we had only one coin but its p could have been any value 0 ≤ p ≤ 1. We must maximize the likelihood function:

$L(\theta) = f_D(\mathrm{H} = 49 \mid p) = \binom{80}{49} p^{49}(1-p)^{31}$

over all possible values 0 ≤ p ≤ 1.

One way to maximize this function is by differentiating with respect to p and setting to zero:

\begin{align} {0}&{} = \frac{\partial}{\partial p} \left( \binom{80}{49} p^{49}(1-p)^{31} \right) \\ & {}\propto 49p^{48}(1-p)^{31} - 31p^{49}(1-p)^{30} \\ & {}= p^{48}(1-p)^{30}\left[ 49(1-p) - 31p \right] \\ & {}= p^{48}(1-p)^{30}\left[ 49 - 80p \right] \end{align}

which has solutions p=0, p=1, and p=49/80. The solution which maximizes the likelihood is clearly p=49/80 (since p=0 and p=1 result in a likelihood of zero). Thus we say the maximum likelihood estimator for p is 49/80.

This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli trials resulting in t 'successes'.

### Continuous distribution, continuous parameter spaceEdit

For the normal distribution $\mathcal{N}(\mu, \sigma^2)$ which has probability density function

$f(x\mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\ \sigma\ } \exp{\left(-\frac {(x-\mu)^2}{2\sigma^2} \right)},$

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

$f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \prod_{i=1}^{n} f( x_{i}\mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{ \sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}\right),$

or more conveniently:

$f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right)$,

where $\bar{x}$ is the sample mean.

This family of distributions has two parameters: θ=(μ,σ), so we maximize the likelihood, $\mathcal{L} (\mu,\sigma) = f(x_1,\ldots,x_n \mid \mu, \sigma)$, over both parameters simultaneously, or if possible, individually.

Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. (Note: the log-likelihood is closely related to information entropy and Fisher information.)

$0 = \frac{\partial}{\partial \mu} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right) \right)$

$= \frac{\partial}{\partial \mu} \left( \log\left( \frac{1}{2\pi\sigma^2} \right)^{n/2} - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right)$

$= 0 - \frac{-2n(\bar{x}-\mu)}{2\sigma^2}$

which is solved by

$\hat\mu = \bar{x} = \sum^{n}_{i=1}x_i/n$.

This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,

$E \left[ \widehat\mu \right] = \mu,$

which means that the maximum-likelihood estimator $\widehat\mu$ is unbiased.

Similarly we differentiate the log likelihood with respect to σ and equate to zero:

$0 = \frac{\partial}{\partial \sigma} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left(-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}\right) \right)$

$= \frac{\partial}{\partial \sigma} \left( \frac{n}{2}\log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right)$

$= -\frac{n}{\sigma} + \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{\sigma^3}$

which is solved by

$\widehat\sigma^2 = \sum_{i=1}^n(x_i-\widehat{\mu})^2/n$.

Inserting $\widehat\mu$ we obtain

$\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^n x_i^2 -\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n x_i x_j$.

To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables (statistical error) $\delta_i \equiv \mu - x_i$. Expressing the estimate in these variables yields

$\widehat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (\mu - \delta_i)^2 -\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n (\mu - \delta_i)(\mu - \delta_j)$.

Simplifying the expression above, utilizing the facts that $E\left[\delta_i\right] = 0$ and $E[\delta_i^2] = \sigma^2$, allows us to obtain

$E \left[ \widehat{\sigma^2} \right]= \frac{n-1}{n}\sigma^2$.

This means that the estimator $\widehat\sigma$ is biased (However, $\widehat\sigma$ is consistent).

Formally we say that the maximum likelihood estimator for $\theta=(\mu,\sigma^2)$ is:

$\widehat{\theta} = \left(\widehat{\mu},\widehat{\sigma}^2\right).$

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.

## Non-independent variables Edit

It may be the case that variables are correlated, in which case they are not independent. Two random variables X and Y are only independent if their joint probability density function is the product of the individual probability density functions, i.e.

$f(x,y)=f(x)f(y)\,$

Suppose one constructs an order $n\,$ Gaussian vector out of random variables $(x_1,\ldots,x_n)\,$, where each variable has means given by $(\mu_1, \ldots, \mu_n)\,$. Furthermore, let the covariance matrix be denoted by $\Sigma,$

The joint probability density function of these $n$ random variables is then given by:

$f(x_1,\ldots,x_n)=\frac{1}{2\pi\sqrt{\text{det}(\Sigma)}} \exp\left( -\frac{1}{2} \left[x_1-\mu_1,\ldots,x_n-\mu_n\right]\Sigma^{-1} \left[x_1-\mu_1,\ldots,x_n-\mu_n\right]^T \right)$

In the two variable case, the joint probability density function is given by:

$f(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left[ -\frac{1}{2(1-\rho^2)} \left(\frac{(x-\mu_x)^2}{\sigma_x^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right) \right]$

In this and other cases where a joint density function exists, the likelihood function is defined as above, under Principles, using this density.