# Oja learning rule


Oja's learning rule, or simply Oja's rule, named after the Finnish computer scientist Erkki Oja, is a model of how neurons in the brain or in artificial neural networks change connection strength, or learn, over time. It is a modification of the standard Hebb's rule (see Hebbian learning) that, through multiplicative normalization, avoids the unbounded weight growth of Hebb's rule and yields an algorithm for principal components analysis. This is a computational form of an effect which is believed to happen in biological neurons.

A key problem in artificial neural networks is how neurons learn. The central hypothesis is that learning is based on changing the connections, or synaptic weights, between neurons according to specific learning rules. In unsupervised learning, the changes in the weights depend only on the inputs and the output of the neuron. A popular assumption is the Hebbian learning rule, according to which the change in a given synaptic weight is proportional to both the pre-synaptic input and the output activity of the post-synaptic neuron.

The Oja learning rule (Oja, 1982) is a mathematical formalization of this Hebbian learning rule, such that over time the neuron actually learns to compute a principal component of its input stream.

## Theory

Oja's rule requires a number of simplifications to derive, but in its final form it is demonstrably stable, unlike Hebb's rule. It is a single-neuron special case of the Generalized Hebbian Algorithm. However, Oja's rule can also be generalized in other ways to varying degrees of stability and success.

### Formula

Oja's rule defines the change in presynaptic weights $\mathbf{w}$, given the output response $y$ of a neuron to its inputs $\mathbf{x}$, to be

$\,\Delta \mathbf{w} ~ = ~ \mathbf{w}_{n+1}-\mathbf{w}_{n} ~ = ~ \eta \, y_{n} (\mathbf{x}_{n} - y_{n}\mathbf{w}_{n}),$

where $\eta$ is the learning rate, which can also change with time. Bold symbols are vectors and $n$ indexes the discrete time steps. The rule can also be written in continuous time as

$\,\frac{d\mathbf{w}}{d t} ~ = ~ \eta \, y(t) (\mathbf{x}(t) - y(t)\mathbf{w}(t)).$
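As a minimal sketch, the discrete update can be written as a small Python function. It assumes a linear neuron, so that $y$ is the inner product of the weights and the input; the function name `oja_step` and the sample values are illustrative only.

```python
def oja_step(w, x, eta):
    """One discrete Oja update: w + eta * y * (x - y * w)."""
    # Assumption: linear neuron, so the output is the inner product w . x.
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]


w = [1.0, 0.0]   # hypothetical starting weights
x = [1.0, 1.0]   # hypothetical input vector
w_next = oja_step(w, x, eta=0.1)
print(w_next)
```

With these values $y = 1$, so the first weight is left unchanged and the second moves from 0 to about 0.1.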

### Derivation

The simplest learning rule known is Hebb's rule, which states in conceptual terms that neurons that fire together, wire together. In vector form, as a difference equation, it is written

$\,\Delta\mathbf{w} ~ = ~ \eta\, y(\mathbf{x}_n) \mathbf{x}_{n}$,

or in component form with implicit $n$-dependence,

$\,w_{i}(n+1) ~ = ~ w_{i} + \eta\, y(\mathbf{x}) x_{i}$,

where $y(\mathbf{x}_n)$ is again the output, this time written as explicitly dependent on its input vector $\mathbf{x}_n$.

Under Hebb's rule with a positive learning rate, the synaptic weights grow without bound. We can stop this by normalizing the weights so that each weight's magnitude is restricted between 0, corresponding to no weight, and 1, corresponding to being the only input neuron with any weight. Mathematically, this takes the form

$\,w_i (n+1) ~ = ~ \frac{w_i + \eta\, y(\mathbf{x}) x_i}{\left(\sum_{j=1}^m [w_j + \eta\, y(\mathbf{x}) x_j]^p \right)^{1/p}}$.

Note that in Oja's original paper,[1] p=2, corresponding to quadrature (root sum of squares), which is the familiar Cartesian normalization rule. However, any type of normalization, even linear, will give the same result without loss of generality.
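The normalized update can be sketched in a few lines of Python, with $p$ left as a parameter. This is an illustration rather than a reference implementation: the linear output, the function names and the use of absolute values inside the norm (so that odd $p$ also gives a valid norm) are assumptions added here.

```python
def normalized_hebb_step(w, x, eta, p=2):
    """Hebbian update followed by division by the p-norm of the result."""
    y = sum(wi * xi for wi, xi in zip(w, x))          # linear output (assumed)
    u = [wi + eta * y * xi for wi, xi in zip(w, x)]   # plain Hebbian step
    # abs() is an added assumption so that odd p also gives a valid norm.
    norm = sum(abs(ui) ** p for ui in u) ** (1.0 / p)
    return [ui / norm for ui in u]


def p_norm(w, p):
    """p-norm of a weight vector."""
    return sum(abs(wi) ** p for wi in w) ** (1.0 / p)


w = [0.6, 0.8]    # hypothetical weights
x = [1.0, -0.5]   # hypothetical input
w2 = normalized_hebb_step(w, x, eta=0.1, p=2)
w1 = normalized_hebb_step(w, x, eta=0.1, p=1)
print(p_norm(w2, 2), p_norm(w1, 1))
```

After each step the chosen $p$-norm of the weight vector is 1, up to floating-point rounding.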

Our next step is to expand this into a Taylor series for a small learning rate $| \eta | \ll 1$, giving

$\,w_i (n+1) ~ = ~ \frac{w_i}{\left( \sum_j w_j^p \right)^{1/p}} ~ + ~ \eta \left( \frac{y x_i}{\left(\sum_j w_j^p \right)^{1/p}} - \frac{w_i \sum_j y x_j w_j}{\left(\sum_j w_j^p \right)^{(1 + 1/p)}} \right) ~ + ~ O(\eta^2)$.

For small $\eta$, the higher-order terms $O(\eta^2)$ are negligible. We now assume a linear neuron, that is, the output of the neuron is equal to the sum of the products of each input and its synaptic weight, or

$\,y(\mathbf{x}) ~ = ~ \sum_{j=1}^m x_j w_j$.

We also specify that our weights normalize to 1, which will be a necessary condition for stability, so

$\,| \mathbf{w} | ~ = ~ \left( \sum_{j=1}^m w_j^p \right)^{1/p} ~ = ~ 1$,

which, when substituted into our expansion, gives Oja's rule, or

$\,w_i (n+1) ~ = ~ w_i + \eta\, y(x_i - w_i y)$.
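The first-order agreement between the explicitly normalized Hebbian step and Oja's rule can be checked numerically. The sketch below assumes $p = 2$, a linear neuron and a unit-norm starting vector; all names and values are illustrative. If the two updates differ only at order $\eta^2$, shrinking $\eta$ by a factor of 10 should shrink the gap by roughly a factor of 100.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def normalized_hebb_step(w, x, eta):
    """Hebbian step followed by Euclidean (p = 2) normalization."""
    y = dot(w, x)
    u = [wi + eta * y * xi for wi, xi in zip(w, x)]
    norm = math.sqrt(dot(u, u))
    return [ui / norm for ui in u]


def oja_step(w, x, eta):
    """Oja's rule: w + eta * y * (x - y * w)."""
    y = dot(w, x)
    return [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]


w = [0.6, 0.8]    # unit-norm weights, as the derivation requires
x = [1.0, -0.5]   # hypothetical input


def gap(eta):
    """Largest component-wise difference between the two updates."""
    a = normalized_hebb_step(w, x, eta)
    b = oja_step(w, x, eta)
    return max(abs(ai - bi) for ai, bi in zip(a, b))


ratio = gap(0.01) / gap(0.001)
print(ratio)   # close to 100 if the gap scales as eta**2
```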

## The simple neuron model

Consider a simplified neuron model that has $n$ inputs $x_1, \ldots, x_n$, each with a weight $w_i$. The neuron first computes the weighted sum of the inputs,

$y = \sum_{i=1}^n w_i x_i \qquad (1)$

and then passes this sum to the next neurons in the network. The problem of learning is how to change the weights $w_i$ when a stream of input vectors ${\mathbf x} = (x_1, ..., x_n)\ ,$ one at a time, are given to this neuron as inputs.

### From Hebbian learning to the Oja learning rule

Using the mathematical notation above, the Hebbian learning principle could be stated as

$\Delta w_i = \alpha x_i y, \; i = 1,...,n$

where $\Delta w_i$ denotes the change in the value of the weight $w_i$, $x_i$ is the input coming through the weight $w_i$, and $y$ is the output of the neuron as given in equation (1). The coefficient $\alpha$ is called the learning rate and it is typically small. Because of this, one input vector (whose $i$-th component is the term $x_i$) causes only a small instantaneous change in the weights, but when the small changes accumulate over time, the weights will settle to some values.

The rule above represents the Hebbian principle, because the update term is the product of the input and the output. However, this learning rule has a severe problem: there is nothing to stop the connections from growing all the time, finally leading to very large values. There should be another term to balance this growth. In many neuron models, another term representing "forgetting" has been used: the value of the weight itself should be subtracted from the right hand side. The central idea in the Oja learning rule is to make this forgetting term proportional, not only to the value of the weight, but also to the square of the output of the neuron. The Oja rule reads:

$\Delta w_i = \alpha (x_i y - y^2 w_i), \; i= 1,...,n.$

Now, the forgetting term balances the growth of the weight. The squared output $y^2$ guarantees that the larger the output of the neuron becomes, the stronger is this balancing effect.
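The effect of the forgetting term can be illustrated with a small simulation that feeds the same random input stream to a plain Hebbian neuron and to an Oja neuron. This is a sketch under assumed settings (two zero-mean Gaussian inputs, a constant learning rate of 0.005, 2000 steps), none of which come from the text above.

```python
import math
import random

random.seed(1)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


alpha = 0.005            # assumed constant learning rate
w_hebb = [0.5, 0.5]
w_oja = [0.5, 0.5]
for _ in range(2000):
    # Hypothetical zero-mean inputs with standard deviations 2 and 1.
    x = [random.gauss(0.0, 2.0), random.gauss(0.0, 1.0)]
    y_h = dot(w_hebb, x)
    w_hebb = [wi + alpha * xi * y_h for wi, xi in zip(w_hebb, x)]
    y_o = dot(w_oja, x)
    w_oja = [wi + alpha * (xi * y_o - y_o * y_o * wi) for wi, xi in zip(w_oja, x)]

norm_hebb = math.sqrt(dot(w_hebb, w_hebb))
norm_oja = math.sqrt(dot(w_oja, w_oja))
print(norm_hebb, norm_oja)
```

In this setup the Hebbian weight norm grows by many orders of magnitude, while the norm of the Oja weights stays close to one.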

## Oja learning rule and principal component analysis

A mathematical analysis of the Oja learning rule stated above goes as follows (a much more thorough and rigorous analysis appears in Oja, 1983). First, change into vector notation, in which $\mathbf x$ is the column vector with elements $x_i$ and $\mathbf w$ is the column vector with elements $w_i$. They are called the input vector and the weight vector, respectively. In vector-matrix notation, equation (1) then reads

$y = {\mathbf w}^T{\mathbf x} = {\mathbf x}^T{\mathbf w}$

where T means the transpose, changing a column vector into a row vector. This is the well-known inner product between two vectors, defined as the sum of products of their elements (see equation (1)).

Next, write the Oja rule in vector notation:

$\Delta {\mathbf w} = \alpha ({\mathbf x} y - y^2 {\mathbf w}).$

Then, substitute the inner-product expression for $y$ into the vector form of the rule:

$\Delta {\mathbf w} = \alpha ({\mathbf x} {\mathbf x}^T{\mathbf w} - {\mathbf w}^T{\mathbf x}{\mathbf x}^T {\mathbf w}{\mathbf w}).$

This is the incremental change for just one input vector ${\mathbf x}\ .$ When the algorithm is run for a long time, changing the input vector at every step, one can look at the average behaviour. An especially interesting question is what is the value of the weights when the average change in the weight is zero. This is the point of convergence of the algorithm.

Averaging the right hand side over the ${\mathbf x}\ ,$ conditional on ${\mathbf w}$ staying constant, and setting this to zero gives the following equation for the weight vector at the point of convergence:

${\mathbf C}{\mathbf w} - {\mathbf w}^T{\mathbf C}{\mathbf w}{\mathbf w} = 0$

where the matrix ${\mathbf C}$ is the average of ${\mathbf x}{\mathbf x}^T\ .$ Assuming the input vectors have zero means, this is in fact the well-known covariance matrix of the inputs.

Considering that the quadratic form ${\mathbf w}^T{\mathbf C}{\mathbf w}$ is a scalar, this equation is the eigenvalue-eigenvector equation for the covariance matrix ${\mathbf C}$, with ${\mathbf w}^T{\mathbf C}{\mathbf w}$ playing the role of the eigenvalue. This analysis shows that if the weights converge in the Oja learning rule, then the weight vector becomes one of the eigenvectors of the input covariance matrix, and the output of the neuron becomes the corresponding principal component. Principal components are defined as the inner products between the eigenvectors and the input vectors. For this reason, the simple neuron learning by the Oja rule becomes a principal component analyzer, that is, it performs principal component analysis (PCA).
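The convergence condition can be checked by hand on a small example. The sketch below uses a hypothetical $2 \times 2$ covariance matrix with eigenvalues 4 and 2 and evaluates the left-hand side of the condition at several candidate weight vectors; the matrix and the helper names are assumptions made for illustration.

```python
import math

C = [[3.0, 1.0], [1.0, 3.0]]   # hypothetical covariance matrix


def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def residual(w):
    """Return C w - (w^T C w) w together with the scalar w^T C w."""
    Cw = matvec(C, w)
    lam = dot(w, Cw)
    return [cwi - lam * wi for cwi, wi in zip(Cw, w)], lam


s = 1.0 / math.sqrt(2.0)
res_q1, lam_q1 = residual([s, s])            # unit eigenvector, eigenvalue 4
res_q2, lam_q2 = residual([s, -s])           # unit eigenvector, eigenvalue 2
res_other, _ = residual([1.0, 0.0])          # not an eigenvector
res_scaled, _ = residual([2 * s, 2 * s])     # eigen-direction, but norm 2
print(res_q1, lam_q1, res_q2, lam_q2, res_other, res_scaled)
```

The residual vanishes only for the unit-norm eigenvectors, where the quadratic form equals the eigenvalue. A rescaled eigenvector does not satisfy the condition, which is consistent with the norm of the weight vector tending to one.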

Although not shown here, it has been proven that it is the first principal component that the neuron will find, and the norm of the weight vector tends to one. For details, see Oja (1983, 1992). This analysis is based on stochastic approximation theory (see e.g. Kushner and Clark, 1978) and depends on a set of mathematical assumptions. In particular, the learning rate $\alpha$ cannot be a constant but has to decrease over time. A typical decreasing sequence is

$\alpha(t) = 1/t .$
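A small simulation illustrates this behaviour. It is only a sketch under stated assumptions: the inputs are zero-mean Gaussian with a hypothetical covariance whose eigenvalues are 4 and 2, and the learning rate is the shifted sequence $\alpha(t) = 1/(t_0 + t)$ with $t_0 = 100$. The offset is an assumption added here to keep the first steps small; the sequence still decreases like $1/t$.

```python
import math
import random

random.seed(0)
s = 1.0 / math.sqrt(2.0)
q1, q2 = [s, s], [s, -s]          # eigenvectors of the assumed covariance
sd1, sd2 = 2.0, math.sqrt(2.0)    # square roots of the eigenvalues 4 and 2


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


w = [1.0, 0.0]   # starting weights, 45 degrees away from q1
t0 = 100         # assumed offset; keeps the first steps small
for t in range(1, 20001):
    a1, a2 = random.gauss(0.0, sd1), random.gauss(0.0, sd2)
    x = [a1 * q1[i] + a2 * q2[i] for i in range(2)]   # zero-mean input
    alpha = 1.0 / (t0 + t)
    y = dot(w, x)
    w = [wi + alpha * (xi * y - y * y * wi) for wi, xi in zip(w, x)]

alignment = abs(dot(w, q1))        # 1.0 means parallel to the first eigenvector
norm = math.sqrt(dot(w, w))
print(alignment, norm)
```

In this setting the weight vector ends up nearly parallel to the first eigenvector, with a norm close to one.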

## Extensions of the Oja learning rule

This learning rule has been extended in several directions. Two extensions are briefly reviewed here: the Oja rule for several parallel neurons, and nonlinearities in the rule.

### Oja rule for several neurons

It is possible to define this learning rule for a layer of parallel neurons, each receiving the same input vector $\mathbf x$. Then, in order to prevent all the neurons from learning the same thing, parallel connections between them are needed. The result is that a subset or all of the principal components are learned. Such neural layers have been considered by Oja (1983, 1992), Sanger (1989), and Földiák (1989).
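One of the cited approaches, Sanger's Generalized Hebbian Algorithm, can be sketched as follows. In this variant neuron $i$ subtracts the part of the input already reconstructed by neurons $1, \ldots, i$, which decouples the neurons. The input statistics (independent axes with variances 9, 4 and 1), the constant learning rate and the variable names are assumptions made for illustration, not settings taken from the cited papers.

```python
import random

random.seed(0)
sds = [3.0, 2.0, 1.0]   # assumed input standard deviations along the axes


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Two neurons, three inputs; row i holds the weights of neuron i.
W = [[0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5]]
eta = 0.001             # assumed constant learning rate
for _ in range(30000):
    x = [random.gauss(0.0, sd) for sd in sds]
    y = [dot(row, x) for row in W]
    new_W = []
    for i, row in enumerate(W):
        # Reconstruction of the input from neurons 0..i (lower-triangular coupling).
        back = [sum(y[k] * W[k][j] for k in range(i + 1)) for j in range(3)]
        new_W.append([row[j] + eta * y[i] * (x[j] - back[j]) for j in range(3)])
    W = new_W

print(W)
```

With these inputs the principal directions are the coordinate axes, so the first row should end up close to $(\pm 1, 0, 0)$ and the second close to $(0, \pm 1, 0)$.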

### Nonlinear Hebbian learning and independent component analysis

Independent component analysis (ICA) is a technique that is related to PCA, but is potentially much more powerful: instead of finding uncorrelated components as in PCA, statistically independent components are found, if they exist in the original data. It turns out that quite small changes in the Oja rule can produce independent, instead of principal, components in such a case. What is needed is to change the linear output factor $y$ in the Hebbian term to a suitable nonlinearity, such as $y^3$. The forgetting term must also be changed accordingly. The ensuing learning rule

$\Delta {\mathbf w} = \alpha ({\mathbf x} y^3 - {\mathbf w})$

can be shown to give one of the independent hidden factors under suitable assumptions (Hyvärinen and Oja, 1998). The main requirement is that prior to entering this algorithm, the input vectors have to be zero mean and whitened so that their covariance matrix $\mathbf C$ is equal to the identity matrix. This can be achieved with a simple linear transformation, or by a variant of the Oja rule (see also Hyvärinen et al, 2001).

### Stability and PCA

In analyzing the convergence of a single neuron evolving by Oja's rule, one extracts the first principal component, or feature, of a data set. Furthermore, with extensions using the Generalized Hebbian Algorithm, one can create a multi-Oja neural network that can extract as many features as desired, allowing for principal components analysis.

A principal component $a_j$ is extracted from a dataset $\mathbf{x}$ through some associated vector $\mathbf{q}_j$, or $a_j = \mathbf{q}_j \cdot \mathbf{x}$, and we can restore our original dataset by taking

$\mathbf{x} ~ = ~ \sum_j a_j \mathbf{q}_j$.
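A short worked example of this reconstruction, assuming the vectors $\mathbf{q}_j$ form a complete orthonormal set (which is what makes the sum recover $\mathbf{x}$ exactly). The two basis vectors and the data point are hypothetical.

```python
import math

s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [s, -s]]    # assumed orthonormal vectors q_1, q_2
x = [3.0, 1.0]           # hypothetical data point

# Components a_j = q_j . x
a = [sum(qi * xi for qi, xi in zip(q, x)) for q in Q]

# Reconstruction x = sum_j a_j q_j
x_rec = [sum(a[j] * Q[j][i] for j in range(len(Q))) for i in range(len(x))]
print(a, x_rec)
```

Here $a_1 = 4/\sqrt{2}$ and $a_2 = 2/\sqrt{2}$, and the weighted sum of the two vectors returns the original point up to rounding.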

In the case of a single neuron trained by Oja's rule, we find the weight vector converges to $\mathbf{q}_1$, or the first principal component, as time or number of iterations approaches infinity. We can also define, given a set of input vectors $X_i$, that its correlation matrix $R_{ij} = X_i X_j$ has an associated eigenvector given by $\mathbf{q}_j$ with eigenvalue $\lambda_j$. The variance of outputs of our Oja neuron $\sigma^2(n) = \langle y^2(n) \rangle$ then converges with time iterations to the principal eigenvalue, or

$\lim_{n\rightarrow\infty} \sigma^2(n) ~ = ~ \lambda_1$.

These results are derived using Lyapunov function analysis, and they show that Oja's neuron necessarily converges on strictly the first principal component if certain conditions are met in our original learning rule. Most importantly, our learning rate η is allowed to vary with time, but only such that its sum is divergent but its power sum is convergent, that is

$\sum_{n=1}^\infty \eta(n) = \infty, ~~~ \sum_{n=1}^\infty \eta(n)^p < \infty, ~~~ p > 1$.
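The sequence $\eta(n) = 1/n$ is one instance that meets both conditions, here checked with $p = 2$. The sketch below compares partial sums at two cut-offs; the cut-offs and the function name are arbitrary choices made for illustration.

```python
import math


def partial_sums(N, p=2):
    """Partial sums of eta(n) = 1/n and of eta(n)**p up to n = N."""
    s1 = sum(1.0 / n for n in range(1, N + 1))
    sp = sum((1.0 / n) ** p for n in range(1, N + 1))
    return s1, sp


s1_small, sp_small = partial_sums(10**3)
s1_large, sp_large = partial_sums(10**5)
print(s1_large - s1_small)            # keeps growing, roughly like log(N)
print(sp_large, math.pi ** 2 / 6)     # approaches a finite limit
```

The first sum keeps increasing as the cut-off grows, while the sum of squares barely changes and stays near $\pi^2/6$.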

Our output activation function $y(\mathbf{x}(n))$ is also allowed to be nonlinear and nonstatic, but it must be continuously differentiable in both $\mathbf{x}$ and $\mathbf{w}$ and have derivatives bounded in time.[2]

## Applications

Oja's rule was originally described in Oja's 1982 paper,[1] but the principle of self-organization to which it is applied is first attributed to Alan Turing in 1952.[2] PCA has also had a long history of use before Oja's rule formalized its use in network computation in 1989. The model can thus be applied to any problem of self-organizing mapping, in particular those in which feature extraction is of primary interest. Therefore, Oja's rule has an important place in image and speech processing. It is also useful as it expands easily to higher dimensions of processing, thus being able to integrate multiple outputs quickly. A canonical example is its use in binocular vision.[3]

### Biology and Oja's subspace rule

There is clear evidence for both long-term potentiation and long-term depression in biological neural networks, along with a normalization effect in both input weights and neuron outputs. However, while there is no direct experimental evidence yet of Oja's rule active in a biological neural network, a biophysical derivation of a generalization of the rule is possible. Such a derivation requires retrograde signalling from the postsynaptic neuron, which is biologically plausible (see neural backpropagation), and takes the form of

$\Delta w_{ij} ~ \propto ~ \langle x_i y_j \rangle - \epsilon \left\langle \left(c_\mathrm{pre} * \sum_k w_{ik} y_k \right) \cdot \left(c_\mathrm{post} * y_j \right) \right\rangle,$

where, as before, $w_{ij}$ is the synaptic weight between the $i$th input and $j$th output neurons, $x$ is the input, $y$ is the postsynaptic output, $\epsilon$ is a constant analogous to the learning rate, and $c_\mathrm{pre}$ and $c_\mathrm{post}$ are presynaptic and postsynaptic functions that model the weakening of signals over time. The angle brackets denote the average and the $*$ operator is a convolution. By taking the pre- and post-synaptic functions into frequency space and combining integration terms with the convolution, we find that this gives an arbitrary-dimensional generalization of Oja's rule known as Oja's Subspace,[4] namely

$\Delta w ~ = ~ C x\cdot w - w\cdot C y.$[5]

## References

• Földiák P. (1989) Adaptive network for optimal linear feature extraction. Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, Washington D.C., 1:401-405 (IEEE Press, New York, 1989).
• Hyvärinen A. and Oja E. (1998) Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing, 64:301-313
• Hyvärinen A., Karhunen J., and Oja E. (2001) Independent component analysis. Wiley
• Kushner H.J. and Clark D.S. (1978) Stochastic approximation methods for constrained and unconstrained systems. Springer
• Oja E. (1982) A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15:267-273
• Oja E. (1983) Subspace methods of pattern recognition. Research Studies Press
• Oja E. (1992) Principal components, minor components, and linear neural networks. Neural Networks, 5:927-935
• Sanger T. D. (1989) Optimal unsupervised learning in a single-layered linear feedforward network. Neural Networks, 2:459-473

Internal references

• Jan A. Sanders (2006) Averaging. Scholarpedia, 1(11):1760.
• Valentino Braitenberg (2007) Brain. Scholarpedia, 2(11):2918.