# Contingency tables


In statistics, contingency tables are used to record and analyse the relationship between two or more variables, most usually categorical variables.

Suppose that we have two variables, sex (male or female) and handedness (right-handed or left-handed). We observe the values of both variables in a random sample of 100 people. Then a contingency table can be used to express the relationship between these two variables, as follows:

|        | right-handed | left-handed | TOTAL |
|--------|--------------|-------------|-------|
| male   | 43           | 9           | 52    |
| female | 44           | 4           | 48    |
| TOTAL  | 87           | 13          | 100   |

The figures in the right-hand column and the bottom row are called marginal totals and the figure in the bottom right-hand corner is the grand total.

The table allows us to see at a glance that the proportion of men who are right-handed is about the same as the proportion of women who are. However, the two proportions are not identical, and the statistical significance of the difference between them can be tested with a Pearson's chi-square test, a G-test or Fisher's exact test, provided the entries in the table represent a random sample from the population contemplated in the null hypothesis. If the proportions of individuals in the different columns vary between rows (and, therefore, vice versa), we say that the table shows contingency between the two variables. If there is no contingency, we say that the two variables are independent.
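The chi-square test above compares each observed cell count with the count expected under independence. As a minimal stdlib-only sketch (in practice one would use a library routine such as `scipy.stats.chi2_contingency`), the statistic for the example table can be computed like this:

```python
# Pearson's chi-square statistic for the sex-by-handedness table above.
observed = [[43, 9],
            [44, 4]]

row_totals = [sum(row) for row in observed]        # [52, 48]
col_totals = [sum(col) for col in zip(*observed)]  # [87, 13]
n = sum(row_totals)                                # grand total, 100

# Under independence, each cell's expected count is
# (row total * column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

print(round(chi2, 4))  # ≈ 1.7774
```

The small statistic (about 1.78 on one degree of freedom) is consistent with the observation that the two proportions are close but not identical.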

The example above is for the simplest kind of contingency table, in which each variable has only two levels; this is called a 2 × 2 contingency table. In principle, any number of rows and columns may be used. There may also be more than two variables, but higher-order contingency tables are hard to represent on paper. The relationship between ordinal variables, or between ordinal and categorical variables, may also be represented in contingency tables, though this is less often done since the distributions of ordinal variables can be summarised efficiently by the median.

The degree of association between the two variables can be assessed by a number of coefficients: the simplest is the phi coefficient defined by

φ = √(χ² / N)

where χ² is derived from the Pearson test, and N is the grand total number of observations. φ varies from 0 (corresponding to no association between the variables) to 1 (complete association). This coefficient can only be used for 2 × 2 tables. Alternatives include the tetrachoric correlation coefficient (also only useful for 2 × 2 tables), the contingency coefficient C and Cramér's V. C suffers from the disadvantage that it does not reach a maximum of 1 with complete association in asymmetrical tables (those where the numbers of rows and columns are not equal). The tetrachoric correlation coefficient is essentially the Pearson product-moment correlation coefficient between the row and column variables, the value for each observation being taken as 0 or 1 depending on which category it falls into. The formulae for the other coefficients are:

C = √(χ² / (N + χ²))

V = √(χ² / (N(k − 1)))

where k is the number of rows or the number of columns, whichever is smaller.
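The three coefficients can be computed directly from χ² and N. A short sketch, using the χ² ≈ 1.7774 and N = 100 from the example table (illustrative values, not a library API):

```python
import math

chi2, n = 1.7774, 100
k = 2  # the smaller of the number of rows and the number of columns

phi = math.sqrt(chi2 / n)             # phi coefficient (2 x 2 tables only)
c = math.sqrt(chi2 / (n + chi2))      # contingency coefficient C
v = math.sqrt(chi2 / (n * (k - 1)))   # Cramér's V

print(round(phi, 3), round(c, 3), round(v, 3))  # 0.133 0.132 0.133
```

Note that for a 2 × 2 table, k − 1 = 1, so Cramér's V reduces to the phi coefficient.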

C can be adjusted so it reaches a maximum of 1 when there is complete association in a table of any number of rows and columns by dividing it by √((k-1) / k).
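Continuing the sketch with the same illustrative example values, the adjustment divides C by its maximum attainable value √((k − 1) / k):

```python
import math

chi2, n, k = 1.7774, 100, 2
c = math.sqrt(chi2 / (n + chi2))
c_adjusted = c / math.sqrt((k - 1) / k)  # rescaled so 1 means complete association

print(round(c_adjusted, 3))  # 0.187
```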

The term contingency table was first used by Karl Pearson in "On the Theory of Contingency and its Relation to Association and Normal Correlation" in Drapers' Company Research Memoirs (1904) Biometric Series I.[1]