Changes: Statistical reliability

Latest revision as of 18:06, 17 November 2008


In statistics, reliability is the consistency of a set of measurements or of a measuring instrument, and the term is often used to describe a test. Consistency can mean that repeated measurements with the same instrument give, or are likely to give, the same result (test-retest reliability) or, for more subjective instruments such as personality or trait inventories, that two independent assessors give similar scores (inter-rater reliability). Reliability is inversely related to random error.

Reliability does not imply validity. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of accuracy and precision, reliability is precision, while validity is accuracy.

In experimental science, reliability is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions. An experiment is reliable if it yields consistent results of the same measure. It is unreliable if repeated measurements give different results. It can also be interpreted as the lack of random error in measurement.^[1]

A common example used to illustrate the difference between reliability and validity in the experimental sciences is a bathroom scale. If someone who weighs 200 lbs steps on the scale 10 times and it reads "200" each time, then the measurement is both reliable and valid. If the scale consistently reads "150", then it is not valid, but it is still reliable because the measurement is very consistent. If the readings vary widely around 200 (190, 205, 192, 209, etc.), then the scale could be considered valid but not reliable.
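The scale example can be put into numbers. The following is a minimal sketch, not a standard procedure: it assumes the true weight is known to be 200 lbs, treats the mean offset from that value as a rough indicator of (in)validity, and treats the sample standard deviation as a rough indicator of (un)reliability. The function name `describe` and the four "150" readings are made up for illustration; the scattered readings are the ones given in the text.

```python
from statistics import mean, stdev

TRUE_WEIGHT = 200.0  # lbs; assumed known, as in the bathroom-scale example


def describe(readings, true_value=TRUE_WEIGHT):
    """Return (bias, spread) for a list of repeated readings.

    bias   = mean reading minus the true value (systematic error, validity)
    spread = sample standard deviation (random error, reliability)
    """
    bias = mean(readings) - true_value
    spread = stdev(readings)
    return bias, spread


# Hypothetical scale that always reads 150: reliable but not valid.
consistent_wrong = [150, 150, 150, 150]
# Readings quoted in the text: centred near 200 but widely scattered.
scattered = [190, 205, 192, 209]

for name, readings in [("consistent_wrong", consistent_wrong),
                       ("scattered", scattered)]:
    bias, spread = describe(readings)
    print(f"{name}: bias={bias:+.1f} lbs, spread={spread:.1f} lbs")
```

A large bias with near-zero spread corresponds to the "reliable but not valid" case, and a small bias with a large spread to the "valid but not reliable" case.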

Estimation

Reliability may be estimated through a variety of methods that fall into two types: single-administration and multiple-administration. Multiple-administration methods require that two assessments are administered. In the test-retest method, reliability is estimated as the Pearson product-moment correlation coefficient between two administrations of the same measure. In the alternate forms method, reliability is estimated by the Pearson product-moment correlation coefficient of two different forms of a measure, usually administered together. Single-administration methods include split-half and internal consistency. The split-half method treats the two halves of a measure as alternate forms. This "halves reliability" estimate is then stepped up to the full test length using the Spearman-Brown prediction formula. The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients.^[2] Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, Kuder-Richardson Formula 20.^[2]
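The estimators described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: the data are invented, the split-half uses an odd/even item split (one of many possible splits), and the function names are hypothetical.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


def spearman_brown(r, factor=2.0):
    """Predicted reliability when the test length is multiplied by `factor`."""
    return factor * r / (1.0 + (factor - 1.0) * r)


def split_half(scores):
    """Odd/even split-half reliability, stepped up to full test length.

    `scores` is a list of per-person lists of item scores.
    """
    odd = [sum(person[0::2]) for person in scores]
    even = [sum(person[1::2]) for person in scores]
    return spearman_brown(pearson(odd, even))


def cronbach_alpha(scores):
    """Cronbach's alpha from per-person lists of item scores."""
    k = len(scores[0])
    item_vars = [pvariance([person[i] for person in scores]) for i in range(k)]
    total_var = pvariance([sum(person) for person in scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Invented data: total scores of five people on two administrations.
time1 = [12, 15, 9, 20, 17]
time2 = [13, 14, 10, 19, 18]
print("test-retest r:", round(pearson(time1, time2), 3))

# Invented data: six people answering four right/wrong items (1 = correct).
items = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print("split-half (Spearman-Brown):", round(split_half(items), 3))
print("Cronbach's alpha:", round(cronbach_alpha(items), 3))
```

The alternate forms estimate would use the same `pearson` call, applied to scores on the two forms instead of two administrations of one form.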

Each of these estimation methods is sensitive to different sources of error, so the resulting estimates should not be expected to be equal. Also, reliability is a property of the scores of a measure rather than of the measure itself and is thus said to be sample dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variation) if the second sample is drawn from a different population, because the true reliability is different in this second population. (This is true of measures of all types: a yardstick might measure houses well yet have poor reliability when used to measure the lengths of insects.)

Reliability may be improved by clarity of expression (for written assessments), lengthening the measure,^[2] and other informal means. However, formal psychometric analysis, known as item analysis, is considered the most effective way to increase reliability. This analysis consists of computing item difficulties and item discrimination indices, the latter involving computation of correlations between each item and the sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.
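A rough sketch of such an item analysis is shown below. It assumes right/wrong items scored 1 or 0, takes item difficulty to be the proportion of correct answers, and takes item discrimination to be the correlation between the item and the total score, as described above. The response data, the function names, and the treatment of a constant item (discrimination reported as 0.0) are assumptions made for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either sequence has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    if ssx == 0 or ssy == 0:
        return 0.0  # a constant item cannot discriminate between test-takers
    return cov / (ssx * ssy) ** 0.5


def item_analysis(scores):
    """Return a list of (difficulty, discrimination) pairs, one per item.

    `scores` is a list of per-person lists of 0/1 item scores.
    """
    totals = [sum(person) for person in scores]
    results = []
    for i in range(len(scores[0])):
        item = [person[i] for person in scores]
        difficulty = mean(item)                  # proportion answering correctly
        discrimination = pearson(item, totals)   # item-total correlation
        results.append((difficulty, discrimination))
    return results


# Invented responses: six people, four items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
]
for number, (p, r) in enumerate(item_analysis(responses), start=1):
    print(f"item {number}: difficulty={p:.2f}, discrimination={r:+.2f}")
```

In this invented data set the first item is answered correctly by everyone, so it is a candidate for replacement: it is too easy and contributes nothing to distinguishing between test-takers.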

$R(t) = 1 - F(t)$.

$R(t) = \exp(-\lambda t)$, where $\lambda$ is the failure rate.
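The two formulas above are given without context. The sketch below evaluates them under the assumption, not stated in the text, that $F(t)$ is the probability of failure by time $t$ and that $\lambda$ is constant over time. The function names and the numerical values are hypothetical.

```python
import math


def reliability(t, failure_rate):
    """R(t) = exp(-lambda * t), assuming a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def failure_probability(t, failure_rate):
    """F(t) = 1 - R(t), read here as the probability of failure by time t."""
    return 1.0 - reliability(t, failure_rate)


# Hypothetical failure rate of 0.1 per unit of time.
for t in (0, 5, 10):
    print(f"t={t}: R={reliability(t, 0.1):.3f}, F={failure_probability(t, 0.1):.3f}")
```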

Classical test theory

In classical test theory, reliability is defined mathematically as the ratio of the variance of the true score to the variance of the observed score or, equivalently, as one minus the ratio of the variance of the error score to the variance of the observed score:

$\rho_{xx'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}$

where $\rho_{xx'}$ is the reliability of the observed score $X$, and $\sigma^2_X$, $\sigma^2_T$, and $\sigma^2_E$ are the variances of the observed, true, and error scores, respectively. Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test.
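Although true scores cannot be observed in practice, they can be generated in a simulation, which makes the definition concrete. The sketch below assumes normally distributed true scores and independent, normally distributed errors; the standard deviations, sample size, and function names are arbitrary choices for illustration. It compares the variance ratio from the definition with the correlation between two simulated parallel forms, which is one way reliability is estimated when true scores are unknown.

```python
import random
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


def simulate_parallel_forms(n=5000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate true scores T and two observed scores X = T + E."""
    rng = random.Random(seed)
    true = [rng.gauss(100.0, true_sd) for _ in range(n)]
    form_a = [t + rng.gauss(0.0, error_sd) for t in true]
    form_b = [t + rng.gauss(0.0, error_sd) for t in true]
    return true, form_a, form_b


true, form_a, form_b = simulate_parallel_forms()

# Value implied by the simulation settings: sigma_T^2 / (sigma_T^2 + sigma_E^2).
theoretical = 10.0 ** 2 / (10.0 ** 2 + 5.0 ** 2)
# Definition applied to the simulated data (only possible because T is known here).
variance_ratio = pvariance(true) / pvariance(form_a)
# Estimate that uses observed scores only.
parallel_r = pearson(form_a, form_b)

print(f"theoretical={theoretical:.3f}, variance ratio={variance_ratio:.3f}, "
      f"parallel-forms r={parallel_r:.3f}")
```

With a large simulated sample the three values should be close to one another, which illustrates why a correlation between parallel measurements can stand in for the unobservable variance ratio.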

Methods for estimating reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. Each method approaches the problem of identifying the source of error in the test somewhat differently.

Item response theory

It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement. Tests tend to distinguish better among test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single index to a function called the information function. The IRT information function is the inverse of the squared conditional standard error of measurement at any given point on the scale. Higher levels of IRT information indicate higher precision and thus greater reliability.

References

[1] Rudner, L.M., & Shafer, W.D. (2001). Reliability. ERIC Digest. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation. http://www.ericdigests.org/2002-2/reliability.htm
[2] Cortina, J.M. (1993). What Is Coefficient Alpha? An Examination of Theory and Applications. Journal of Applied Psychology, 78(1), 98-104.

External links

This page uses Creative Commons Licensed content from Wikipedia (view authors).


@@ Line 1: / Line 1: @@
 {{StatsPsy}}
-In [[statistics]], '''reliability''' is the consistency of a set of measurements or measuring instrument. Reliability does not imply [[validity (psychometric)|validity]]. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. For example, while there are many reliable tests, not all of them would validly predict job performance.
+In [[statistics]], '''reliability''' is  the consistency of a set of measurements or measuring instrument, often used to describe a [[Test (student assessment)|test]]. This can either be whether the measurements of the same instrument give or are likely to give the same measurement (test-retest), or in the case of more subjective instruments, such as personality or trait inventories, whether two independent assessors give similar scores ([[inter-rater reliability]]). Reliability is inversely related to [[random error]].
+Reliability does not imply [[validity (psychometric)|validity]]. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of [[Accuracy and precision|accuracy and precision]], reliability is precision, while validity is accuracy.
-In [[experiment]]al sciences, '''reliability''' is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions.  An experiment is reliable if it yields consistent results of the same measure. It is unreliable if repeated measurements give different results.
+In experimental science, '''reliability''' is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions.  An experiment is reliable if it yields consistent results of the same measure. It is unreliable if repeated measurements give different results.  It can also be interpreted as the lack of random error in measurement.<ref>Rudner, L.M., & Shafer, W.D. (2001). Reliability.  ''ERIC Digest''. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation.  [http://www.ericdigests.org/2002-2/reliability.htm] </ref>
+An often-used example used to elucidate the difference between reliability and validity in the experimental sciences is a common bathroom scale.  If someone that weighs 200 lbs. steps on the scale 10 times, and it reads "200" each time, then the measurement is reliable and valid.  If the scale consistently reads "150", then it is not valid, but it is still reliable because the measurement is very consistent.  If the scale varied a lot around 200 (190, 205, 192, 209, etc.), then the scale could be considered valid but not reliable.
 ==Estimation==
+Reliability may be estimated through a variety of methods that fall into two types:  single-administration and multiple-administration.  Multiple-administration methods require that two assessments are administered.  In the [[test-retest]] method, reliability is estimated as the [[Pearson product-moment correlation coefficient]] between two administrations of the same measure.  In the ''alternate forms'' method, reliability is estimated by the Pearson product-moment correlation coefficient of two different forms of a measure, usually administered together.  Single-administration methods include ''split-half'' and ''[[internal consistency]]''.  The split-half method treats the two halves of a measure as alternate forms.  This "halves reliability" estimate is then stepped up to the full test length using the [[Spearman-Brown prediction formula]].  The most common internal consistency measure is [[Cronbach's alpha]], which is usually interpreted as the mean of all possible split-half coefficients.<ref name="Cortina">Cortina, J.M., (1993). What Is Coefficient Alpha? An Examination of Theory and Applications. ''Journal of Applied Psychology, 78''(1), 98-104. </ref>  Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, [[Kuder-Richardson Formula 20]].<ref name="Cortina" />
+Each of these estimation methods isn't sensitive to different sources of error and so might not be expected to be equal.  Also, reliability is a property of the ''scores of a measure'' rather than the measure itself and are thus said to be ''sample dependent''.  Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population because the true reliability is different in this second population.  (This is true of measures of all types--yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)
-Reliability may be estimated through a variety of methods that fall into two types:  Single-administration and multiple-administration.  Multiple-administration methods require that two assessments are administered.  In the ''test-retest'' method, reliability is estimated as the [[Pearson product-moment correlation coefficient]] between two administrations of the same measure.  In the ''alternate forms'' method, reliability is estimated by the Pearson product-moment correlation coefficient of two different forms of a measure, usually administered together.  Single-administration methods include ''split-half'' and ''internal consistency''.  The split-half method treats the two halves of a measure as alternate forms.  This "halves reliability" estimate is then stepped up to the full test length using the [[Spearman-Brown prediction formula]].  The most common internal consistency measure is [[Cronbach's alpha]], which is usually interpreted as the mean of all possible split-half coefficients.
+Reliability may be improved by clarity of expression (for written assessments), lengthening the measure,<ref name="Cortina" /> and other informal means. However, formal psychometric analysis, called the item analysis, is considered the most effective way to increase reliability. This analysis consists of computation of '''item difficulties''' and '''item discrimination''' indices, the latter index involving computation of correlations between the items and sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.
-Each of these estimation methods is sensitive to different sources of error and so might not be expected to be equal.  Also, reliability is a property of the ''scores of a measure'' rather than the measure itself and are thus said to be ''sample dependent''.  Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population because the true reliability is different in this second population.  (This is true of measures of all types--yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)
+* <math>R(t) = 1 - F(t)</math>.
-Reliability may be improved by clarity of expression (for written assessments), lengthening the measure, and other informal means. However, formal psychometric analysis, called the item analysis, is considered the most effective way to increase reliability. This analysis consists of computation of '''item difficulties''' and '''item discrimination''' indices, the later index involving computation of correlations between the items and sum of the item scores of the entire test.
+* <math>R(t) = \exp(-\lambda t)</math>. (where <math>\lambda</math> is the failure rate)
-* R(t) = 1 - F(t).
-* R(t) = e^-ʎ.(t). (where ʎ is the failure rate)
 ==Classical test theory==
@@ Line 29: / Line 32: @@
 ==Item response theory==
 It was well-known to classical test theorists that measurement precision is not uniform across the scale of measurement.  Tests tend to distinguish better for test-takers with moderate trait levels and worse among high- and low-scoring test-takers.  [[Item response theory]] extends the concept of reliability from a single index to a function called the ''information function''.  The IRT information function is the inverse of the conditional observed score standard error at any given test score.  Higher levels of IRT information indicate higher precision and thus greater reliability.
 ==See also==
 * [[Accuracy]]
-* [[homogeneity (statistics)]]
+* [[Bayesian inference]]
+* [[Censoring (statistics)]]
+* [[Consistenc (statistics)]]
+* [[Coefficient of variation]]
+* [[Homogeneity (statistics)]]
+* [[Internal consistency]]
 * [[Levels of measurement]]
-* [[Precision]]
+* [[Population (statistics)]]
+* [[Prediction errors]]
+* [[Proportional reduction in loss]]
 * [[Reliability theory]]
+* [[Reproducibility]]
+* [[Sampling (experimental)]]
 * [[Scientific method]]
-* [[Statistics]]
+* [[Statistical validity]]
-* [[Validity (statistics)]]
+==References==
-* [[Variation]]
+{{Reflist}}
 ==External links==
-* [http://www.ericdigests.org/2002-2/reliability.htm Reliability]
-* [http://www.uncertainty-in-engineering.net Uncertainty models, uncertainty quantification, and uncertainty processing in engineering]
 * [http://www.visualstatistics.net/Statistics/Principal%20Components%20of%20Reliability/PCofReliability.asp The relationships between correlational and internal consistency concepts of test reliability]
 * [http://www.visualstatistics.net/Statistics/Reliability%20Negative/Negative%20Reliability.asp The problem of negative reliabilities]
-[[Category:Statistics]]
+[[Category:Comparison of assessments]]
 [[Category:Psychometrics]]
-[[Category:Marketing research]]
+[[Category:Market research]]
-[[Category:Educational psychology]]
+[[Category:Educational psychology research methods]]
+<!--
+[[de:Reliabilität]]
+[[id:Reliabilitas]]
+[[it:Attendibilità]]
 [[no:Reliabilitet]]
 [[pl:Rzetelność (psychometria)]]
+[[ru:Надёжность психологического теста]]
+[[su:Réliabilitas (statistika)]]
 [[sv:Reliabilitet]]
+-->
 {{enWP|Reliability (psychometric)}}