Psychology Wiki
m (Sampling (statistics) moved to Sampling (experimental): align thesaurus)
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
{{StatsPsy}}
 
{{StatsPsy}}
   
'''Sampling''' is that part of [[statistical practice]] concerned with the selection of individual observations intended to yield some knowledge about a [[population (statistics)|population]] of concern, especially for the purposes of [[statistical inference]]. In particular, results from [[probability theory]] and [[statistical theory]] are employed to guide practice.
+
'''Sampling''' is that part of [[statistical practice]] concerned with the selection of individual observations intended to yield some knowledge about a [[population (statistics)|population]] of concern, especially for the purposes of [[statistical inference]].
  +
Each '''observation''' measures one or more properties (weight, location, etc.) of an observable entity enumerated to distinguish objects or individuals. Survey weights often need to be applied to the data to adjust for the sample design. Results from [[probability theory]] and [[statistical theory]] are employed to guide practice.
   
The sampling process consists of five stages (Makerere University Institute of Statistics & Applied Economics (ISAE)):
+
The sampling process comprises several stages:
   
* Definition of population of concern
+
* Defining the population of concern
  +
* Specifying a [[#Sampling_frame|sampling frame]], a [[Set (mathematics)|set]] of items or events possible to measure
* Specification of a sampling frame, the source from which a sample is drawn. It is a list of all those within a population who can be sampled, and may include individuals, households or institutions
 
* Specification of a sampling method for selecting items or events from the frame
+
* Specifying a [[#Sampling_method|sampling method]] for selecting items or events from the frame
  +
* Determining the sample size
  +
* Implementing the sampling plan
 
* Sampling and data collecting
 
* Sampling and data collecting
* Review of sampling process
+
* Reviewing the sampling process
   
 
==Population definition==
 
==Population definition==
Successful statistical practice is based on focused [[problem definition]]. Typically, we seek to take action on some [[Statistical population|population]], for example trying to reduce the incidence of psychological problems in people involved in major
+
Successful statistical practice is based on focused [[problem definition]]. Typically, we seek to take action on some [[Statistical population|population]], for example when a [[batch]] of material from [[batch production|production]] must be released to the customer or sentenced for scrap or rework.
  +
accidents.
 
  +
Alternatively, we seek knowledge about the [[cause system]] of which the population is an outcome, for example when a researcher performs an experiment on rats with the intention of gaining insights into [[biochemistry]] that can be applied for the benefit of [[humans]]. In the latter case, the population of concern can be difficult to specify, as it is in the case of measuring some physical characteristic such as the [[electrical conductivity]] of [[copper]].
Time spent in making the population of concern precise is often well spent because it raises many issues, ambiguities and questions that would otherwise have been overlooked at this stage.
 
  +
 
However, in all cases, time spent in making the population of concern precise is usually well spent because it raises many issues, ambiguities and questions that would otherwise have been overlooked at this stage.
   
 
==Sampling frame==
 
==Sampling frame==
Line 21: Line 26:
 
These imprecise populations are not amenable to sampling in any of the ways below and to which we could apply statistical theory.
 
These imprecise populations are not amenable to sampling in any of the ways below and to which we could apply statistical theory.
   
As a remedy, we seek a ''sampling frame'' which has the property that we can identify every single element and include any in our sample. For example, in an electoral poll, possible sampling frames include:
+
As a remedy, we seek a ''sampling frame'' which has the property that we can identify every single element and include any in our sample. For example, in an [[opinion poll]], possible sampling frames include:
   
* Electoral register
+
* [[Electoral register]]
* Telephone directory
+
* [[Telephone directory]]
 
* Shoppers in Anytown, High Street on the Monday afternoon before the election.
 
* Shoppers in Anytown, High Street on the Monday afternoon before the election.
   
The sampling frame must be representative of the population and this is a question outside the scope of statistical theory demanding the judgement of experts in the particular subject matter being studied. All the above frames omit some people who will vote at the next election and contain some people who will not. People not in the frame have no prospect of being sampled. Statistical theory tells us about the uncertainties in extrapolating from a sample to the frame. In extrapolating from frame to population its role is motivational and suggestive.
+
The sampling frame must be representative of the population and this is a question outside the scope of statistical theory demanding the judgment of experts in the particular subject matter being studied. All the above frames omit some people who will vote at the next election and contain some people who will not. People not in the frame have no prospect of being sampled. Statistical theory tells us about the uncertainties in extrapolating from a sample to the frame. In extrapolating from frame to population, its role is motivational and suggestive.
   
  +
There is, however, a strong but unnoticed division of views about the acceptability of representative sampling across different domains of study. To the philosopher or doctor, the representative sampling procedure has no justification whatsoever because it is not how truth is pursued in philosophy. "To the scientist, however, representative sampling is the only justified procedure for choosing individual objects for use as the basis of generalization, and is therefore usually the only acceptable basis for ascertaining truth." (Andrew A. Marino) [http://www.ortho.lsuhsc.edu/Faculty/Marino/Point1/Representative.html]. It is important to understand this difference to steer clear of confusing prescriptions found in many web pages.
In defining the frame, practical, economic, ethical and technical issues need to be addressed. The need to obtain timely results may prevent extending the frame far into the future.
 
   
 
In defining the frame, practical, economic, ethical, and technical issues need to be addressed. The need to obtain timely results may prevent extending the frame far into the future.
The difficulties can be extreme when the population and frame are [[disjoint]]. This is a particular problem in [[forecasting]] where inferences about the future are made from historical [[data]]. In fact, in 1703, when Jacob Bernoulli proposed to Gottfried Leibniz the possibility of using historical mortality data to predict the [[probability]] of early death of a living man, Leibniz recognised the problem in replying:
 
   
 
The difficulties can be extreme when the population and frame are [[disjoint]]. This is a particular problem in [[forecasting]] where inferences about the future are made from historical [[data]]. In fact, in 1703, when [[Jacob Bernoulli]] proposed to [[Gottfried Leibniz]] the possibility of using historical mortality data to predict the [[probability]] of early death of a living man, [[Gottfried Leibniz]] recognized the problem in replying:
''Nature has established patterns originating in the return of events but only for the most part. New illnesses flood the human race, so that no matter how many experiments you have done on corpses, you have not thereby imposed a limit on the nature of events so that in the future they could not vary.''
 
   
 
"Nature has established patterns originating in the return of events but only for the most part. New illnesses flood the human race, so that no matter how many experiments you have done on corpses, you have not thereby imposed a limit on the nature of events so that in the future they could not vary."
Having established the frame, there are a number of ways of organising it to improve efficiency and effectiveness.
 
   
 
Having established the frame, there are a number of ways of organizing it to improve efficiency and effectiveness.
===Simple sampling===
 
In this case, all elements of the frame are treated equally and it is not subdivided or partitioned. One of the sampling methods below is applied to the whole frame.
 
   
  +
It is at this stage that the researcher should decide whether the sample is in fact to be the whole population and would therefore be a census.
===[[Stratified sampling]]===
 
  +
Where the population embraces a number of distinct categories, the frame can be organised by these categories into separate ''strata'' or [[demographics]]. One of the sampling methods below is then applied to each ''stratum'' separately. Major gains in efficiency (either lower sample sizes or higher precision) can be achieved by varying the [[sampling fraction]] from stratum to stratum. The sample size should be made proportional to the stratum [[standard deviation]]. From the efficiency point of view (i.e. maximum precision for a given sample size) strata should be chosen to have
 
 
==Sampling method==
* [[mean]]s which differ substantially from one another
 
 
Within any of the types of frame identified above, a variety of sampling methods can be employed, individually or in combination.
* [[variance]]s which are different from one another, and lower than the overall variance
 
  +
Sampling is divided into two categories:
  +
1. Probability Sampling
 
2. Nonprobability Sampling
  +
  +
Probability sampling includes: Simple Random Method, Systematic Sampling, Stratified Sampling and Cluster or Multistage Sampling.
  +
<br />Nonprobability Sampling includes: Accidental Sampling, Quota Sampling and Purposive Sampling.
  +
 
===Simple random sampling===
  +
In a [[simple random sample]] of a given size, all such subsets of the frame are given an equal probability. Each element of the frame thus has an equal probability of selection: the frame is not subdivided or partitioned. It is possible that the sample will not be completely random.
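The definition above can be illustrated with a minimal Python sketch. The frame contents and the function name <code>simple_random_sample</code> are hypothetical and chosen only for illustration; the sketch relies on the standard library's <code>random.sample</code>, which draws without replacement so that every subset of the requested size is equally likely.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct elements so that every subset of size n is equally likely."""
    rng = random.Random(seed)          # a seed is used only to make the draw repeatable
    return rng.sample(list(frame), n)  # sampling without replacement

# Hypothetical frame of 100 named units.
frame = [f"person_{i}" for i in range(1, 101)]
sample = simple_random_sample(frame, 10, seed=1)
```

The frame is neither subdivided nor reordered before selection, which is what distinguishes this method from the stratified and systematic designs.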
  +
 
===Systematic sampling===
  +
Selecting (say) every ''10''th name from the telephone directory is called an '''every ''10''th''' sample, which is an example of [[systematic sampling]]. Provided the starting point is chosen at random, it is a type of [[probability sampling]], since every name then has the same chance of selection. It is easy to implement and the [[Stratified sampling|stratification]] induced can make it efficient, but it is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple or factor of ''10'', then [[Biased sample|bias]] will result. It is important that the first name chosen is not simply the first in the list, but is chosen to be (say) the ''7''th, where ''7'' is a random integer in the range 1,...,''10''. Every ''10''th sampling is especially useful for efficient sampling from [[databases]].
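An every-''k''th selection with a random start can be sketched in Python as follows. The function name <code>systematic_sample</code> and the numeric frame are assumptions made for illustration, not part of any particular library.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Take every k-th element of the frame, starting from a random position."""
    rng = random.Random(seed)
    start = rng.randint(1, k)   # random start in 1..k (inclusive), counted from 1
    return frame[start - 1::k]  # convert to a 0-based index, then step by k

# Hypothetical frame: directory entries numbered 1..100.
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=0)
```

Because the only random choice is the start, just ''k'' distinct samples are possible. This is why any periodicity in the list that lines up with ''k'' can bias the result.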
  +
 
===Stratified sampling===
 
Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata." A sample is then selected from each "stratum" separately, producing a [[stratified sample]]. The two main reasons for using a stratified sampling design are [1] to ensure that particular groups within a population are adequately represented in the sample, and [2] to improve efficiency by gaining greater control on the composition of the sample. In the second case, major gains in efficiency (either lower sample sizes or higher precision) can be achieved by varying the [[sampling fraction]] from stratum to stratum. The sample size is usually proportional to the relative size of the strata. However, if variances differ significantly across strata, the sampling fraction in each stratum should be made proportional to the stratum [[standard deviation]]. Disproportionate stratification can provide better precision than proportionate stratification. Typically, strata should be chosen to:
 
* have [[mean]]s which differ substantially from one another
  +
* minimize [[variance]] within strata and maximize variance between strata.
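The two allocation rules mentioned in this section can be compared with a short Python sketch. It assumes that the rule based on standard deviations means the usual optimal (Neyman) allocation, in which a stratum's sample size is proportional to its size multiplied by its standard deviation. The stratum names, sizes and standard deviations are invented for illustration.

```python
def proportional_allocation(sizes, n):
    """Split a total sample of n across strata in proportion to stratum size."""
    total = sum(sizes.values())
    return {h: n * size / total for h, size in sizes.items()}

def neyman_allocation(sizes, sds, n):
    """Split n in proportion to (stratum size x stratum standard deviation)."""
    weights = {h: sizes[h] * sds[h] for h in sizes}
    total = sum(weights.values())
    return {h: n * w / total for h, w in weights.items()}

# Hypothetical strata: sizes and standard deviations are illustrative only.
sizes = {"urban": 8000, "rural": 2000}
sds = {"urban": 10.0, "rural": 40.0}

prop = proportional_allocation(sizes, 500)   # urban 400, rural 100
neym = neyman_allocation(sizes, sds, 500)    # urban 250, rural 250
```

In this example the more variable rural stratum receives a larger share under the second rule than its size alone would give it. The functions return unrounded values; rounding to whole units is left to the user.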
   
 
===Cluster sampling===
 
===Cluster sampling===
  +
Sometimes it is cheaper to 'cluster' the sample in some way e.g. by selecting respondents from certain areas only, or certain time-periods only. (Nearly all samples are in some sense 'clustered' in time - although this is rarely taken into account in the analysis.)
Random sampling of a population spread across a large area, e.g. all of Europe, involves a lot of travelling, cost and delay. '''Cluster''' or '''area sampling''' addresses this problem. There are three stages: (1) the target population is divided into many regional clusters (groups), e.g. London, Berlin, Rome, etc.; (2) a few clusters are randomly selected for study; (3) a few subjects are randomly chosen from within each selected cluster.
 
   
  +
[[Cluster sampling]] is an example of '[[two-stage sampling]]' or '[[multistage sampling]]': in the first stage a sample of areas is chosen; in the second stage a sample of respondents ''within'' those areas is selected.
===Quota sampling===
 
In '''quota sampling''', the population is first segmented into [[mutually exclusive]] sub-groups, just as in [[stratified sampling]]. Then judgement is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the age of 45 and 60.
 
   
  +
This can reduce travel and other administrative costs. It also means that one does not need a [[sampling frame]] for the entire population, but only for the selected clusters.
It is this second step which makes the technique one of non-probability sampling. In quota sampling the selection of the sample is non-[[random]]. For example interviewers might be tempted to interview those people in the street who look most helpful. The problem is that these samples may be [[biased]] because not everyone gets a chance of selection. This non-random element is its greatest weakness and quota versus probability has been a matter of controversy for many years.
 
  +
'''
  +
Cluster sampling generally increases the variability of sample estimates above that of simple random sampling, depending on how the clusters differ between themselves, as compared with the within-cluster variation.
  +
'''
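A two-stage selection of this kind might be sketched in Python as follows. The function name <code>two_stage_cluster_sample</code>, the cluster names and the member labels are all hypothetical.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: pick units within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical clusters: city name -> list of residents in that city's frame.
clusters = {
    "London": [f"L{i}" for i in range(20)],
    "Berlin": [f"B{i}" for i in range(20)],
    "Rome": [f"R{i}" for i in range(20)],
    "Madrid": [f"M{i}" for i in range(20)],
}
result = two_stage_cluster_sample(clusters, n_clusters=2, n_per_cluster=3, seed=42)
```

Only the chosen clusters need a list of their members, which reflects the point that a frame for the entire population is not required.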
   
 
===Matched random sampling===
==Sampling method==
 
  +
A method of assigning participants to groups in which pairs of participants are first matched on some characteristic and then individually assigned randomly to groups. (Brown, Cozby, Kee, & Worden, 1999, p.371).
Within any of the types of frame identified above, a variety of sampling methods can be employed, individually or in combination.
 
   
  +
The procedure for matched random sampling can be summarized in the following contexts:
===Random sampling===
 
   
  +
a) Two samples in which the members are clearly paired, or are matched explicitly by the researcher. For example, IQ measurements on pairs of identical twins.
{{Main|Random sampling}}
 
   
  +
b) Those samples in which the same attribute, or variable, is measured twice on each subject, under different circumstances. Commonly called repeated measures. Examples include the times of a group of athletes for 1500m before and after a week of special training; the milk yields of cows before and after being fed a particular diet.
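The matching step followed by random assignment can be sketched in Python. This is only one plausible reading of the definition: participants are ranked on the matching characteristic, adjacent participants form a pair, and a random draw decides which member of each pair goes to which group. The function name, group labels and scores are hypothetical.

```python
import random

def matched_random_assignment(scores, seed=None):
    """Pair participants with adjacent scores, then randomly split each pair."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)   # order participants by the matching score
    group_a, group_b = [], []
    for i in range(0, len(ranked) - 1, 2):    # an unpaired last participant is left out
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)                     # random assignment within the pair
        group_a.append(pair[0])
        group_b.append(pair[1])
    return group_a, group_b

# Hypothetical matching characteristic (e.g. a pre-test score) per participant.
scores = {"p1": 98, "p2": 101, "p3": 115, "p4": 117, "p5": 130, "p6": 128}
group_a, group_b = matched_random_assignment(scores, seed=3)
```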
In random sampling, also known as probability sampling, every combination of items from the frame, or stratum, has a known probability of occurring, but these probabilities are not necessarily equal. With any form of sampling there is a risk that the sample may not adequately represent the population but with random sampling there is a large body of statistical theory which quantifies the risk and thus enables an appropriate sample size to be chosen. Furthermore, once the sample has been taken the [[sampling error]] associated with the measured results can be computed. With non-random sampling there is no measure of the associated sampling error. While such methods may be cheaper this is largely meaningless since there is no measure of quality. There are several forms of random sampling. For example, in [[simple random sample|simple random sampling]], each element has an equal probability of occurring. It may be infeasible in many practical situations. Other examples of probability sampling include [[stratified sampling]] and [[multistage sampling]].
 
   
===[[Systematic sampling]]===
+
===Quota sampling===
 
In '''quota sampling''', the population is first segmented into [[mutually exclusive]] sub-groups, just as in [[stratified sampling]]. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60.
Selecting (say) every tenth name from the telephone directory is simple to implement and is an example of [[systematic sampling]]. Though simple to implement, asymmetries and biases in the structure of the data can lead to [[bias (statistics)|bias]] in results. It is a type of [[nonprobability sampling]] unless the directory itself is randomized.
 
  +
 
It is this second step which makes the technique one of non-probability sampling. In quota sampling the selection of the sample is non-[[random]]. For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be [[Biased samples|biased]] because not everyone gets a chance of selection. This non-random element is its greatest weakness and quota versus probability has been a matter of controversy for many years.
   
 
===Mechanical sampling===
 
===Mechanical sampling===
[[Mechanical sampling]] occurs typically in sampling [[solid]]s, [[liquid]]s and [[gas]]es, using devices such as grabs, scoops, [[thief probe]]s, the [[coliwasa]] and [[riffle splitter]].
+
[[Mechanical sampling]] is typically used in sampling [[solid]]s, [[liquid]]s and [[gas]]es, using devices such as grabs, scoops, [[thief probe]]s, the [[Composite Liquid Waste Sampler|COLIWASA]] and [[riffle splitter]].
   
Mechanical sampling is not [[randomness|random]] and is a type of [[nonprobability sampling]]. Care is needed in ensuring that the sample is representative of the frame. Much work in this area was developed by [[Pierre Gy]].
+
Care is needed in ensuring that the sample is representative of the frame. Much work in this area was developed by [[Pierre Gy]].
   
 
===Convenience sampling===
 
===Convenience sampling===
Sometimes called, ''grab'' or ''opportunity'' sampling, this is the method of choosing items arbitrarily and in an unstructured manner from the frame. Though almost impossible to treat rigorously, it is the method most commonly employed in many practical situations. In social science research, [[snowball sampling]] is a similar technique, where existing study subjects are used to recruit more subjects into the sample.
+
Sometimes called ''grab'' or ''opportunity'' sampling, this is the method of choosing items arbitrarily and in an unstructured manner from the frame. Though almost impossible to treat rigorously, it is the method most commonly employed in many practical situations. In social science research, [[snowball sampling]] is a similar technique, where existing study subjects are used to recruit more subjects into the sample.
   
===Sample size===
+
===Line-intercept sampling===
  +
[[Line-intercept sampling]] is a method of sampling elements in a region whereby an element is sampled if a chosen line segment, called a “transect”, intersects the element.
Where the frame and population are identical, statistical theory yields exact recommendations on sample size. However, where it is not straightforward to define a frame representative of the population, it is more important to understand the [[cause system]] of which the population are outcomes and to ensure that all sources of variation are embraced in the frame. Large number of observations are of no value if major sources of variation are neglected in the study. In other words, it is taking a sample group that matches the survey category and is easy to survey.
 
  +
  +
==Other types==
 
*[[Accidental sample]]
 
*[[Cluster sample]]
 
*[[Opportunity sample]]
  +
*[[Random digit dialling]]
  +
*[[Probability sample]]
  +
*[[Self-selected sample]]
  +
  +
==Sample size==
 
Where the frame and population are identical, statistical theory yields exact recommendations on [[sample size]].<ref>Mathematical details are displayed in the [[Sample size]] article.</ref> However, where it is not straightforward to define a frame representative of the population, it is more important to understand the [[cause system]] of which the population are outcomes and to ensure that all sources of variation are embraced in the frame. Large numbers of observations are of no value if major sources of variation are neglected in the study. In other words, it is taking a sample group that matches the survey category and is easy to survey. Bartlett, Kotrlik, and Higgins (2001) published a paper titled [http://www.osra.org/itlpj/bartlettkotrlikhiggins.pdf Organizational Research: Determining Appropriate Sample Size in Survey Research] in the ''Information Technology, Learning, and Performance Journal'' that provides an explanation of Cochran’s (1977) formulas. A discussion and illustration of sample size formulas, including the formula for adjusting the sample size for smaller populations, is included. A table is provided that can be used to select the sample size for a research problem based on three alpha levels and a set error rate.
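As a rough illustration of the kind of calculation involved, the following Python sketch uses the widely quoted form of Cochran's formula for estimating a proportion, with the usual adjustment for a small population. The function name and default values (''p'' = 0.5, a 5% margin of error, ''z'' = 1.96 for 95% confidence) are assumptions for this example and are not taken from the paper cited above.

```python
import math

def cochran_sample_size(p=0.5, margin=0.05, z=1.96, population=None):
    """Sample size for estimating a proportion p to within +/- margin."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for a very large population
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)   # adjustment for a small population
    return math.ceil(n0)                        # round up to a whole unit

large = cochran_sample_size()                   # 385
small = cochran_sample_size(population=1000)    # 278
```

The adjustment matters only when the unadjusted size is a noticeable fraction of the population, as in the second call.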
  +
  +
==Types of data==
  +
===Categorical and numerical===
  +
There are two types of random variables: categorical and numerical. '''Categorical random variables''' yield responses such as 'yes' or 'no'. Categorical variables can yield more than two possible responses. For example: 'Which day of the week are you most likely to wash clothes?' '''Numerical random variables''' yield numerical responses, such as your height in centimeters.
  +
  +
There are two types of numerical variables: discrete and continuous. '''Discrete random variables''' produce numerical responses from a counting process. An example is 'how many times do you visit the cash machine in a typical month?' '''Continuous random variables''' produce responses from a measuring process. Height is an example of a continuous variable because the response takes on a value from an interval. Precision of the measurement instrument(s) may lead to ''tied observations''. A tied observation occurs when the measuring device is not sensitive or sophisticated enough to detect incremental differences in the experimental or survey data.
  +
  +
Generally, a continuous random variable requires fewer samples than a discrete random variable. This can be justified by referring to the [[Central Limit Theorem]].
   
 
==Sampling and data collection==
 
==Sampling and data collection==
Line 83: Line 127:
 
* Noting comments and other contextual events
 
* Noting comments and other contextual events
 
* Recording non-responses
 
* Recording non-responses
  +
  +
Most sampling books and papers written by non-statisticians focus only on the data collection aspect, which is just a small part of the sampling process.
   
 
==Review of sampling process==
 
==Review of sampling process==
 
After sampling, a review should be held of the exact process followed in sampling, rather than that intended, in order to study any effects that any divergences might have on subsequent analysis. A particular problem is that of ''non-responses''.
 
After sampling, a review should be held of the exact process followed in sampling, rather than that intended, in order to study any effects that any divergences might have on subsequent analysis. A particular problem is that of ''non-responses''.
   
===Non-responses===
+
===Non-response===
In [[survey sampling]], many of the individuals identified as part of the sample may be unwilling to participate or impossible to contact. In this case, there is a risk of differences, between (say) the willing and unwilling, leading to [[selection bias]] in conclusions. This is often addressed by follow-up studies which make a repeated attempt to contact the unresponsive and to characterise their similarities and differences with the rest of the frame.
+
In [[survey sampling]], many of the individuals identified as part of the sample may be unwilling to participate or impossible to contact. In this case, there is a risk of differences, between (say) the willing and unwilling, leading to [[selection bias]] in conclusions. This is often addressed by follow-up studies which make a repeated attempt to contact the unresponsive and to characterize their similarities and differences with the rest of the frame. The effects can also be mitigated by weighting the data when population benchmarks are available. Nonresponse is particularly a problem in internet sampling. One of the main reasons for this problem could be that people may hold multiple e-mail addresses, some of which they no longer use or do not check regularly.
   
==Weighting of samples==
+
==Survey weights==
   
 
In many situations the sample fraction may be varied by stratum and data will have to be weighted to correctly represent the population. Thus for example, a simple random sample of individuals in the United Kingdom might include some in remote Scottish islands who would be inordinately expensive to sample. A cheaper method would be to use a stratified sample with urban and rural strata. The rural sample could be under-represented in the sample, but weighted up appropriately in the analysis to compensate.
 
In many situations the sample fraction may be varied by stratum and data will have to be weighted to correctly represent the population. Thus for example, a simple random sample of individuals in the United Kingdom might include some in remote Scottish islands who would be inordinately expensive to sample. A cheaper method would be to use a stratified sample with urban and rural strata. The rural sample could be under-represented in the sample, but weighted up appropriately in the analysis to compensate.
   
  +
More generally, data should usually be weighted if the sample design does not give each individual an equal chance of being selected. For instance, when households have equal selection probabilities but one person is interviewed from within each household, this gives people from large households a smaller chance of being interviewed. This can be accounted for using survey weights. Similarly, households with more than one telephone line have a greater chance of being selected in a random digit dialing sample, and weights can adjust for this.
==History of sampling==
 
The idea of random sampling by the use of lots is an old one, mentioned several times in the Bible. In 1786 Pierre Simon [[Laplace]] estimated the population of France by using a sample, along with [[ratio estimator]]. He also computed probabilistic estimates of the error. These were not expressed as modern [[confidence interval]]s but as the sample size that would be needed to achieve a particular upper bound on the sampling error with probability 1000/1001. His estimates used [[Bayes' theorem]] with a uniform [[prior probability]] and it assumed his sample was random. The theory of small-sample statistics developed by [[William Sealy Gossett|William Sealy Gosset]] put the subject on a more rigorous basis in the 20th century. However, the importance of random sampling was not universally appreciated and in the USA the 1936 ''Literary Digest'' prediction of a Republican win in the presidential election went badly awry, due to severe [[bias]]. A sample size of one million was obtained through magazine subscription lists and telephone directories. It was not appreciated that these lists were heavily biased towards Republicans and the resulting sample, though very large, was deeply flawed.
 
   
  +
Weights can also serve other purposes, such as helping to correct for non-response.
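The household example can be made concrete with a small Python sketch. It assumes that households are selected with equal probability and that one person is interviewed per household, so each respondent's weight is taken to be proportional to household size (the inverse of the within-household selection probability). The data and the function name <code>weighted_mean</code> are invented for illustration.

```python
def weighted_mean(values, weights):
    """Weighted average of the responses, using one survey weight per respondent."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical respondents: (household size, response value).
respondents = [(1, 10.0), (2, 20.0), (4, 30.0)]

values = [value for _, value in respondents]
weights = [size for size, _ in respondents]     # weight proportional to household size

unweighted = sum(values) / len(values)          # treats every respondent alike
estimate = weighted_mean(values, weights)       # gives large households their due share
```

In this made-up data the responses rise with household size, so the weighted estimate is higher than the unweighted mean, which under-represents people from large households.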
   
==See also==
+
==History==
 
Random sampling by using lots is an old idea, mentioned several times in the Bible. In 1786 Pierre Simon [[Laplace]] estimated the population of France by using a sample, along with [[ratio estimator]]. He also computed probabilistic estimates of the error. These were not expressed as modern [[confidence interval]]s but as the sample size that would be needed to achieve a particular upper bound on the sampling error with probability 1000/1001. His estimates used [[Bayes' theorem]] with a uniform [[prior probability]] and it assumed his sample was random. The theory of small-sample statistics developed by [[William Sealy Gossett|William Sealy Gosset]] put the subject on a more rigorous basis in the 20th century. However, the importance of random sampling was not universally appreciated and in the USA the 1936 ''[[Literary Digest]]'' prediction of a Republican win in the [[U.S. presidential election, 1936|presidential election]] went badly awry, due to severe [[bias]]. A sample size of one million was obtained through magazine subscription lists and telephone directories. It was not appreciated that these lists were heavily biased towards Republicans and the resulting sample, though very large, was deeply flawed.
* [[Simple random sample]]
 
* [[Cluster sampling]]
 
* [[Marketing research]]
 
* [[Multistage sampling]]
 
* [[Nonprobability sampling]]
 
* [[Quantitative marketing research]]
 
* [[Random sample]]
 
* [[Stratified sampling]]
 
* [[Systematic sampling]]
 
   
  +
== See also ==
==Graduate Degree Programs specializing in Sampling/Survey Methods==
 
 
*[[Biased sampling]]
  +
*[[Data collection]]
  +
*[[Experimental design]]
  +
*[[Experimentation]]
  +
*[[Official statistics]]
 
*[[Random sampling]]
  +
*[[Replication (statistics)]]
  +
*[[Reporting bias]]
  +
*[[Sample (statistics)]]
  +
*[[Sample size]] [[rule of thumb]] for [[estimate]] of [[population mean]]
  +
*[[Sampling (case studies)]]
  +
*[[Sampling error]]
  +
*[[Selection bias]]
  +
*[[Spectrum bias]]
  +
*[[Statistical analysis]]
  +
*[[Statistical power]]
  +
*[[Statistical reliability]]
  +
*[[Statistical samples]]
  +
*[[Statistical variables]]
   
  +
==External links==
Psychology
 
  +
*[http://www.socialresearchmethods.net/kb/sampling.php Chapter on Sampling at the Research Methods Knowledge Base]
  +
*[http://www.statpac.com/surveys/sampling.htm Survey Sampling Methods at the StatPac survey software site]
  +
*[http://trsl.sourceforge.net/ TRSL &ndash; Template Range Sampling Library] is a C++ library that implements several sampling schemes behind an (STL-like) iterator interface.
  +
*[http://inderscience.metapress.com/openurl.asp?genre=article&eissn=1740-8857&volume=4&issue=4&spage=393 Continuous Sampling vs. Costs - Electronics Industry Example]
   
  +
==Notes==
===Doctoral and Masters Degrees===
 
  +
{{reflist}}
*[http://www.socsci.soton.ac.uk/socstats/Study_Opportunities/Postgraduate/Programmes/phdbyresearch.php?NavContext=Discipline/ Program in Social Statistics (Survey Methodology) - University of Southampton]
 
*[http://www.jpsm.umd.edu Joint Program in Survey Methodology (JPSM) - University of Maryland-College Park and University of Michigan-Ann Arbor]
 
*[http://sram.unl.edu/ Survey Research and Methodology - University of Nebraska-Lincoln]
 
*[http://www.isr.umich.edu/gradprogram/ Program in Survey Methodology - University of Michigan-Ann Arbor]
 
   
  +
==References==
===Masters Degrees Only===
 
  +
* Brown, K.W., Cozby, P.C., Kee, D.W., & Worden, P.E. (1999). ''Research Methods in Human Development,'' 2d ed. Mountain View, CA: Mayfield. ISBN 1-55934-875-5
*[http://www.socsci.soton.ac.uk/ Magister in Official Statistics - University of Southampton, UK]
 
  +
* [http://www.osra.org/itlpj/bartlettkotrlikhiggins.pdf Bartlett, J. E., II, Kotrlik, J. W., & Higgins, C. (2001). Organizational research: Determining appropriate sample size for survey research. Information Technology, Learning, and Performance Journal, 19(1) 43-50.]
*[http://www.msr.uconn.edu/ Graduate Program in Survey Research - University of Connecticut]
 
  +
* Chambers, R L, and Skinner, C J (editors) (2003), ''Analysis of Survey Data'', Wiley, ISBN 0-471-89987-9
*[http://www.stat.huji.ac.il/diploma.htm Diploma in Official Statistics - Hebrew University, Israel]
 
 
* Cochran, W G (1977) ''Sampling Techniques'', Wiley, ISBN 0-471-16240-X
*[http://www.fss.uu.nl/master/mands Methodology and Statistics for the Social and Behavioral Sciences - Utrecht University, the Netherlands]
 
 
* Deming, W E (1975) On probability as a basis for action, ''The American Statistician'', 29(4), pp146-152.
 
  +
* Flyvbjerg, B (2006) "Five Misunderstandings About Case Study Research." Qualitative Inquiry, vol. 12, no. 2, April 2006, pp. 219-245. [http://flyvbjerg.plan.aau.dk/Publications2006/0604FIVEMISPUBL2006.pdf]
==Bibliography==
 
* Cochran, W G (1977) ''Sampling Techniques''
 
* Deming, W E (1975) On probability as a basis for action, ''The American Statistician'', 29(4), pp146-152
 
 
* Gy, P (1992) ''Sampling of Heterogeneous and Dynamic Material Systems: Theories of Heterogeneity, Sampling and Homogenizing''
 
* Gy, P (1992) ''Sampling of Heterogeneous and Dynamic Material Systems: Theories of Heterogeneity, Sampling and Homogenizing''
* Sarndal, Swenson, and Wretman (1992), Model Assisted Survey Sampling, Springer-Verlag.
+
* Kish, L (1995) ''Survey Sampling'', Wiley, ISBN 0-471-10949-5
  +
* Korn, E L, and Graubard, B I (1999) ''Analysis of Health Surveys'', Wiley, ISBN 0-471-13773-1
  +
* Lohr, H (1999) ''Sampling: Design and Analysis'', Duxbury, ISBN 0-534-35361-4
  +
* Sarndal, Swenson, and Wretman (1992), Model Assisted Survey Sampling, Springer-Verlag, ISBN 0-387-40620-4
 
* Stuart, Alan (1962) ''Basic Ideas of Scientific Sampling'', Hafner Publishing Company, New York
 
* Stuart, Alan (1962) ''Basic Ideas of Scientific Sampling'', Hafner Publishing Company, New York
  +
*ASTM E105 Standard Practice for Probability Sampling Of Materials
  +
*ASTM E122 Standard Practice for Calculating Sample Size to Estimate, With a Specified Tolerable Error, the Average for Characteristic of a Lot or Process
  +
*ASTM E141 Standard Practice for Acceptance of Evidence Based on the Results of Probability Sampling
  +
*ASTM E1402 Standard Terminology Relating to Sampling
  +
*ASTM E1994 Standard Practice for Use of Process Oriented AOQL and LTPD Sampling Plans
  +
*ASTM E2234 Standard Practice for Sampling a Stream of Product by Attributes Indexedby AQL
   
[[Category:Statistics]]
{{Statistics}}
[[Category:Sampling (statistics)| ]]
[[Category:Sampling (experimental)]]

<!--
[[da:Stikprøve]]
[[es:Muestreo en estadística]]
[[fr:Échantillon (statistiques)]]
[[ko:표집]]
[[id:Teknik sampling]]
[[it:Campionamento statistico]]
[[he:מדגם]]
[[lt:Atranka]]
[[hu:Mintavétel]]
[[ja:標本調査]]
[[pl:Dobór próby]]
[[pt:Base de sondagem]]
[[ru:Семплирование (математическая статистика)]]
[[simple:Sampling (statistics)]]
[[su:Sampling (statistika)]]
[[fi:Otanta]]
-->
{{enWP|Sampling (statistics)}}

Revision as of 21:15, 6 September 2013



Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield some knowledge about a population of concern, especially for the purposes of statistical inference. Each observation measures one or more properties (weight, location, etc.) of an observable entity enumerated to distinguish objects or individuals. Survey weights often need to be applied to the data to adjust for the sample design. Results from probability theory and statistical theory are employed to guide practice.

The sampling process comprises several stages:

  • Defining the population of concern
  • Specifying a sampling frame, a set of items or events possible to measure
  • Specifying a sampling method for selecting items or events from the frame
  • Determining the sample size
  • Implementing the sampling plan
  • Sampling and data collecting
  • Reviewing the sampling process

Population definition

Successful statistical practice is based on focused problem definition. Typically, we seek to take action on some population, for example when a batch of material from production must be released to the customer or sentenced for scrap or rework.

Alternatively, we seek knowledge about the cause system of which the population is an outcome, for example when a researcher performs an experiment on rats with the intention of gaining insights into biochemistry that can be applied for the benefit of humans. In the latter case, the population of concern can be difficult to specify, as it is in the case of measuring some physical characteristic such as the electrical conductivity of copper.

In all cases, time spent making the population of concern precise is usually well spent, because doing so raises many issues, ambiguities and questions that would otherwise have been overlooked at this stage.

Sampling frame

In the most straightforward case, such as the sentencing of a batch of material from production (acceptance sampling by lots), it is possible to identify and measure every single item in the population and to include any one of them in our sample. However, in the more general case this is not possible. There is no way to identify all rats in the set of all rats. There is no way to identify every voter at a forthcoming election (in advance of the election).

Such imprecise populations are not amenable to sampling in any of the ways described below, and statistical theory cannot be applied to them directly.

As a remedy, we seek a sampling frame which has the property that we can identify every single element and include any in our sample. For example, in an opinion poll, possible sampling frames include:

  • Electoral register
  • Telephone directory
  • Shoppers in Anytown, High Street on the Monday afternoon before the election.

The sampling frame must be representative of the population and this is a question outside the scope of statistical theory demanding the judgment of experts in the particular subject matter being studied. All the above frames omit some people who will vote at the next election and contain some people who will not. People not in the frame have no prospect of being sampled. Statistical theory tells us about the uncertainties in extrapolating from a sample to the frame. In extrapolating from frame to population, its role is motivational and suggestive.

There is, however, a strong but unnoticed division of views about the acceptability of representative sampling across different domains of study. To the philosopher or doctor, the representative sampling procedure has no justification whatsoever because it is not how truth is pursued in philosophy. "To the scientist, however, representative sampling is the only justified procedure for choosing individual objects for use as the basis of generalization, and is therefore usually the only acceptable basis for ascertaining truth." (Andrew A. Marino) [1]. It is important to understand this difference to steer clear of confusing prescriptions found in many web pages.

In defining the frame, practical, economic, ethical, and technical issues need to be addressed. The need to obtain timely results may prevent extending the frame far into the future.

The difficulties can be extreme when the population and frame are disjoint. This is a particular problem in forecasting, where inferences about the future are made from historical data. In fact, in 1703, when Jacob Bernoulli proposed to Gottfried Leibniz the possibility of using historical mortality data to predict the probability of early death of a living man, Leibniz recognized the problem in replying:

"Nature has established patterns originating in the return of events but only for the most part. New illnesses flood the human race, so that no matter how many experiments you have done on corpses, you have not thereby imposed a limit on the nature of events so that in the future they could not vary."

Having established the frame, there are a number of ways for organizing it to improve efficiency and effectiveness.

It is at this stage that the researcher should decide whether the sample is in fact to be the whole population and would therefore be a census.

Sampling method

Within any of the types of frame identified above, a variety of sampling methods can be employed, individually or in combination. Sampling methods are divided into two categories: probability sampling and nonprobability sampling.

Probability sampling includes simple random sampling, systematic sampling, stratified sampling, and cluster or multistage sampling.
Nonprobability sampling includes accidental sampling, quota sampling and purposive sampling.

Simple random sampling

In a simple random sample of a given size, all subsets of the frame of that size are given an equal probability of selection. Each element of the frame thus has an equal probability of selection: the frame is not subdivided or partitioned. Because selection is left entirely to chance, a particular sample may nevertheless fail to reflect the makeup of the frame.
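
As a minimal sketch of the idea, the following Python draws a simple random sample without replacement using the standard library. The function name `simple_random_sample`, the frame of 100 labelled units and the seed are hypothetical and serve only as an illustration.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct elements so that every size-n subset is equally likely."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(list(frame), n)  # sampling without replacement


# Hypothetical frame: 100 identifiable units.
frame = [f"unit-{i}" for i in range(1, 101)]
sample = simple_random_sample(frame, 10, seed=42)
print(sample)
```

Fixing the seed is useful for auditing a draw afterwards; in a real survey the seed itself should be chosen unpredictably.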

Systematic sampling

Selecting (say) every 10th name from the telephone directory is called an every 10th sample, which is an example of systematic sampling. Provided the starting point is chosen at random, it is a type of probability sampling, since every name has the same chance of selection. It is easy to implement and the stratification induced can make it efficient, but it is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple or factor of 10, then bias will result. It is important that the first name chosen is not simply the first in the list, but is chosen to be (say) the 7th, where 7 is a random integer in the range 1,...,10. Every 10th sampling is especially useful for efficient sampling from databases.
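
The procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `systematic_sample` and the numbered frame are hypothetical, and the random start is taken from the full range 1 to k so that every element has the same chance of selection.

```python
import random


def systematic_sample(frame, k, seed=None):
    """Take every k-th element of the frame after a random start."""
    rng = random.Random(seed)
    start = rng.randint(1, k)   # random start position in 1..k (inclusive)
    return frame[start - 1::k]  # convert the 1-based start to a 0-based index


# Hypothetical frame: entries numbered 1..100, sampled with interval 10.
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=7)
print(sample)
```

If the list has a periodic pattern related to the interval, every selected element falls at the same point of the cycle, which is the source of the bias described above.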

Stratified sampling

Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata." A sample is then selected from each "stratum" separately, producing a stratified sample. The two main reasons for using a stratified sampling design are (1) to ensure that particular groups within a population are adequately represented in the sample, and (2) to improve efficiency by gaining greater control over the composition of the sample. In the second case, major gains in efficiency (either lower sample sizes or higher precision) can be achieved by varying the sampling fraction from stratum to stratum. The sample size in each stratum is usually proportional to the relative size of that stratum. However, if variances differ significantly across strata, stratum sample sizes should be made proportional to the stratum size multiplied by the stratum standard deviation (Neyman allocation). Such disproportionate stratification can provide better precision than proportionate stratification. Typically, strata should be chosen to:

  • have means which differ substantially from one another
  • minimize variance within strata and maximize variance between strata.
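
The two allocation rules can be compared with a short Python sketch. It assumes the usual Neyman allocation, in which the stratum sample size is proportional to the stratum size times the stratum standard deviation; the stratum names, sizes and standard deviations are hypothetical, and the results are left unrounded.

```python
def proportional_allocation(n, sizes):
    """Allocate n in proportion to stratum sizes N_h."""
    total = sum(sizes.values())
    return {h: n * size / total for h, size in sizes.items()}


def neyman_allocation(n, sizes, std_devs):
    """Allocate n in proportion to N_h * S_h (stratum size times std. deviation)."""
    products = {h: sizes[h] * std_devs[h] for h in sizes}
    total = sum(products.values())
    return {h: n * products[h] / total for h in sizes}


# Hypothetical strata: a large homogeneous one and a small variable one.
sizes = {"urban": 8000, "rural": 2000}
std_devs = {"urban": 10.0, "rural": 30.0}

prop = proportional_allocation(500, sizes)
neyman = neyman_allocation(500, sizes, std_devs)
print(prop)
print(neyman)
```

In this example the more variable stratum receives a larger share under Neyman allocation than under proportional allocation. In practice the allocations would be rounded to whole units.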

Cluster sampling

Sometimes it is cheaper to 'cluster' the sample in some way e.g. by selecting respondents from certain areas only, or certain time-periods only. (Nearly all samples are in some sense 'clustered' in time - although this is rarely taken into account in the analysis.)

Cluster sampling is an example of 'two-stage sampling' or 'multistage sampling': in the first stage a sample of areas is chosen; in the second stage a sample of respondents within those areas is selected.

This can reduce travel and other administrative costs. It also means that one does not need a sampling frame for the entire population, but only for the selected clusters. Cluster sampling generally increases the variability of sample estimates above that of simple random sampling, depending on how the clusters differ between themselves, as compared with the within-cluster variation.
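
A two-stage draw of this kind might be sketched as follows. The function name `two_stage_sample`, the area and person labels, and the choice of simple random sampling at both stages are assumptions made for illustration.

```python
import random


def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters. Stage 2: sample members within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    result = {}
    for name in chosen:
        members = clusters[name]
        take = min(n_per_cluster, len(members))  # small clusters are taken whole
        result[name] = rng.sample(members, take)
    return result


# Hypothetical frame: 8 areas of 20 people each. Only the selected
# areas need a full list of members.
clusters = {
    f"area-{a}": [f"area-{a}/person-{i}" for i in range(20)]
    for a in range(8)
}
sample = two_stage_sample(clusters, n_clusters=3, n_per_cluster=5, seed=1)
for area, people in sample.items():
    print(area, people)
```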

Matched random sampling

A method of assigning participants to groups in which pairs of participants are first matched on some characteristic and then individually assigned randomly to groups. (Brown, Cozby, Kee, & Worden, 1999, p.371).

Matched random sampling arises in two contexts:

a) Two samples in which the members are clearly paired, or are matched explicitly by the researcher. For example, IQ measurements on pairs of identical twins.

b) Those samples in which the same attribute, or variable, is measured twice on each subject, under different circumstances. Commonly called repeated measures. Examples include the times of a group of athletes for 1500m before and after a week of special training; the milk yields of cows before and after being fed a particular diet.
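
The definition at the start of this section (match pairs, then assign at random within each pair) can be sketched in Python. The matching rule used here, pairing neighbours after sorting on the matching variable, is an assumption chosen for simplicity, and the participants and scores are hypothetical.

```python
import random


def matched_random_assignment(participants, key, seed=None):
    """Pair participants with similar key values, then randomize within pairs."""
    rng = random.Random(seed)
    ordered = sorted(participants, key=key)
    group_a, group_b = [], []
    # Walk through adjacent pairs; with an odd count the last person is left out.
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i], ordered[i + 1]]
        rng.shuffle(pair)  # a coin flip decides who goes to which group
        group_a.append(pair[0])
        group_b.append(pair[1])
    return group_a, group_b


# Hypothetical participants as (name, IQ score) tuples.
participants = [
    ("Ann", 98), ("Bob", 121), ("Cat", 100),
    ("Dan", 119), ("Eve", 110), ("Fay", 111),
]
group_a, group_b = matched_random_assignment(
    participants, key=lambda p: p[1], seed=3
)
print(group_a)
print(group_b)
```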

Quota sampling

In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60.

It is this second step which makes the technique one of non-probability sampling. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This non-random element is its greatest weakness, and quota versus probability sampling has been a matter of controversy for many years.
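
The mechanics of filling quotas can be illustrated with a small Python sketch. It assumes the interviewer simply accepts people in the order they are encountered until each quota is full; the function name, the people and the quotas are hypothetical. Nothing in the procedure is random, which is why probability theory cannot be applied to the result.

```python
def quota_sample(stream, quotas, category):
    """Accept people in encounter order until every quota is filled."""
    remaining = dict(quotas)
    chosen = []
    for person in stream:
        group = category(person)
        if remaining.get(group, 0) > 0:
            chosen.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):  # all quotas filled
            break
    return chosen


# Hypothetical passers-by, in the order the interviewer meets them.
stream = [
    ("p1", "male"), ("p2", "male"), ("p3", "female"), ("p4", "male"),
    ("p5", "male"), ("p6", "female"), ("p7", "female"),
]
chosen = quota_sample(stream, {"female": 2, "male": 3}, category=lambda p: p[1])
print([name for name, _ in chosen])
```

Whoever happens to be met first is selected, so people who are harder to encounter have little or no chance of inclusion.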

Mechanical sampling

Mechanical sampling is typically used in sampling solids, liquids and gases, using devices such as grabs, scoops, thief probes, the COLIWASA and riffle splitter.

Care is needed in ensuring that the sample is representative of the frame. Much work in this area was developed by Pierre Gy.

Convenience sampling

Sometimes called grab or opportunity sampling, this is the method of choosing items arbitrarily and in an unstructured manner from the frame. Though almost impossible to treat rigorously, it is the method most commonly employed in many practical situations. In social science research, snowball sampling is a similar technique, where existing study subjects are used to recruit more subjects into the sample.

Line-intercept sampling

Line-intercept sampling is a method of sampling elements in a region whereby an element is sampled if a chosen line segment, called a “transect”, intersects the element.

Other types

  • Accidental sample
  • Cluster sample
  • Opportunity sample
  • Random digit dialling
  • Probability sample
  • Self-selected sample

Sample size

Where the frame and population are identical, statistical theory yields exact recommendations on sample size.[1] However, where it is not straightforward to define a frame representative of the population, it is more important to understand the cause system of which the population is an outcome and to ensure that all sources of variation are embraced in the frame. A large number of observations is of no value if major sources of variation are neglected in the study. In other words, the sample group should match the category being surveyed as well as being easy to survey. Bartlett, Kotrlik, and Higgins (2001), in their paper "Organizational research: Determining appropriate sample size for survey research" in the Information Technology, Learning, and Performance Journal, explain Cochran's (1977) formulas. The paper discusses and illustrates sample size formulas, including the formula for adjusting the sample size for smaller populations, and provides a table that can be used to select the sample size for a research problem based on three alpha levels and a set error rate.
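
As a hedged illustration, one common form of Cochran's sample size formula for a proportion, together with a finite population correction, can be written in Python as below. The default values (p = 0.5, a 5% margin of error, z = 1.96 for roughly 95% confidence) and the population of 2000 are assumptions for the example, not values taken from the cited paper.

```python
import math


def cochran_sample_size(p=0.5, margin=0.05, z=1.96):
    """Initial sample size for estimating a proportion: n0 = z^2 * p * (1 - p) / e^2."""
    return (z ** 2) * p * (1 - p) / margin ** 2


def finite_population_correction(n0, population):
    """Adjust n0 for a small population: n = n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1 + (n0 - 1) / population)


n0 = cochran_sample_size()                  # about 384.16
n = finite_population_correction(n0, 2000)  # smaller than n0
required = math.ceil(n)                     # round up to whole respondents
print(round(n0, 2), required)
```

The correction matters only when the initial sample size is a noticeable fraction of the population; for very large populations the two values are nearly equal.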

Types of data

Categorical and numerical

There are two types of random variables: categorical and numerical. Categorical random variables yield responses such as 'yes' or 'no'. Categorical variables can yield more than two possible responses. For example: 'Which day of the week are you most likely to wash clothes?' Numerical random variables yield numerical responses, such as your height in centimeters.

There are two types of numerical variables: discrete and continuous. Discrete random variables produce numerical responses from a counting process. An example is 'how many times do you visit the cash machine in a typical month?' Continuous random variables produce responses from a measuring process. Height is an example of a continuous variable because the response takes on a value from an interval. Limited precision of the measurement instrument(s) may lead to tied observations. A tied observation occurs when the measuring device is not sensitive or sophisticated enough to detect incremental differences in the experimental or survey data.

Generally, a continuous random variable requires fewer observations than a discrete random variable. This can be justified by referring to the central limit theorem.

Sampling and data collection

Good data collection involves:

  • Following the defined sampling process
  • Keeping the data in time order
  • Noting comments and other contextual events
  • Recording non-responses

Most sampling books and papers written by non-statisticians focus only on the data collection aspect, which is just a small part of the sampling process.

Review of sampling process

After sampling, a review should be held of the exact process followed in sampling, rather than that intended, in order to study any effects that any divergences might have on subsequent analysis. A particular problem is that of non-responses.

Non-response

In survey sampling, many of the individuals identified as part of the sample may be unwilling to participate or impossible to contact. In this case, there is a risk of differences between (say) the willing and unwilling, leading to selection bias in conclusions. This is often addressed by follow-up studies which make a repeated attempt to contact the unresponsive and to characterize their similarities and differences with the rest of the frame. The effects can also be mitigated by weighting the data when population benchmarks are available. Non-response is particularly a problem in internet sampling. One of the main reasons for this could be that people may hold multiple e-mail addresses, some of which they no longer use or do not check regularly.

Survey weights

In many situations the sampling fraction may be varied by stratum, and the data will have to be weighted to correctly represent the population. For example, a simple random sample of individuals in the United Kingdom might include some in remote Scottish islands who would be inordinately expensive to sample. A cheaper method would be to use a stratified sample with urban and rural strata. The rural stratum could be under-represented in the sample, but weighted up appropriately in the analysis to compensate.

More generally, data should usually be weighted if the sample design does not give each individual an equal chance of being selected. For instance, when households have equal selection probabilities but one person is interviewed from within each household, this gives people from large households a smaller chance of being interviewed. This can be accounted for using survey weights. Similarly, households with more than one telephone line have a greater chance of being selected in a random digit dialing sample, and weights can adjust for this.

Weights can also serve other purposes, such as helping to correct for non-response.
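
A minimal sketch of design weighting in Python follows, assuming each respondent's weight is the inverse of their selection probability. The values and probabilities are hypothetical: four urban respondents sampled at 1 in 100 and one rural respondent sampled at 1 in 400.

```python
def design_weights(selection_probabilities):
    """Weight each respondent by the inverse of their selection probability."""
    return [1.0 / p for p in selection_probabilities]


def weighted_mean(values, weights):
    """Mean of the values, with each value counted according to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical responses: four urban respondents, then one rural respondent.
values = [10.0, 10.0, 10.0, 10.0, 30.0]
probabilities = [0.01, 0.01, 0.01, 0.01, 0.0025]

weights = design_weights(probabilities)
unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(unweighted, weighted)
```

The under-sampled rural respondent stands for four times as many people as each urban respondent, so the weighted mean moves towards the rural value. The same idea applies to the household example: a person chosen from a household of size k has 1/k of the household's selection probability and therefore k times its weight.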

History

Random sampling by using lots is an old idea, mentioned several times in the Bible. In 1786 Pierre Simon Laplace estimated the population of France by using a sample, along with a ratio estimator. He also computed probabilistic estimates of the error. These were not expressed as modern confidence intervals but as the sample size that would be needed to achieve a particular upper bound on the sampling error with probability 1000/1001. His estimates used Bayes' theorem with a uniform prior probability and assumed that his sample was random. The theory of small-sample statistics developed by William Sealy Gosset put the subject on a more rigorous basis in the 20th century. However, the importance of random sampling was not universally appreciated, and in the USA the 1936 Literary Digest prediction of a Republican win in the presidential election went badly awry, due to severe bias. A sample of well over a million respondents was obtained through magazine subscription lists and telephone directories. It was not appreciated that these lists were heavily biased towards Republicans, and the resulting sample, though very large, was deeply flawed.

See also

External links

Notes

  1. Mathematical details are displayed in the Sample size article.

References

  • Brown, K.W., Cozby, P.C., Kee, D.W., & Worden, P.E. (1999). Research Methods in Human Development, 2d ed. Mountain View, CA: Mayfield. ISBN 1-55934-875-5
  • Bartlett, J. E., II, Kotrlik, J. W., & Higgins, C. (2001). Organizational research: Determining appropriate sample size for survey research. Information Technology, Learning, and Performance Journal, 19(1) 43-50.
  • Chambers, R L, and Skinner, C J (editors) (2003), Analysis of Survey Data, Wiley, ISBN 0-471-89987-9
  • Cochran, W G (1977) Sampling Techniques, Wiley, ISBN 0-471-16240-X
  • Deming, W E (1975) On probability as a basis for action, The American Statistician, 29(4), pp146-152.
  • Flyvbjerg, B (2006) "Five Misunderstandings About Case Study Research." Qualitative Inquiry, vol. 12, no. 2, April 2006, pp. 219-245. [2]
  • Gy, P (1992) Sampling of Heterogeneous and Dynamic Material Systems: Theories of Heterogeneity, Sampling and Homogenizing
  • Kish, L (1995) Survey Sampling, Wiley, ISBN 0-471-10949-5
  • Korn, E L, and Graubard, B I (1999) Analysis of Health Surveys, Wiley, ISBN 0-471-13773-1
  • Lohr, S L (1999) Sampling: Design and Analysis, Duxbury, ISBN 0-534-35361-4
  • Särndal, Swensson, and Wretman (1992), Model Assisted Survey Sampling, Springer-Verlag, ISBN 0-387-40620-4
  • Stuart, Alan (1962) Basic Ideas of Scientific Sampling, Hafner Publishing Company, New York
  • ASTM E105 Standard Practice for Probability Sampling Of Materials
  • ASTM E122 Standard Practice for Calculating Sample Size to Estimate, With a Specified Tolerable Error, the Average for Characteristic of a Lot or Process
  • ASTM E141 Standard Practice for Acceptance of Evidence Based on the Results of Probability Sampling
  • ASTM E1402 Standard Terminology Relating to Sampling
  • ASTM E1994 Standard Practice for Use of Process Oriented AOQL and LTPD Sampling Plans
  • ASTM E2234 Standard Practice for Sampling a Stream of Product by Attributes Indexed by AQL



This page uses Creative Commons Licensed content from Wikipedia (view authors).