Overview of Computer Intensive Statistical Inference Procedures
P. Adam Kelly
Ph. D. Student
Educational Research - Measurement & Statistics
Florida State University
February, 1999
Resampling procedures, also commonly referred to as computer intensive statistical inference procedures, may be used to assess the significance of a statistic in a hypothesis test or to determine the lower and upper bounds for a confidence interval when the usual assumptions of parametric statistical procedures are not met (Manly, 1991). Computer intensive procedures require the recomputation of hundreds or thousands of artificially constructed data sets. Like other nonparametric statistical procedures, these procedures existed as theory on paper long before they were brought into the practical mainstream. The Monte Carlo method of resampling, for example, was introduced by Barnard in 1963 (Noreen, 1989), but at that time could only be illustrated and implemented operationally on very small sample sizes.
However, with the advent of fast, inexpensive computing, essentially since around 1990, the use of computer intensive procedures has grown dramatically, particularly in the area of basic academic research. Actually, with the widespread availability of powerful personal computers and statistical software that even brings resampling-type methods right into the home, the name computer intensive seems today to be as anachronistic as it was descriptive just a few years ago.
Computer intensive procedures are for probability estimation; that is, they are used to calculate p-values for test statistics or lower and upper bounds for confidence intervals without relying on classical inferential assumptions like normality of the sampling distribution and the Central Limit Theorem (Noreen, 1989). The taxonomy of computer intensive methods is difficult to define, primarily for three reasons: 1) there are many subtly different ways to perform each of the methods, with each way leading to slightly different results and interpretations; 2) each theoretician that has contributed to the literature on computer intensive procedures has a name for his procedure, and sees unique links to the other procedures, and 3) asymptotically, all the procedures are forms of each other and of the permutation test (Noreen, 1989). Manly (1991) and Noreen (1989) concur on a taxonomy that divides computer intensive procedures into two related yet unique streams: randomization methods and Monte Carlo methods.
Randomization Methods
Randomization methods, which use resampling without replacement from a target population of interest, test the null hypothesis that two variables are unrelated. In this context, the term "randomization" refers to the process of random recombination of the data simulating random assignment of the study subjects, not random sampling of study subjects. In fact, randomization methods of statistical inference work well with either random or nonrandom samples (Edgington, 1987). Two randomization methods are presented: the permutation test and the randomization test.
Permutation Test
The exhaustive form of randomization is the permutation test, also known as the "exact randomization test," in which every possible combination of the drawn data is listed (i.e., an "exact" probability distribution is obtained) and the probability of the given data is determined exactly by the percentile position of the given data on the list of all possible combinations. All computer intensive procedures are said to be asymptotically equivalent to the permutation test (Good, 1994). In fact, a permutation test is, by definition, a fully adequate substitute for any other parametric or nonparametric test (Ludbrook & Dudley, 1998). Unfortunately, the sheer number of recombinations of a data set are so numerous that performing a permutation test on all but the smallest of data sets is computationally impractical.
Randomization Test
Since exhaustive listing is at least impractical and at most impossible for large sets of data, the "approximate randomization test," or "randomization test" for short, is widely substituted. It is an asymptotic approximation of the permutation test that works well for virtually any data set as long as a sufficiently large number of all the possible combinations are listed (Noreen, 1989). Manly (1991) lists the steps in performing a randomization test:
The appropriate number of resamplings N to use is a matter of subjective judgment; Manly (1991) suggests 5,000, while Noreen (1989) provides illustrations using 1,000 or less, and Mooney & Duval (1993) advocate using 50-200 for estimating standard errors but at least 1,000 to estimate confidence intervals. The researcher should use personal familiarity with the population of interest to determine whether a smaller number of resamplings properly represents the population, and weigh the decision against the cost of performing additional resamplings (Westfall & Young, 1993). Of course, as computation costs have declined dramatically in recent years, the marginal cost of increasing N has become negligible, making the entire issue somewhat moot today.
Manly (1991) goes on to explain that "typical" or "unusual" would be quantified by the percentile position of D0 on the estimated randomization distribution. For example, if D0 were found to be at the 95th percentile on the estimated randomization distribution, a p-value of .05 would be assigned. [If this were a permutation test, the randomization distribution would be exact, not an estimate, and the p-value would correspond to the exact location of the data on the distribution.] Manly (1991) presents results from randomization tests done by Popham & Manly that demonstrate the asymptotic approximation of the randomization test to both the permutation test and the t-test.
Computation of an associated confidence interval involves two simple steps. First, subtract D0, an estimate of the "true" difference m D, from each observation in the hypothesized higher group. This in theory "equalizes" all observations. Second, construct a new estimated randomization distribution of A - (B – D0) by resampling, say, 5000 times. Third, on this new distribution, identify DL and DU corresponding to the 95th and 5th percentiles, respectively. These are the lower and upper limits, respectively, of a 95% confidence interval on m D. It should be noted that, unlike confidence intervals derived from parametric methods, there is no requirement that a randomization confidence interval be symmetric about the point estimate, although many are.
Randomization methods are considered to be "uniformly most powerful" tests; that is, for a given set of data and significance level and given N repetitions sufficiently large, they provide as much or more statistical power than any other nonparametric or parametric tests (Good, 1994). Manly (1991) explains that this is due to the "exact enumeration" of the extremeness of actual data values occurring that a randomization method provides, as compared to the probability of the rank of extremeness of a data set provided by other nonparametric tests, or an assumed distributional-form estimation of extremeness provided by a parametric test.
Monte Carlo Methods
Monte Carlo methods are most often used in simulation studies on computer-generated data to show how p-values for a statistical test or estimation method functions when no convenient real data exist. Monte Carlo methods are used to make inferences about the population from which a sample has been drawn. The Monte Carlo methods presented here are Monte Carlo estimation, bootstrapping, the jackknife, and Markov Chain Monte Carlo estimation.
It’s important to note that Monte Carlo refers to the type of resampling process used to produce a probability estimate, not the act of generating a data set. That is, it is possible to computer-generate a data set, and numerous recombinations of it, without it being a Monte Carlo procedure. For example, Hambleton, Swaminathan, and Rogers (1991) identify a procedure for detecting differential item functioning in IRT-calibrated test items that uses one or more simulations of test data to generate sets of item and examinee ability parameters for comparison, but this is not truly a Monte Carlo procedure. On the other hand, Yen (1986), in assessing the distributional qualities of Thurstonian absolute scaling, presents a method for generating two simulation samples of data drawn at random from an assumed population. In this case, even though only two samples are drawn at a time, this is, according to Noreen (1989), considered a Monte Carlo method.
Monte Carlo Estimation
The "original" Monte Carlo estimation method, as introduced by Barnard in 1963 (Noreen, 1989), is used to test the hypothesis that the data are a random sample from a specified population. This method requires that a computational model of the specified population be available, so that simulated random samples can be generated for use in computing the test statistic(s) of interest (Manly, 1991). For example, to use Monte Carlo estimation on a sample of high school mathematics test scores, it would first be necessary to create a model for the true population distribution of test scores. As Noreen (1989) points out, "the Monte Carlo estimation method is particularly valuable in situations where the population distribution is known, but the sampling distribution of a statistic has not been analytically derived."
The steps for performing a Monte Carlo procedure, as given by Noreen (1989), are:
The procedure for establishing a confidence interval for the "real" statistic in a Monte Carlo procedure is analogous to that used in a randomization test, as described above. As with randomization confidence intervals, a Monte Carlo confidence interval need not be symmetrical.
Bootstrapping
Bootstrapping is a special case of Monte Carlo estimation (Efron, 1982). Bootstrapping procedures use resampling with replacement from an already-drawn sample. Bootstrapping is used most often to approximate standard errors and associated p-values on estimates of population parameters when the sampling distribution of the target population is either indeterminate or difficult to obtain empirically.
Efron & Tibshirani (1993) provide the generic algorithm for performing a bootstrapping procedure as follows:
An estimate of the bias of the statistic of interest is obtained simply by subtracting the mean bootstrap statistic from the original sample statistic. While this bias estimate may be useful as a descriptive tool for readers (Efron, 1982), a problem arises from "adjusting out" the bias: "the bootstrap bias estimator from a single sample contains an indeterminate amount of random variability along with bias, and this may artificially inflate the mean squared error of the statistic" (Mooney & Duval, 1993).
A shortcoming of bootstrapping is that all methods for estimating bootstrap confidence intervals rely to some degree on either the normal or t-distribution (Efron & Tibshirani, 1993). For N reasonably large, however, this should not pose a problem, even for relatively small sample sizes (Mooney & Duval, 1993), although no cited studies have shown the behavior of such confidence intervals on extremely small, nonnormal data sets. The various procedures for establishing a confidence interval around the "real" statistic all use some form of point estimate plus or minus a normal (or t) variate times the bootstrap standard error.
There are a variety of bootstrapping methods available; the shift method and the normal approximation method are two popular methods (Noreen, 1989). According to Noreen, both of these methods are frequently used "for estimating significance levels based on the bootstrap sampling distribution … the ‘shift’ method assumes that the bootstrap sampling distribution and the null hypothesis sampling distribution have the same shape but different distributions… the ‘normal approximation’ method assumes that the null hypothesis sampling distribution is normal; the bootstrap sampling distribution is used only to estimate the variance of the normal distribution."
In the literature on computer intensive methods, bootstrapping methods are considered somewhat speculative (Mooney & Duval, 1993). Bootstrap-derived probability estimates of population parameters are highly sample-sensitive as well as sample size-sensitive and, in contrast to Monte Carlo methods, "bootstrapping relies on an analogy between the sample and the population from which the sample is drawn." Monte Carlo studies (!) on simulated highly nonnormal distributions have shown bootstrap estimates to be discomfortingly liberal with respect to Type I error (Mooney & Duval, 1993). Currently, bootstrapping is most commonly used to estimate population variances in the absence of conventional parametric estimation assumptions (Noreen, 1989).
The "Jackknife"
First presented by Tukey in 1958, the jackknife is a special case of the bootstrap. The procedure given by Mooney & Duval (1993) for performing a jackknife is:
Since the jackknife estimate of q is second-order unbiased, an estimate of the first-order bias of q hat is simply the difference of q hat and the jackknife estimate, similar to the bias estimate from bootstrapping. Yet while the jackknife is intuitive and relatively easy to perform on a computer, it also has several severe weaknesses. Like bootstrap estimates, jackknife estimates are purely sample-referenced; inference to a population is at best speculative. However, jackknife estimates "fail for markedly nonlinear statistics such as the sample median, unlike bootstrap estimates" (Efron ,1982).
Since bootstrapping is considered a more general statistical procedure and works at least as well as the jackknife in most situations, the jackknife is generally only of historical interest today. An exception to this is in the area of identifying influential cases or strata in a statistical model. Here, "the case subgroup or stratum pseudovalue can be used to determine whether the subgroup or stratum has a greater-than-average effect on the overall parameter estimate than other subgroups" (Mooney & Duval, 1993). This useful by-product of jackknife estimation remains popular.
Markov Chain Monte Carlo Estimation
During the past decade, there have been interesting new applications of the Monte Carlo framework. The most popular of these, judging from the literature, is Markov Chain Monte Carlo estimation, or "MCMC," which combines the traditional Monte Carlo method with Bayesian inference. Essentially, the assumed prior sampling distribution of the population parameter becomes the "population" from which the Monte Carlo procedure draws many random samples. Both the sampling and the "update" of the assumed distribution continue iteratively until the parameter estimates converge. This procedure is said to provide superior information in the case in which the sampling distribution of the true population of interest is unknown but a "starting point," or prior distribution, is reasonable (Draper, 1995).
Applications of MCMC currently popular in the literature include Gibbs sampling and data augmentation. Both of these have been used in the context of hierarchical modeling in the social sciences. Gibbs sampling has been particularly popular in recent years, as interest in modeling multilevel social phenomena has unearthed the problem of estimating level-wise parameters when marginal posterior distributions of those parameters are unknown. This problem is elegantly resolved by use of Gibbs sampling, in which Monte Carlo estimation is performed on conditional posterior distributions of known form (e.g., "known" normals, as in standard scores). Plots of Monte Carlo-estimated values then reveal the contours of the unknown marginal posterior and joint posterior distributions of the parameters of interest, and descriptions and inferences can be presented (Seltzer, 1993). For a complete introduction to data augmentation, I refer the reader to Tanner & Wong (1987); for Gibbs sampling, Gelfand & Smith (1990).
A Selection of Contemporary Examples of Monte Carlo and MCMC Implementation
By far the most common appearance of computer intensive methods recently in the literature is in the form of Monte Carlo and MCMC studies. These studies, in turn, have dealt almost exclusively with a variety of methods for estimating the bias of statistical estimators. I have selected several examples of the following: item parameter estimation in IRT; assessment of meta-analytic procedures; and assessment of multivariate modeling, including hierarchical modeling, factor analysis, covariance analysis, and structural equation modeling.
IRT
Harwell (1997) illustrates how Monte Carlo studies are used in IRT research to describe the properties of new parameter estimation procedures, to compare item analysis programs, and to study the effects of multidimensional data on parameter estimation. A criticism that Harwell makes is that Monte Carlo studies in IRT, although designed to reasonably simulate the complexities of real test data , fail to bring results together in a way that clarifies findings of the studies. Harwell suggests that researchers conceptualize Monte Carlo studies as statistical sampling experiments, complete with experimental designs such as fixed and random effects, assessment of threats to internal and external validity, and clear definition of simulation methods that facilitates replication of a study.
Akkermans (1994) applies MCMC to approximate the conditioning constants for conditional maximum likelihood estimates of both item and examinee parameters in the Rasch model. Using assumed prior distributions of item and examinee parameters, Akkermans generates convergent conditioning means that serve well as substitutes for conditioning constants obtained in the usual way. Akkermans points out that, due to the invariance of the conditioning means generated by MCMC, this procedure should work equally well with any maximum likelihood item response model.
Meta-Analysis
Harwell (1990) provides an overview of many types of Monte Carlo studies done for the purpose of illustrating the performance of certain statistical methods, such as single- and multifactor ANOVA and ANCOVA, multiple regression, and hierarchical models. Harwell reviews methods employed for synthesizing Monte Carlo studies, and criticizes the way Monte Carlo results are conducted without regard for some "overarching theory" to guide interpretation. He presents a five-step strategy for summarizing Monte Carlo results that includes specific problem formulation, data design and collection data evaluation, analysis and interpretation, and presentation of results. In a later article (1992), Harwell embellishes this strategy specifically for one- and two-factor fixed effects ANOVA.
Cornwell & Ladd (1993) use a Monte Carlo procedure to shed light on the efficacy of the Schmidt & Hunter meta-analytic procedures for combining independent correlation coefficients across studies. Similar to Harwell (1990, 1992), Cornwell & Ladd criticize existing Monte Carlo studies on the grounds that the studies (a) often simulate data that fails to take in to account the kinds of anomalies encountered in "real" data, and (b) are not accompanied by sufficient inferential information to allow the reader to make an informed decision about the models being tested. Using their Monte Carlo design and results, Cornwell & Ladd show how previous studies fail to take into account sample size and its effect on statistical power both in the number of correlations being meta-analyzed and in the number of observations used to determine each sample correlation. Prior to this study, this issue had not been addressed explicitly.
Harwell (1997) investigates the sensitivity of the Raudenbush meta-analytic method for variance heterogeneity to both platykurtosis of score distributions and the ratio of study sample size to the number of studies being meta-analyzed. Harwell uses a Monte Carlo design to estimate the effects of platykurtosis on Type I error rates, performing resampling on simulated platykurtic data for a large number N of random trials. He then varies sample size for the most platykurtic simulated samples, and reports the effects on Type I error and statistical power. Harwell claims that the results of the Monte Carlo studies presented are exhaustive enough of all the possibilities that the Raudenbush method should be ruled out from consideration for meta-analysis entirely.
Multivariate Modeling
Draper (1995) provides an extensive coverage of the use of hierarchical models (HM) in social science, particularly in education. He illustrates the relative advantages of using HM in educational settings, stating that "multi-level school analyses seem to have been waiting for HM to come along." Draper criticizes earlier studies that use traditional statistical methods, such as multiple regression and ANOVA, on clearly hierarchically nested data. He points out that, prior to implementing HM in education, researchers need to carefully consider design issues and inference-related interpretations of results so as not to mislead the body of knowledge in the process. Draper recommends the increased use of MCMC methods, and particularly Gibbs sampling, in place of the more common maximum likelihood estimation procedures for student- and school-level effects on HMs.
Seltzer (1993) draws the same conclusion about the use of Gibbs sampling with HMs, providing more detail on what goes wrong when, under the standard assumptions of normality in HM, "fat-tailed" data is simulated. Seltzer concludes that many MCMC studies may already exist that mislead the reader as to the precision of the HM primarily because proper treatment for platykurtic data has not been addressed.
Several other authors present Monte Carlo investigation studies for performance of statistical estimation procedures: Bacon (1995) for performance of correlational outlier identification methods over a variety of data distribution types; Bengt (1994) for testing of a Bayesian process for filling in missing data for covariance structure modeling; Wolins (1995) for comparing the speed and efficiency of maximum likelihood and unweighted least squares estimation procedures in factor analysis, over many samples and with a variety of distributions represented; and Finch, et al. (1997) for examining bias in the estimation of indirect effects and their standard errors, using simulated skewed data, in structural equation models using maximum likelihood estimation.
These studies represent the type of research, almost entirely investigative in nature, that has arisen from the advances in Monte Carlo methods. While highly sophisticated and likely inaccessible beyond the abstract and discussion to those not intimately familiar with the procedures, these studies nevertheless convey the complex dimensionality and depth of inquiry now achievable using these new methods and a good computer.
Overall Assessment of Strengths and Weaknesses
In this section, I provide a summary of (my impressions of) the strengths and weaknesses of the computer intensive methods I have addressed.
Strengths
Weaknesses
Recommendations for Introduction into Graduate Coursework in Applied Statistics
Randomization methods: Definitely candidates for inclusion in a course, but NOT until a suitable computer package is available to make coverage of the methods useful. Hopefully, this should be available very soon; SPSS has been rumored to have a module for such methods now in beta testing.
Monte Carlo estimation: Should be introduced extensively but to advanced students only, preferably in a two-seminar sequence on advanced topics in statistical methods, along with a good general treatment of Bayesian methods.
Bootstrapping and the Jackknife: Could be covered briefly (probably just bootstrapping) in the advanced seminar suggested, with follow-up directed study for those with an interest to pursue it.
MCMC and variations: Still too new, complex, and context-specific for extensive coverage, even in the advanced seminar, although that may change rapidly; could be introduced in the seminar very generally; in the context of a HM seminar, would be more appropriate for a more extensive treatment.
References
Akkermans, W. M. W. (1994). Monte Carlo estimation of the conditional Rasch model. Research Report 94-09, Faculty of Educational Science and Technology, University of Twente, The Netherlands.
Bacon, D. R. (1995). A Maximum likelihood approach to correlational outlier identification. Multivariate Behavioral Research, 30(2), 125-148.
Cornwell, J. M., & Ladd, R. T. (1993). Power and accuracy of the Schmidt and Hunter meta-analytic procedures. Educational and Psychological Measurement, 53(4), 877-895.
Draper, D. (1995). Inference and hierarchical modeling in the social sciences. Journal of Educational and Behavioral Statistics, 20(2), 115-147.
Edgington, E. S. (1987). Randomization tests (2nd Ed.). New York: Marcel Dekker.
Efron, B. (1982). The Jackknife, the Bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied mathematics.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Finch, J. F., et al. (1997). Effects of sample size and nonnormality on the estimation of mediated effects in latent variable models. Structural Equation Modeling, 4(2), 87-107.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Good, P. (1994). Permutation tests: A practical guide to resampling methods for testing hypotheses. New York: Springer-Verlag New York.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Publications.
Harwell, M. R. (1997). An investigation of the Raudenbush (1988) test for studying variance homogeneity. Journal of Experimental Education, 65(2), 181-190.
Harwell, M. R. (1997). Analyzing the results of Monte Carlo studies in item response theory. Educational and Psychological Measurement, 57(2), 266-279.
Harwell, M. R. (1990). Summarizing Monte Carlo results in methodological research. Journal of Educational Statistics, 17(4), 297-313.
Harwell, M. R., Rubinstein, E.N., Hayes, W. S., & Olds, C. C. (1992). Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. Journal of Educational Statistics, 17(4), 315-339.
Ludbrook, J., & Dudley, H. (1998). Why permutation tests are superior to t and F tests in biomedical research. The American Statistician, 52(2), 127-132.
Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. London, U.K.: Chapman & Hall
Mooney, C. Z., & Duval, R. D. (1993). Bootstrapping: A nonparametric approach to statistical inference. Newbury Park, CA: Sage Publications.
Muthen, B. (1994). A simple approach to inference in covariance structure modeling with missing data: Bayesian analysis. Evaluative report; Project 2.4; available from ERIC: Identifier ED379321.
Noreen, E. W. (1989). Computer intensive methods for testing hypotheses: An introduction. New York: John Wiley & Sons.
Seltzer, M. H. (1993). Sensitivity analysis for fixed effects in the hierarchical model: A Gibbs sampling approach. Journal of Educational Statistics, 18(3), 207-235.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528-550.
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing. New York: John Wiley & Sons.
Wolins, L. (1995). A Monte Carlo study of constrained factor analysis using maximum likelihood and unweighted least squares. Educational and Psychological Measurement, 55(4), 545-557.
Yen, W.M. (1986). The choice of scale for educational measurement: An IRT perspective. Journal of Educational Measurement, 23(4), 299-325.