A commonly occurring problem in all kinds of studies is that of missing data. These missing values can occur for a number of reasons, including equipment malfunctions and, more typically, subjects recruited to a study not participating fully. In particular, in a longitudinal study, one or more of the repeated measurements on a subject might be missing. The way in which missing values are dealt with depends on the data analyst's experience with statistical techniques. The most common way in which data analysts proceed is to use the complete case analysis method, i.e. removing cases with missing values for any of the variables and running the analysis on the remaining cases. Although this method is very straightforward to implement and is used by the vast majority of data analysts, it can lead to biased results unless data are missing completely at random. Complete case analysis can dramatically reduce the sample size of the study, as only those cases for which all variables are measured are included in the analysis. Therefore the complete case analysis method is "not generally recommended" (Diggle et al., 2002). Alternative approaches to the complete case analysis method involve filling in (or imputing) values for the incomplete cases, making "more efficient use of the available data" (Schafer, 1997). The purpose of this thesis is to compare and contrast the results obtained from analysing the relationship between growth and feeding behaviour in the first year of life using the complete case analysis and three imputation methods: single hot-decking, multiple hot-decking and the EM algorithm. The data used in this research come from the Gateshead Millennium Study, a prospective study of a cohort of just over 1,000 babies. In practical terms, the purpose of the work is to confirm the conclusions from the published complete-case analysis.
It is of more theoretical interest to determine which imputation method is the most appropriate for dealing with missing data in this study. Chapter 1 provides an introduction to the problem of missing data and how they may arise, and a description of the Gateshead Millennium Study data, to which all the missing data methods will be applied. It concludes by giving the aims of this thesis. Chapter 2 provides an in-depth review of various missing data approaches and indicates which characteristics of the missing data have to be considered in order to determine which of these approaches can be employed to deal with the missing values. Also in Chapter 2, various aspects of the Gateshead Millennium Study data are reviewed. Measures of growth and feeding behaviour in the first year of life are described as these are important variables in the published analysis. Chapter 3 assesses how complete the Gateshead Millennium Study data are by producing a detailed description of each of the questions in each of the questionnaires. This is achieved by examining the Wave Non-response, Section Non-response and Item Non-response for each of the six questionnaires. Chapter 4 recreates the results from the complete case analyses for the relationship between development of growth and feeding in the first year of life which have already been performed and published in the paper "How Does Maternal and Child Feeding Behaviour Relate to Weight Gain and Failure to Thrive? Data From a Prospective Birth Cohort" (Wright et al., 2006a). This chapter also gives insight as to whether or not it is appropriate to assume that the missing data mechanism is MCAR and therefore whether or not it is reasonable to believe the results obtained from the complete case analysis. Chapter 5 focusses on the various methods used to impute the missing values in the Gateshead Millennium Study data. This chapter begins by considering the EM Algorithm.
It gives details of how the EM Algorithm was performed and the results obtained. In addition to the EM Algorithm, this chapter also considers the procedures and results for Single Imputation and Multiple Imputation by hot-decking. This chapter concludes by comparing the results of these methods to one another and also to the complete case analysis results from Chapter 4. Finally, Chapter 6 provides a summary of the results from the various missing data methods applied and discusses various alternative methods which could also have been performed.
Item Type: Thesis (MSc(R))
Qualification Level: Masters
Additional Information: The questionnaires in Appendix A of this thesis are the intellectual property of the Gateshead Millennium Study Team.
Keywords: Missing Data, Missing Data Mechanisms, Complete Case Analysis, EM Algorithm, Hot-deck Imputation, Multiple Imputation, Gateshead Millennium Study
Supervisor's Name: McColl, Prof. John
Date of Award: 2010
Unique ID: glathesis:2010-2312
Copyright: Copyright of this thesis is held by the author.
Date Deposited: 05 Jan 2011
Last Modified: 10 Dec 2012 13:53
Indiana University-Bloomington, Bloomington, Indiana USA
The impact of missing data on quantitative research can be serious, leading to biased estimates of parameters, loss of information, decreased statistical power, increased standard errors, and weakened generalizability of findings. In this paper, we discussed and demonstrated three principled missing data methods: multiple imputation, full information maximum likelihood, and expectation-maximization algorithm, applied to a real-world data set. Results were contrasted with those obtained from the complete data set and from the listwise deletion method. The relative merits of each method are noted, along with common features they share. The paper concludes with an emphasis on the importance of statistical assumptions, and recommendations for researchers. Quality of research will be enhanced if (a) researchers explicitly acknowledge missing data problems and the conditions under which they occurred, (b) principled methods are employed to handle missing data, and (c) the appropriate treatment of missing data is incorporated into review standards of manuscripts submitted for publication.
Missing data are a rule rather than an exception in quantitative research. Enders ( 2003 ) stated that a missing rate of 15% to 20% was common in educational and psychological studies. Peng et al. ( 2006 ) surveyed quantitative studies published from 1998 to 2004 in 11 education and psychology journals. They found that 36% of studies had no missing data, 48% had missing data, and in about 16% the presence of missing data could not be determined. Among studies that showed evidence of missing data, 97% used the listwise deletion (LD) or the pairwise deletion (PD) method to deal with missing data. These two methods are ad hoc and notorious for biased and/or inefficient estimates in most situations ( Rubin 1987 ; Schafer 1997 ). The APA Task Force on Statistical Inference explicitly warned against their use ( Wilkinson and the Task Force on Statistical Inference 1999 p. 598). Newer and principled methods, such as the multiple-imputation (MI) method, the full information maximum likelihood (FIML) method, and the expectation-maximization (EM) method, take into consideration the conditions under which missing data occurred and provide better estimates for parameters than either LD or PD. Principled missing data methods do not replace a missing value directly; they combine available information from the observed data with statistical assumptions in order to estimate the population parameters and/or the missing data mechanism statistically.
A review of the quantitative studies published in Journal of Educational Psychology (JEP) between 2009 and 2010 revealed that, out of 68 articles that met our criteria for quantitative research, 46 (or 67.6%) articles explicitly acknowledged missing data, or were suspected to have some due to discrepancies between sample sizes and degrees of freedom. Eleven (or 16.2%) did not have missing data and the remaining 11 did not provide sufficient information to help us determine if missing data occurred. Of the 46 articles with missing data, 17 (or 37%) did not apply any method to deal with the missing data, 13 (or 28.3%) used LD or PD, 12 (or 26.1%) used FIML, four (or 8.7%) used EM, three (or 6.5%) used MI, and one (or 2.2%) used both the EM and the LD methods. Of the 29 articles that dealt with missing data, only two explained their rationale for using FIML and LD, respectively. One article misinterpreted FIML as an imputation method. Another was suspected to have used either LD or an imputation method to deal with attrition in a PISA data set ( OECD 2009 ; Williams and Williams 2010 ).
Compared with missing data treatments by articles published in JEP between 1998 and 2004 ( Table 3.1 in Peng et al. 2006 ), there has been improvement in the decreased use of LD (from 80.7% down to 21.7%) and PD (from 17.3% down to 6.5%), and an increased use of FIML (from 0% up to 26.1%), EM (from 1.0% up to 8.7%), or MI (from 0% up to 6.5%). Yet several research practices still prevailed from a decade ago, namely, not explicitly acknowledging the presence of missing data, not describing the particular approach used in dealing with missing data, and not testing assumptions associated with missing data methods. These findings suggest that researchers in educational psychology have not fully embraced principled missing data methods in research.
Although treating missing data is usually not the focus of a substantive study, failing to do so properly causes serious problems. First, missing data can introduce potential bias in parameter estimation and weaken the generalizability of the results ( Rubin 1987 ; Schafer 1997 ). Second, ignoring cases with missing data leads to the loss of information which in turn decreases statistical power and increases standard errors ( Peng et al. 2006 ). Finally, most statistical procedures are designed for complete data ( Schafer and Graham 2002 ). Before a data set with missing values can be analyzed by these statistical procedures, it needs to be edited in some way into a “complete” data set. Failing to edit the data properly can make the data unsuitable for a statistical procedure and the statistical analyses vulnerable to violations of assumptions.
Because of the prevalence of the missing data problem and the threats it poses to statistical inferences, this paper is interested in promoting three principled methods, namely, MI, FIML, and EM, by illustrating these methods with an empirical data set and discussing issues surrounding their applications. Each method is demonstrated using SAS 9.3. Results are contrasted with those obtained from the complete data set and the LD method. The relative merits of each method are noted, along with common features they share. The paper concludes with an emphasis on assumptions associated with these principled methods and recommendations for researchers. The remainder of this paper is divided into the following sections: (1) Terminology, (2) Multiple Imputation (MI), (3) Full Information Maximum-Likelihood (FIML), (4) Expectation-Maximization (EM) Algorithm, (5) Demonstration, (6) Results, and (7) Discussion.
Missing data occur at two levels: at the unit level or at the item level. A unit-level non-response occurs when no information is collected from a respondent. For example, a respondent may refuse to take a survey, or does not show up for the survey. While the unit non-response is an important and common problem to tackle, it is not the focus of this paper. This paper focuses on the problem of item non-response . An item non-response refers to the incomplete information collected from a respondent. For example, a respondent may miss one or two questions on a survey, but answer the rest. The missing data problem at the item level needs to be tackled from three aspects: the proportion of missing data, the missing data mechanisms, and patterns of missing data. A researcher must address all three before choosing an appropriate procedure to deal with missing data. Each is discussed below.
The proportion of missing data is directly related to the quality of statistical inferences. Yet, there is no established cutoff from the literature regarding an acceptable percentage of missing data in a data set for valid statistical inferences. For example, Schafer ( 1999 ) asserted that a missing rate of 5% or less is inconsequential. Bennett ( 2001 ) maintained that statistical analysis is likely to be biased when more than 10% of data are missing. Furthermore, the amount of missing data is not the sole criterion by which a researcher assesses the missing data problem. Tabachnick and Fidell ( 2012 ) posited that the missing data mechanisms and the missing data patterns have greater impact on research results than does the proportion of missing data.
According to Rubin ( 1976 ), there are three mechanisms under which missing data can occur: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). To understand missing data mechanisms, we partition the data matrix Y into two parts: the observed part ( Y obs ) and the missing part ( Y mis ). Hence, Y = ( Y obs , Y mis ). Rubin ( 1976 ) defined MAR to be a condition in which the probability that data are missing depends only on the observed Y obs , but not on the missing Y mis , after controlling for Y obs . For example, suppose a researcher measures college students’ understanding of calculus in the beginning (pre-test) and at the end (post-test) of a calculus course. Let’s suppose that students who scored low on the pre-test are more likely to drop out of the course, hence, their scores on the post-test are missing. If we assume that the probability of missing the post-test depends only on scores on the pre-test, then the missing mechanism on the post-test is MAR. In other words, for students who have the same pre-test score, the probability of their missing the post-test is random. To state the definition of MAR formally, let R be a matrix of missingness with the same dimension as Y . The element of R is either 1 or 0, corresponding to Y being observed (coded as 1) or missing (coded as 0). If the distribution of R , written as P ( R | Y , ξ ), where ξ = missingness parameter, can be modeled as Equation 1 , then the missing condition is said to be MAR ( Schafer 1997 p. 11):
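The display equation referred to as Equation 1 did not survive extraction. Reconstructed from the surrounding definitions (the partition Y = ( Y obs , Y mis ) and the description in the next paragraph), the MAR condition of Schafer (1997, p. 11) reads:

```latex
P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \xi) \;=\; P(R \mid Y_{\mathrm{obs}}, \xi)
\tag{1}
```

That is, given the observed data and the missingness parameter ξ, the distribution of R does not depend on the missing part of Y.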
In other words, the probability of missingness depends on only the observed data and ξ. Furthermore, if (a) the missing data mechanism is MAR and (b) the parameter of the data model ( θ ) and the missingness parameter ξ are independent, the missing data mechanism is said to be ignorable ( Little and Rubin 2002 ). Since condition (b) is almost always true in real world settings, ignorability and MAR (together with MCAR) are sometimes viewed as equivalent ( Allison 2001 ).
Although many modern missing data methods (e.g., MI, FIML, EM) assume MAR, violation of this assumption should be expected in most cases ( Schafer and Graham 2002 ). Fortunately, research has shown that violation of the MAR assumption does not seriously distort parameter estimates ( Collins et al. 2001 ). Moreover, MAR is quite plausible when data are missing by design. Examples of missing by design include the use of multiple booklets in large scale assessment, longitudinal studies that measure a subsample at each time point, and latent variable analysis in which the latent variable is missing with a probability of 1, therefore, the missing probability is independent of all other variables.
MCAR is a special case of MAR. It is a missing data condition in which the likelihood of missingness depends neither on the observed data Y obs , nor on the missing data Y mis . Under this condition, the distribution of R is modeled as follows:
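The display equation for MCAR was dropped here as well. Following the notation of Equation 1, under MCAR the distribution of R depends on neither Y obs nor Y mis:

```latex
P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \xi) \;=\; P(R \mid \xi)
\tag{2}
```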
If missing data meet the MCAR assumption, they can be viewed as a random sample of the complete data. Consequently, ignoring missing data under MCAR will not introduce bias, but will increase the SE of the sample estimates due to the reduced sample size. Thus, MCAR poses less threat to statistical inferences than MAR or MNAR.
The third missing data mechanism is MNAR. It occurs when the probability of missing depends on the missing value itself. For example, missing data on the income variable is likely to be MNAR, if high income earners are more inclined to withhold this information than average- or low-income earners. In case of MNAR, the missing mechanism must be specified by the researcher, and incorporated into data analysis in order to produce unbiased parameter estimates. This is a formidable task not required by MAR or MCAR.
The three missing data methods discussed in this paper are applicable under either the MCAR or the MAR condition, but not under MNAR. It is worth noting that including variables in the statistical inferential process that could explain missingness makes the MAR condition more plausible. Return to the college students’ achievement in a calculus course for example. If the researcher did not collect students’ achievement data on the pre-test, the missingness on the post-test is not MAR, because the missingness depends on the unobserved score on the post-test alone. Thus, the literature on missing data methods often suggests including additional variables into a statistical model in order to make the missing data mechanism ignorable ( Collins et al. 2001 ; Graham 2003 ; Rubin 1996 ).
The tenability of MCAR can be examined using Little’s multivariate test ( Little and Schenker 1995 ). However, it is impossible to test whether the MAR condition holds, given only the observed data ( Carpenter and Goldstein 2004 ; Horton and Kleinman 2007 ; White et al. 2011 ). One can instead examine the plausibility of MAR by a simple t -test of mean differences between the group with complete data and that with missing data ( Diggle et al. 1995 ; Tabachnick and Fidell 2012 ). Both approaches are illustrated with a data set at http://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/en/client/Manuals/IBM_SPSS_Missing_Values.pdf . Yet, Schafer and Graham ( 2002 ) criticized the practice of dummy coding missing values, because such a practice redefines the parameters of the population. Readers should therefore be cautioned that the results of these tests should not be interpreted as providing definitive evidence of either MCAR or MAR.
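The simple t-test screen described above is easy to sketch. The following is a minimal pure-Python illustration (not the paper's SAS code; function names are ours): it splits a fully observed covariate by whether a second variable is missing and compares the group means with Welch's t statistic. A large |t| suggests that missingness is related to the covariate, i.e., the data are not MCAR.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances).
    Assumes each group has at least two observations."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def mar_screen(x, y):
    """Split the fully observed variable x by whether y is missing (None),
    then compare group means on x. A large |t| is evidence against MCAR."""
    observed = [xi for xi, yi in zip(x, y) if yi is not None]
    missing = [xi for xi, yi in zip(x, y) if yi is None]
    return welch_t(observed, missing)
```

As the text cautions, such a screen cannot establish MAR; a nonsignificant result is merely consistent with MCAR.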
There are three patterns of missing data: univariate, monotone, and arbitrary; each is discussed below. Suppose there are p variables, denoted as Y 1 , Y 2 , …, Y p . A data set is said to have a univariate pattern of missing if the same participants have missing data on one or more of the p variables. A dataset is said to have a monotone missing data pattern if the variables can be arranged in such a way that, when Y j is missing, Y j + 1 , Y j + 2 , …, Y p are missing as well. The monotone missing data pattern occurs frequently in longitudinal studies where, if a participant drops out at one point, his/her data are missing on subsequent measures. For the treatment of missing data, the monotone missing data pattern subsumes the univariate missing data pattern. If missing data occur in any variable for any participant in a random fashion, the data set is said to have an arbitrary missing data pattern. Computationally, the univariate or the monotone missing data pattern is easier to handle than an arbitrary pattern.
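The monotone condition defined above is mechanical enough to check in code. A small sketch (ours, not from the paper), assuming the variables are already arranged in the order Y 1 , …, Y p and missing values are coded as None:

```python
def is_monotone(rows):
    """Check the monotone missing-data pattern: in every row (participant),
    once a value is missing, all later values must also be missing."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value after a missing one
    return True
```

A dropout pattern such as [1, 2, None, None] passes; an intermittent pattern such as [1, None, 3] does not, and would be classified as arbitrary.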
MI is a principled missing data method that provides valid statistical inferences under the MAR condition ( Little and Rubin 2002 ). MI was proposed to impute missing data while acknowledging the uncertainty associated with the imputed values ( Little and Rubin 2002 ). Specifically, MI acknowledges the uncertainty by generating a set of m plausible values for each unobserved data point, resulting in m complete data sets, each with one unique estimate of the missing values. The m complete data sets are then analyzed individually using standard statistical procedures, resulting in m slightly different estimates for each parameter. At the final stage of MI, m estimates are pooled together to yield a single estimate of the parameter and its corresponding SE . The pooled SE of the parameter estimate incorporates the uncertainty due to the missing data treatment (the between imputation uncertainty) into the uncertainty inherent in any estimation method (the within imputation uncertainty). Consequently, the pooled SE is larger than the SE derived from a single imputation method (e.g., mean substitution) that does not consider the between imputation uncertainty. Thus, MI minimizes the bias in the SE of a parameter estimate derived from a single imputation method.
In sum, MI handles missing data in three steps: (1) imputes missing data m times to produce m complete data sets; (2) analyzes each data set using a standard statistical procedure; and (3) combines the m results into one using formulae from Rubin ( 1987 ) or Schafer ( 1997 ). Below we discuss each step in greater details and demonstrate MI with a real data set in the section Demonstration .
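The combining formulae in step (3) are compact enough to sketch here. The following pure-Python illustration (function names are ours) pools m completed-data results for a single scalar parameter using Rubin's (1987) rules: the pooled estimate is the average of the m estimates, and the pooled variance adds the between-imputation variance, inflated by (1 + 1/m), to the within-imputation variance.

```python
import math

def pool(estimates, standard_errors):
    """Combine m completed-data estimates of one parameter with Rubin's rules.
    Returns the pooled estimate and its pooled standard error."""
    m = len(estimates)
    qbar = sum(estimates) / m                        # pooled point estimate
    w = sum(se ** 2 for se in standard_errors) / m   # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = w + (1 + 1 / m) * b                          # total variance
    return qbar, math.sqrt(t)
```

Because t includes the between-imputation component b, the pooled SE is larger than the within-imputation SE alone, which is exactly the uncertainty a single-imputation method fails to capture.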
The imputation step in MI is the most complicated step among the three steps. The aim of the imputation step is to fill in missing values multiple times using the information contained in the observed data. Many imputation methods are available to serve this purpose. The preferred method is the one that matches the missing data pattern. Given a univariate or monotone missing data pattern, one can impute missing values using the regression method ( Rubin 1987 ), or the predictive mean matching method if the missing variable is continuous ( Heitjan and Little 1991 ; Schenker and Taylor 1996 ). When data are missing arbitrarily, one can use the Markov Chain Monte Carlo (MCMC) method ( Schafer 1997 ), or the fully conditional specification (also referred to as chained equations) if the missing variable is categorical or non-normal ( Raghunathan et al. 2001 ; van Buuren 2007 ; van Buuren et al. 1999 ; van Buuren et al. 2006 ). The regression method and the MCMC method are described next.
Suppose that there are p variables, Y 1 , Y 2 , …, Y p in a data set and missing data are uniformly or monotonically present from Y j to Y p , where 1 < j ≤ p . To impute the missing values for the j th variable, one first constructs a regression model using observed data on Y 1 through Y j - 1 to predict the missing values on Y j :
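The regression equation introduced here was lost in extraction. In the notation of the passage, the model fitted to the observed cases takes the usual linear form:

```latex
Y_j \;=\; \beta_0 + \beta_1 Y_1 + \beta_2 Y_2 + \cdots + \beta_{j-1} Y_{j-1} + \varepsilon
```

In Rubin's (1987) regression method, new regression parameters are drawn from their posterior distribution for each imputation, and random noise is added to each prediction, so that the m imputed values properly reflect estimation uncertainty.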
When the missing data pattern is arbitrary, it is difficult to develop analytical formulae for the missing data. One has to turn to numerical simulation methods, such as MCMC ( Schafer 1997 ) in this case. The MCMC technique used by the MI procedure of SAS is described below [interested readers should refer to SAS/STAT 9.3 User’s Guide ( SAS Institute Inc 2011 ) for a detailed explanation].
The second step of MI analyzes the m sets of data separately using a statistical procedure of a researcher’s choice. At the end of the second step, m sets of parameter estimates are obtained from separate analyses of m data sets.
MI-related issues.
When implementing MI, the researcher needs to be aware of several practical issues, such as, the multivariate normality assumption, the imputation model, the number of imputations, and the convergence of MCMC. Each is discussed below.
The regression and MCMC methods implemented in statistical packages (e.g., SAS) assume multivariate normality for variables. It has been shown that MI based on the multivariate normal model can provide valid estimates even when this assumption is violated ( Demirtas et al. 2008 ; Schafer 1997 , 1999 ). Furthermore, this assumption is robust when the sample size is large and when the missing rate is low, although the definition for a large sample size or for a low rate of missing is not specified in the literature ( Schafer 1997 ).
When an imputation model contains categorical variables, one cannot use the regression method or MCMC directly. Techniques such as logistic regression and discriminant function analysis can substitute for the regression method, if the missing data pattern is monotonic or univariate. If the missing data pattern is arbitrary, MCMC based on other probability models (such as the joint distribution of normal and binary) can be used for imputation. The free MI software NORM developed by Schafer ( 1997 ) has two add-on modules—CAT and MIX—that deal with categorical data. Specifically, CAT imputes missing data for categorical variables, and MIX imputes missing data for a combination of categorical and continuous variables. Other software packages are also available for imputing missing values in categorical variables, such as the ICE module in Stata ( Royston 2004 , 2005 , 2007 ; Royston and White 2011 ), the mice package in R and S-Plus ( van Buuren and Groothuis-Oudshoorn 2011 ), and the IVEware ( Raghunathan et al. 2001 ). Interested readers are referred to a special volume of the Journal of Statistical Software ( Yucel 2011 ) for recent developments in MI software.
When researchers use statistical packages that impose a multivariate normal distribution assumption on categorical variables, a common practice is to impute missing values based on the multivariate normal model, then round the imputed value to the nearest integer or to the nearest plausible value. However, studies have shown that this naïve way of rounding would not provide desirable results for binary missing values ( Ake 2005 ; Allison 2005 ; Enders 2010 ). For example, Horton et al. ( 2003 ) showed analytically that rounding the imputed values led to biased estimates, whereas imputed values without rounding led to unbiased results. Bernaards et al. ( 2007 ) compared three approaches to rounding in binary missing values: (1) rounding the imputed value to the nearest plausible value, (2) randomly drawing from a Bernoulli trial using the imputed value, between 0 and 1, as the probability in the Bernoulli trial, and (3) using an adaptive rounding rule based on the normal approximation to the binomial distribution. Their results showed that the second method was the worst in estimating odds ratio, and the third method provided the best results. One merit of their study is that it is based on a real-world data set. However, other factors may influence the performance of the rounding strategies, such as the missing mechanism, the size of the model, and the distributions of the categorical variables. These factors are not within a researcher's control. Additional research is needed to identify one or more good strategies for dealing with categorical variables in MI, when a multivariate normal-based software is used to perform MI.
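The three rounding strategies compared by Bernaards et al. (2007) can be sketched concretely. The following is our own minimal illustration for a binary variable, not code from the paper; in particular, the cutoff in the adaptive rule is our reading of the normal-approximation idea (threshold at the mean of the imputed values shifted by the corresponding normal quantile):

```python
import random
from statistics import NormalDist

def naive_round(v):
    """Strategy 1: round the imputed value to the nearest plausible value (0 or 1)."""
    return min(1, max(0, round(v)))

def bernoulli_round(v, rng):
    """Strategy 2: clip the imputed value to [0, 1] and use it as the
    success probability in a Bernoulli draw."""
    p = min(1.0, max(0.0, v))
    return 1 if rng.random() < p else 0

def adaptive_round(imputed):
    """Strategy 3 (our sketch of the adaptive rule): derive a cutoff from the
    normal approximation to the binomial, then threshold all imputed values."""
    ybar = sum(imputed) / len(imputed)
    cutoff = ybar - NormalDist().inv_cdf(ybar) * (ybar * (1 - ybar)) ** 0.5
    return [1 if v >= cutoff else 0 for v in imputed]
```

In Bernaards et al.'s comparison, strategy 2 performed worst for odds ratios and strategy 3 best; the sketch is only meant to make the mechanics of the three rules concrete.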
Unfortunately, even less is known about the effect of rounding in MI, when imputing ordinal variables with three or more levels. It is possible that as the level of the categorical variable increases, the effect of rounding decreases. Again, studies are needed to further explore this issue.
MI requires two models: the imputation model used in step 1 and the analysis model used in step 2. Theoretically, MI assumes that the two models are the same. In practice, they can be different ( Schafer 1997 ). An appropriate imputation model is the key to the effectiveness of MI; it should have the following two properties.
First, an imputation model should include useful variables. Rubin ( 1996 ) recommends a liberal approach when deciding if a variable should be included in the imputation model. Schafer ( 1997 ) and van Buuren et al. ( 1999 ) recommended three kinds of variables to be included in an imputation model: (1) variables that are of theoretical interest, (2) variables that are associated with the missing mechanism, and (3) variables that are correlated with the variables with missing data. The latter two kinds of variables are sometimes referred to as auxiliary variables ( Collins et al. 2001 ). The first kind of variables is necessary, because omitting them will downward bias the relation between these variables and other variables in the imputation model. The second kind of variables makes the MAR assumption more plausible, because they account for the missing mechanism. The third kind of variables helps to estimate missing values more precisely. Thus, each kind of variables has a unique contribution to the MI procedure. However, including too many variables in an imputation model may inflate the variance of estimates, or lead to non-convergence. Thus, researchers should carefully select variables to be included into an imputation model. van Buuren et al. ( 1999 ) recommended not including auxiliary variables that have too many missing data. Enders ( 2010 ) suggested selecting auxiliary variables that have absolute correlations greater than .4 with variables with missing data.
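Enders's (2010) correlation rule for the third kind of variable is easy to operationalize. A small sketch of ours (not from the paper): compute each candidate's Pearson correlation with the incomplete target variable over the jointly observed cases, and keep candidates whose absolute correlation exceeds the suggested .4 threshold.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def select_auxiliary(candidates, target, threshold=0.4):
    """Keep candidate auxiliary variables whose |r| with the incomplete
    target variable (None = missing) exceeds the threshold, using only
    cases where both are observed."""
    keep = {}
    for name, values in candidates.items():
        pairs = [(v, t) for v, t in zip(values, target)
                 if v is not None and t is not None]
        xs, ys = zip(*pairs)
        if abs(pearson(list(xs), list(ys))) > threshold:
            keep[name] = values
    return keep
```

In practice one would also screen out candidates that are themselves heavily missing, per van Buuren et al.'s (1999) advice above.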
Second, an imputation model should be general enough to capture the assumed structure of the data. If an imputation model is more restrictive, namely, making additional restrictions than an analysis model, one of two consequences may follow. One consequence is that the results are valid but the conclusions may be conservative (i.e., failing to reject the false null hypothesis), if the additional restrictions are true ( Schafer 1999 ). Another consequence is that the results are invalid because one or more of the restrictions is false ( Schafer 1999 ). For example, a restriction may restrict the relationship between a variable and other variables in the imputation model to be merely pairwise. Therefore, any interaction effect that involves at least three variables will be biased toward zero. To handle interactions properly in MI, Enders ( 2010 ) suggested that the imputation model include the product of the two variables if both are continuous. For categorical variables, Enders suggested performing MI separately for each subgroup defined by the combination of the levels of the categorical variables.
However, methodologists have not agreed on the optimal number of imputations. Schafer and Olsen ( 1998 ) suggested that “in many applications, just 3–5 imputations are sufficient to obtain excellent results” (p. 548). Schafer and Graham ( 2002 ) were more conservative in asserting that 20 imputations are enough in many practical applications to remove noises from estimations. Graham et al. ( 2007 ) commented that relative efficiency (RE) should not be an important criterion when specifying m , because RE has little practical meaning. Other factors, such as, the SE , p -value, and statistical power, are more related to empirical research and should also be considered, in addition to RE. Graham et al. ( 2007 ) reported that statistical power decreased much faster than RE, as λ (the fraction of missing information) increases and/or m decreases. In an extreme case in which λ=.9 and m = 3, the power for MI was only .39, while the power of an equivalent FIML analysis was 0.78. Based on these results, Graham et al. ( 2007 ) provided a table for the number of imputations needed, given λ and an acceptable power falloff, such as 1%. They defined the power falloff as the percentage decrease in power, compared to an equivalent FIML analysis, or compared to m = 100. For example, to ensure a power falloff less than 1%, they recommended m = 20, 40, 100, or > 100 for a true λ =.1, .5, .7, or .9 respectively. Their recommended m is much larger than what is derived from the Rubin rule based on RE ( Rubin 1987 ). Unfortunately, Graham et al.’s study is limited to testing a small standardized regression coefficient (β = 0.0969) in a simple regression analysis. The power falloff of MI may be less severe when the true β is larger than 0.0969. At the present, the literature does not shed light on the performance of MI when the regression model is more complex than a simple regression model.
Recently, White et al. ( 2011 ) argued that, in addition to relative efficiency and power, researchers should also consider Monte Carlo error when specifying the optimal number of imputations. Monte Carlo error is defined as the standard deviation of the estimates (e.g., regression coefficients, test statistics, p -values) “across repeated runs of the same imputation procedure with the same data” ( White et al. 2011 p. 387). Monte Carlo error converges to zero as m increases. A small Monte Carlo error implies that results from a particular run of MI could be reproduced in a subsequent repetition of the MI analysis. White et al. also suggested that the number of imputations should be greater than or equal to the percentage of missing observations in order to ensure an adequate level of reproducibility. For studies that compare different statistical methods, the number of imputations should be even larger than the percentage of missing observations, usually between 100 and 1,000, in order to control the Monte Carlo error ( Royston and White 2011 ).
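Monte Carlo error can be illustrated with a toy simulation: repeat a crude stochastic imputation procedure many times on the same data and measure how much the pooled estimate wanders from run to run. (This sketch draws imputations from a fixed normal distribution rather than a proper posterior; it is only meant to show the run-to-run error shrinking as m grows.)

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(10.0, 2.0, size=200)
x[rng.random(200) < 0.3] = np.nan          # ~30% missing completely at random
obs = x[~np.isnan(x)]
n_mis = int(np.isnan(x).sum())

def one_mi_run(m):
    # Crude stochastic imputation: draw each missing value from
    # N(mean_obs, sd_obs), then pool the m completed-data means.
    pooled = []
    for _ in range(m):
        draws = rng.normal(obs.mean(), obs.std(), n_mis)
        pooled.append(np.concatenate([obs, draws]).mean())
    return np.mean(pooled)

mc_error = {}
for m in (3, 50):
    runs = [one_mi_run(m) for _ in range(200)]
    mc_error[m] = np.std(runs)             # SD across repeated runs of the procedure

print(mc_error)  # the m = 50 error is far smaller than the m = 3 error
```

The standard deviation across repeated runs falls roughly in proportion to the square root of m, which is the sense in which a larger m makes a single MI analysis more reproducible.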
It is clear from the above discussions that a simple recommendation for the number of imputations (e.g., m = 5) is inadequate. For data sets with a large amount of missing information, more than five imputations are necessary in order to maintain the power level and control the Monte Carlo error. A larger imputation model may require more imputations, compared to a smaller or simpler model. This is so because a large imputation model results in increased SE s, compared to a smaller or simpler model. Therefore, for a large model, additional imputations are needed to offset the increased SE s. Specific guidelines for choosing m await empirical research. In general, it is a good practice to specify a sufficient m to ensure the convergence of MI within a reasonable computation time.
The convergence of the Markov chain is one determinant of the validity of the results obtained from MI. If the Markov chain does not converge, the imputed values cannot be considered random samples from the posterior distribution of the missing data given the observed data, i.e., P ( Y mis | Y obs ). Consequently, statistical results based on these imputed values are invalid. Unfortunately, the importance of assessing convergence is rarely mentioned in articles that review the theory and application of MCMC ( Schafer 1999 ; Schafer and Graham 2002 ; Schlomer et al. 2010 ; Sinharay et al. 2001 ). Because convergence is defined in terms of both probability and procedures, determining the convergence of MCMC is complex and difficult ( Enders 2010 ). One way to roughly assess convergence is to visually examine the trace plot and the autocorrelation function plot; both are provided by SAS PROC MI ( SAS Institute Inc 2011 ). For a parameter θ , a trace plot displays the number of iterations ( t ) on the horizontal axis against the value of θ ( t ) on the vertical axis. If the MCMC converges, the trace plot shows no systematic trend. The autocorrelation plot displays the autocorrelations between the θ ( t ) s at lag k on the vertical axis against k on the horizontal axis. Ideally, the autocorrelation at any lag should not be statistically significantly different from zero. Since a Markov chain may converge at different rates for different parameters, one needs to examine these two plots for each parameter. When there are many parameters, one can instead examine the worst linear function ( or WLF, Schafer 1997 ). The WLF is a constructed statistic that converges more slowly than all other parameters in the MCMC method; thus, if the WLF converges, all parameters should have converged (see pp. 2–3 of the Appendix for an illustration of both plots for the WLF, accessible from https://oncourse.iu.edu/access/content/user/peng/Appendix.Dong%2BPeng.Principled%20missing%20methods.current.pdf ). Another way to assess the convergence of MCMC is to start the chain multiple times, each with a different initial value. If all the chains yield similar results, one can be confident that the algorithm has converged.
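The autocorrelation check can be computed directly from a saved parameter trace. The sketch below (our illustration, not PROC MI's implementation) contrasts a slowly mixing chain, whose high lag-k autocorrelation mimics a poorly converged MCMC run, with independent draws, which is what a well-mixing chain looks like.

```python
import numpy as np

def autocorr(chain, k):
    # Sample autocorrelation of a parameter trace at lag k
    c = chain - chain.mean()
    return float(np.dot(c[:-k], c[k:]) / np.dot(c, c))

rng = np.random.default_rng(3)

# A sticky AR(1) trace, mimicking a slowly mixing chain ...
slow = np.empty(5000)
slow[0] = 0.0
for t in range(1, 5000):
    slow[t] = 0.95 * slow[t - 1] + rng.normal()

# ... versus independent draws (a well-mixing chain)
fast = rng.normal(size=5000)

print(autocorr(slow, 10))   # substantial: draws 10 iterations apart still related
print(autocorr(fast, 10))   # near zero
```

In practice one would thin the chain (increase the number of iterations between saved imputations) until the autocorrelation at the chosen lag is negligible.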
FIML is a model-based missing data method that is used frequently in structural equation modeling (SEM). In our review of the literature, 26.1% of the studies that had missing data used FIML to deal with them. Unlike MI, FIML does not impute any missing data. It estimates parameters directly, using all the information already contained in the incomplete data set. The FIML approach was outlined by Hartley and Hocking ( 1971 ). As the name suggests, FIML obtains parameter estimates by maximizing the likelihood function of the incomplete data. Under the assumption of multivariate normality, the log likelihood function of each observation i is:
where x i is the vector of observed values for case i , K i is a constant determined by the number of observed variables for case i , and μ and Σ are, respectively, the mean vector and the covariance matrix to be estimated ( Enders 2001 ). For example, suppose there are three variables ( X 1 , X 2 , and X 3 ) in the model, and for case i , X 1 = 10 and X 2 = 5, while X 3 is missing. Then the log likelihood function for case i is:
The total sample log likelihood is the sum of the individual log likelihood across n cases. The standard ML algorithm is used to obtain the estimates of μ and Σ, and the corresponding SE s by maximizing the total sample log likelihood function.
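The casewise construction can be sketched directly (a minimal illustration of the idea, not SAS's implementation): each case contributes a multivariate normal log likelihood built from only the sub-vector of μ and the sub-matrix of Σ corresponding to its observed variables.

```python
import numpy as np

def fiml_loglik(data, mu, sigma):
    # Sum of casewise log likelihoods under multivariate normality,
    # each term using only the variables observed for that case.
    total = 0.0
    for row in data:
        o = ~np.isnan(row)                    # mask of observed variables
        k = int(o.sum())
        if k == 0:
            continue                          # a fully missing case adds nothing
        diff = row[o] - mu[o]
        s = sigma[np.ix_(o, o)]               # sub-covariance of observed variables
        total += -0.5 * (k * np.log(2 * np.pi)
                         + np.log(np.linalg.det(s))
                         + diff @ np.linalg.solve(s, diff))
    return total

# The three-variable example from the text: X1 = 10, X2 = 5, X3 missing,
# so case i contributes a bivariate normal term in (X1, X2) only.
case = np.array([[10.0, 5.0, np.nan]])
mu = np.array([9.0, 4.0, 0.0])                # hypothetical parameter values
sigma = np.eye(3)
print(fiml_loglik(case, mu, sigma))
```

Maximizing this function over μ and Σ (e.g., with a numerical optimizer) yields the FIML estimates; SAS and SEM software do this with analytic gradients rather than the naive loop shown here.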
As with MI, FIML also assumes MAR and multivariate normality for the joint distribution of all the variables. When the two assumptions are met, FIML is demonstrated to produce unbiased estimates ( Enders and Bandalos 2001 ) and valid model fit information ( Enders 2001 ). Furthermore, FIML is generally more efficient than other ad hoc missing data methods, such as LD ( Enders 2001 ). When the normality assumption was violated, Enders ( 2001 ) reported that (1) FIML provided unbiased estimates across different missing rates, sample sizes, and distribution shapes, as long as the missing mechanism was MCAR or MAR, but (2) FIML resulted in negatively biased SE estimates and an inflated model rejection rate (namely, rejecting fitted models too frequently). Thus, Enders recommended using correction methods, such as rescaled statistics and bootstrap, to correct the bias associated with nonnormality.
Because FIML assumes MAR, adding auxiliary variables to a fitted model is beneficial to data analysis in terms of bias and efficiency ( Graham 2003 ; Section titled The Imputation Model). Collins et al. ( 2001 ) showed that auxiliary variables are especially helpful when (1) missing rate is high (i.e., > 50%), and/or (2) the auxiliary variable is at least moderately correlated (i.e., Pearson’s r > .4) with either the variable containing missing data or the variable causing missingness. However, incorporating auxiliary variables into FIML is not as straightforward as it is with MI. Graham ( 2003 ) proposed the saturated correlates model to incorporate auxiliary variables into a substantive SEM model, without affecting the parameter estimates of the SEM model or its model fit index. Specifically, Graham suggested that, after the substantive SEM model is constructed, the auxiliary variables be added into the model according to the following rules: (a) all auxiliary variables are specified to be correlated with all exogenous manifest variables in the model; (b) all auxiliary variables are specified to be correlated with the residuals for all the manifest variables that are predicted; and (c) all auxiliary variables are specified to be correlated to each other. Afterwards, the saturated correlates model can be fitted to data by FIML to increase efficiency and decrease bias.
The EM algorithm is another maximum-likelihood based missing data method. As with FIML, the EM algorithm does not “fill in” missing data, but rather estimates the parameters directly by maximizing the complete data log likelihood function. It does so by iterating between the E step and the M step ( Dempster et al. 1977 ).
The E (expectation) step calculates the expectation of the log likelihood function of the parameters, given the data. Assume a data set Y is partitioned into two parts, the observed part and the missing part, namely, Y = ( Y obs , Y mis ). The distribution of Y , which depends on the unknown parameter θ , can therefore be written as:
Equation 13 can be written as a likelihood function as Equation 14 :
where c is a constant relating to the missing data mechanism that can be ignored under the MAR assumption and the independence between model parameters and the missing mechanism parameters ( Schafer 1997 p. 12). Taking the log of both sides of Equation 14 yields the following:
where l ( θ | Y ) = log P ( Y | θ ) is the complete-data log likelihood, l ( θ | Y obs ) is the observed-data log likelihood, log c is a constant, and P ( Y mis | Y obs , θ ) is the predictive distribution of the missing data, given θ ( Schafer 1997 ). Since log c does not affect the estimation of θ , this term can be dropped in subsequent calculations.
Because Y mis is unknown, the complete-data log likelihood cannot be determined directly. However, if there is a temporary or initial guess of θ (denoted as θ ( t ) ), it is possible to compute the expectation of l ( θ | Y ) with respect to the assumed distribution of the missing data P ( Y mis | Y obs , θ ( t ) ) as Equation 16 :
It is at the E step of the EM algorithm that Q ( θ | θ ( t ) ) is calculated.
At the M (Maximization) step, the next guess of θ is obtained by maximizing the expectation of the complete data log likelihood from the previous E step:
The EM algorithm is initialized with an arbitrary guess of θ 0 , usually estimates based solely on the observed data. It proceeds by alternating between the E step and M step. It is terminated when successive estimates of θ are nearly identical. The θ ( t +1) that maximizes Q ( θ | θ ( t ) ) is guaranteed to yield an observed data log likelihood that is greater than or equal to that provided by θ ( t ) ( Dempster et al. 1977 ).
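The E and M steps can be made concrete with a bivariate normal sketch (an illustration we constructed, not the chapter's example): Y2 is MAR-missing given Y1, the E step replaces the missing sufficient statistics with their conditional expectations under the current (μ, Σ), and the M step re-estimates μ and Σ from the expected sufficient statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate bivariate normal data; make Y2 missing with probability
# depending only on the observed Y1, so the mechanism is MAR.
n = 2000
mu_true = np.array([0.0, 0.0])
cov_true = np.array([[1.0, 0.6], [0.6, 1.0]])
data = rng.multivariate_normal(mu_true, cov_true, size=n)
y1, y2 = data[:, 0], data[:, 1].copy()
miss = rng.random(n) < 0.5 * (y1 > 0)
y2[miss] = np.nan

# Initial guess: observed-data means, complete-case covariance
mu = np.array([y1.mean(), np.nanmean(y2)])
sigma = np.cov(y1[~miss], y2[~miss])

for _ in range(200):
    # E step: expected sufficient statistics for the missing y2 values
    b = sigma[0, 1] / sigma[0, 0]              # slope of y2 regressed on y1
    cond_var = sigma[1, 1] - b * sigma[0, 1]   # residual variance of y2 | y1
    e_y2 = np.where(miss, mu[1] + b * (y1 - mu[0]), y2)
    e_y2sq = np.where(miss, e_y2**2 + cond_var, y2**2)

    # M step: update mu and Sigma from the expected sufficient statistics
    mu_new = np.array([y1.mean(), e_y2.mean()])
    s11 = np.mean(y1**2) - mu_new[0]**2
    s12 = np.mean(y1 * e_y2) - mu_new[0] * mu_new[1]
    s22 = np.mean(e_y2sq) - mu_new[1]**2
    sigma_new = np.array([[s11, s12], [s12, s22]])

    converged = np.allclose(mu_new, mu, atol=1e-8) and np.allclose(sigma_new, sigma, atol=1e-8)
    mu, sigma = mu_new, sigma_new
    if converged:
        break

print(mu)      # close to mu_true despite the MAR missingness
print(sigma)   # close to cov_true
```

Note the cond_var term added to e_y2sq: simply plugging in the conditional mean and recomputing the covariance would understate the variance of Y2, which is exactly the mistake that single regression imputation makes.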
However, the EM algorithm also has several disadvantages. First, the EM algorithm does not compute the derivatives of the log likelihood function; consequently, it does not provide estimates of SE s. Although extensions of EM have been proposed to allow for the estimation of SE s, these extensions are computationally complex. Thus, EM is not the method of choice when statistical tests or confidence intervals for estimated parameters are the primary goal of the research. Second, the rate of convergence can be painfully slow when the proportion of missing information is large ( Little and Rubin 2002 ). Third, many statistical programs assume the multivariate normal distribution when constructing l ( θ | Y ). Violation of this multivariate normality assumption may cause convergence problems for EM, and also for other ML-based methods, such as FIML. For example, if the likelihood function has more than one mode, the mode to which EM converges depends on the starting value of the iteration. Schafer ( 1997 ) cautions that multiple modes do occur in real data sets, especially when “the data are sparse and/or the missingness pattern is unusually pernicious” (p. 52). One way to check whether EM provides valid results is to initialize the algorithm with different starting values and check whether the results are similar. Finally, EM is model specific: each proposed data model requires a unique likelihood function. In sum, if used flexibly, EM is powerful and can provide smaller SE estimates than MI. Schafer and Graham ( 2002 ) compiled a list of packages that offered the EM algorithm; to the best of our knowledge, the list has not been updated in the literature.
In this section, we demonstrate the three principled missing data methods by applying them to a real-world data set. The data set is complete and described under Data Set . A research question posed to this data set and an appropriate analysis strategy are described next under Statistical Modeling . From the complete data set, missing data on two variables was created under the MAR assumption at three missing data rates. These missing data conditions are described under Generating Missing Data Conditions . For each missing data condition, LD, MI, FIML, and EM were applied to answer the research question. The application of these four methods is described under Data Analysis . Results obtained from these methods were contrasted with those obtained from the complete data set. The results are discussed in the next section, titled Results .
Self-reported health data were collected from 432 adolescents in the fall of 1988 at two junior high schools (Grades 7 through 9) in the Chicago area. Of the 432 participants, 83.4% were White and the remainder Black or other, with a mean age of 13.9 years and nearly even numbers of girls ( n = 208) and boys ( n = 224). Parents were notified by mail that the survey was to be conducted. Both the parents and the students were assured of their right to optional participation and of the confidentiality of students’ responses. Written parental consent was waived with the approval of the school administration and the university Institutional Review Board ( Ingersoll et al. 1993 ). The adolescents reported their health behavior, using the Health Behavior Questionnaire (HBQ) ( Ingersoll and Orr 1989 ; Peng et al. 2006 ; Resnick et al. 1993 ), self-esteem, using Rosenberg’s inventory ( Rosenberg 1989 ), gender, race, intention to drop out of school, and family structure. The HBQ asked adolescents to indicate whether they engaged in specific risky health behaviors (Behavioral Risk Scale) or had experienced selected emotions (Emotional Risk Scale). The response scale ranged from 1 ( never ) to 4 ( about once a week ) for both scales. Examples of behavioral risk items were “I use alcohol (beer, wine, booze),” “I use pot,” and “I have had sexual intercourse/gone all the way.” These items measured the frequency of adolescents’ alcohol and drug use, sexual activity, and delinquent behavior. Examples of emotional risk items were “I have attempted suicide,” and “I have felt depressed.” Emotional risk items measured adolescents’ quality of relationships with others and management of emotions. Cronbach’s alpha reliability ( Nunnally 1978 ) was .84 for the Behavioral Risk Scale and .81 for the Emotional Risk Scale ( Peng and Nichols 2003 ). Adolescents’ self-esteem was assessed using Rosenberg’s self-esteem inventory ( Rosenberg 1989 ).
Self-esteem scores ranged from 9.79 to 73.87 with a mean of 50.29 and SD of 10.04. Furthermore, among the 432 adolescents, 12.27% ( n = 53) indicated an intention to drop out of school; 67.4% ( n = 291) were from families with two parents, including those with one step-parent, and 32.63% ( n = 141) were from families headed by a single parent. The data set is hereafter referred to as the Adolescent data and is available from https://oncourse.iu.edu/access/content/user/peng/logregdata_peng_.sav as an SPSS data file.
For the Adolescent data, we were interested in predicting adolescents’ behavioral risk from their gender, intention to drop out from school, family structure, and self-esteem scores. Given this objective, a linear regression model was fit to the data using adolescents’ score on the Behavioral Risk Scale of the HBQ as the dependent variable (BEHRISK) and gender (GENDER), intention to drop out of school (DROPOUT), type of family structure (FAMSTR), and self-esteem score (ESTEEM) as predictors or covariates. The emotional risk (EMORISK) was used subsequently as an auxiliary variable to illustrate the missing data methods. Hence, it was not included in the regression model. For the linear regression model, gender was coded as 1 for girls and 0 for boys, DROPOUT was coded as 1 for yes and 0 for no, and FAMSTR was coded as 1 for single-parent families and 0 for intact or step families. BEHRISK and ESTEEM were coded using participant’s scores on these two scales. Because the distribution of BEHRISK was highly skewed, a natural log transformation was applied to BEHRISK to reduce its skewness from 2.248 to 1.563. The natural-log transformed BEHRISK (or LBEHRISK) and ESTEEM were standardized before being included in the regression model to facilitate the discussion of the impact of different missing data methods. Thus, the regression model fitted to the Adolescent data was:
The regression coefficients obtained from SAS 9.3 using the complete data were:
According to the results, when all other covariates were held as a constant, boys, adolescents with intention to drop out of school, those with low self-esteem scores, or adolescents from single-parent families, were more likely to engage in risky behaviors.
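As a sanity check on the reported direction of effects, one can simulate data from the complete-data coefficients in Table 2 and refit Equation 18 by ordinary least squares. Everything in this sketch, including the sample generation and the base rates of the binary covariates, is a stand-in we constructed; it is not the real Adolescent data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 432

# Hypothetical covariates coded as in the text: 1 = girl, 1 = intends to
# drop out, 1 = single-parent family; ESTEEM standardized.
gender = (rng.random(n) < 0.48).astype(float)
dropout = (rng.random(n) < 0.12).astype(float)
famstr = (rng.random(n) < 0.33).astype(float)
esteem = rng.normal(size=n)

# Generate LBEHRISK from the complete-data coefficients reported in Table 2
lbehrisk = (-0.434 * gender + 1.172 * dropout - 0.191 * esteem
            + 0.367 * famstr + rng.normal(scale=0.85, size=n))

X = np.column_stack([np.ones(n), gender, dropout, esteem, famstr])
beta, *_ = np.linalg.lstsq(X, lbehrisk, rcond=None)
print(beta[1:])  # signs match the interpretation above: boys, intended dropouts,
                 # low self-esteem, and single-parent families -> higher risk
```

Refitting recovers the generating signs, which is the pattern the interpretation in the text describes.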
The missing data on LBEHRISK and ESTEEM were created under the MAR mechanism. Specifically, the probability of missing data on LBEHRISK was made to depend on EMORISK, and the probability of missing data on ESTEEM on FAMSTR. Peugh and Enders ( 2004 ) reviewed missing data reported in 23 applied research journals, and found that “the proportion of missing cases per analysis ranged from less than 1% to approximately 67%” (p. 539). Peng et al. ( 2006 ) reported missing rates ranging from 26% to 72% based on 1,666 studies published in 11 education and psychology journals. We thus designed our study to correspond to the wide spread of missing rates encountered by applied researchers. Specifically, we manipulated the overall missing rate at three levels: 20%, 40%, or 60% (see Table 1 ). We did not include lower missing rates, such as 10% or 5%, because we expected the missing data methods to perform similarly, and better, at low missing rates than at high missing rates. Altogether we generated three missing data conditions using SPSS 20 (see the Appendix for the SPSS syntax for generating missing data). Due to the difficulty of manipulating missing data in both the outcome variable and the covariates, the actual overall missing rates could not be controlled at exactly 20%, 40%, or 60%; they did, however, closely approximate these pre-specified rates (see the description below).
Table 1. Probability of missing for LBEHRISK and ESTEEM at three missing rates

| Overall missing rate | Missing variable | FAMSTR: single-parent | FAMSTR: intact/step | Missing variable | EMORISK ≤ Q1 | Q1 < EMORISK < Q3 | EMORISK ≥ Q3 |
|---|---|---|---|---|---|---|---|
| 20% | ESTEEM | .20 | .02 | LBEHRISK | .00 | .10 | .30 |
| 40% | ESTEEM | .40 | .05 | LBEHRISK | .10 | .20 | .60 |
| 60% | ESTEEM | .80 | .10 | LBEHRISK | .20 | .40 | .80 |

Note . Q1 = first quartile, Q3 = third quartile.
According to Table 1 , at the 20% overall missing rate, participants from a single-parent family had a probability of .20 of missing ESTEEM, while participants from a two-parent family (including the intact families and families with one step- and one biological parent) had a probability of .02 of missing scores on ESTEEM. As the overall missing rate increased from 20% to 40% or 60%, the probability of missing on ESTEEM likewise increased. Furthermore, the probability of missing in LBEHRISK was conditioned on the value of EMORISK. Specifically, at the 20% overall missing rate, if EMORISK was at or below the first quartile, the probability of LBEHRISK missing was .00 (Table 1 ). If EMORISK was between the first and the third quartiles, the probability of LBEHRISK missing was .10 and an EMORISK at or above the third quartile resulted in LBEHRISK missing with a probability of .30. When the overall missing rate increased to 40% or 60%, the probabilities of missing LBEHRISK increased accordingly.
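The Table 1 scheme is straightforward to reproduce. The sketch below implements the 20% condition with stand-in variables; the SPSS syntax in the paper's Appendix is the authoritative version, and the variable values here are simulated, not the Adolescent data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 432

# Stand-ins for the study variables (not the real Adolescent data)
famstr = rng.random(n) < 0.33            # True = single-parent family
emorisk = rng.normal(size=n)
esteem = rng.normal(size=n)
lbehrisk = rng.normal(size=n)

# Missingness probabilities for the 20% overall-missing condition (Table 1)
p_esteem = np.where(famstr, 0.20, 0.02)
q1, q3 = np.quantile(emorisk, [0.25, 0.75])
p_lbehrisk = np.select([emorisk <= q1, emorisk >= q3], [0.00, 0.30], default=0.10)

# Delete values at the specified probabilities; the mechanism is MAR because
# each probability depends only on a fully observed variable.
esteem_mis = np.where(rng.random(n) < p_esteem, np.nan, esteem)
lbehrisk_mis = np.where(rng.random(n) < p_lbehrisk, np.nan, lbehrisk)

print(np.isnan(esteem_mis).mean(), np.isnan(lbehrisk_mis).mean())
```

As in the study, the realized missing rates only approximate the targets, since each deletion is an independent Bernoulli draw.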
After generating the three data sets with different overall missing rates, the regression model in Equation 18 was fitted to each data set using four methods (i.e., LD, MI, FIML, and EM) to deal with missing data. Since missingness on LBEHRISK depended on EMORISK, EMORISK was used as an auxiliary variable in the MI, EM, and FIML methods. All analyses were performed using SAS 9.3. For simplicity, we describe the data analysis for one of the three data sets, namely, the condition with an overall missing rate of 20%. The other data sets were analyzed similarly. Results are presented in Tables 2 and 3 .
Table 2. Regression Coefficients from Four Missing Data Methods

20% overall missing rate a

| | Complete data | LD | MI | FIML | EM |
|---|---|---|---|---|---|
| GENDER | -0.434*** (0.082) | -0.412*** (0.091) | -0.414*** (0.086) | -0.421*** (0.087) | -0.421*** (0.083) |
| DROPOUT | 1.172*** (0.125) | 1.237*** (0.142) | 1.266*** (0.132) | 1.263*** (0.132) | 1.263*** (0.126) |
| ESTEEM | -0.191*** (0.041) | -0.213*** (0.046) | -0.215*** (0.044) | -0.212*** (0.044) | -0.212*** (0.041) |
| FAMSTR | 0.367*** (0.087) | 0.377*** (0.101) | 0.365*** (0.096) | 0.366*** (0.092) | 0.366*** (0.088) |
| Actual n | 432 | 349 | 432 | N/A | 414 |

60% overall missing rate b

| | Complete data | LD | MI | FIML | EM |
|---|---|---|---|---|---|
| GENDER | -0.434*** (0.082) | -0.39** (0.131) | -0.414*** (0.1) | -0.413*** (0.104) | -0.413*** (0.086) |
| DROPOUT | 1.172*** (0.125) | 1.557*** (0.209) | 1.559*** (0.17) | 1.532*** (0.158) | 1.562*** (0.131) |
| ESTEEM | -0.191*** (0.041) | -0.193** (0.065) | -0.217*** (0.063) | -0.214** (0.06) | -0.215*** (0.043) |
| FAMSTR | 0.367*** (0.087) | 0.479* (0.192) | 0.302* (0.116) | 0.3** (0.111) | 0.3** (0.091) |
| Actual n | 432 | 171 | 432 | N/A | 367 |
Note . Standard error estimates in parentheses. MI results were based on 60 imputations. FIML results were obtained with EMORISK as an auxiliary variable in the model.
a The actual overall missing rate was 19.21%. b The actual overall missing rate was 60.42%.
* p < .05. ** p < .01. *** p < .001.
Table 3. Percentage of Bias in Estimates

20% overall missing rate

| | LD | MI | FIML | EM |
|---|---|---|---|---|
| GENDER | 5.07 | 4.61 | 3.00 | 3.00 |
| DROPOUT | 5.55 | 8.02 | 7.76 | 7.76 |
| ESTEEM | -11.52 | -12.57 | -10.99 | -10.99 |
| FAMSTR | 2.72 | -0.54 | -0.27 | -0.27 |

60% overall missing rate

| | LD | MI | FIML | EM |
|---|---|---|---|---|
| GENDER | 10.14 | 4.61 | 4.84 | 4.84 |
| DROPOUT | 32.85 | 33.02 | 30.72 | 33.28 |
| ESTEEM | -1.05 | -13.61 | -12.04 | -12.57 |
| FAMSTR | 30.52 | -17.71 | -18.26 | -18.26 |
Note . Percentage of bias was calculated as the ratio of the difference between the incomplete data estimate and the complete data estimate divided by the complete data estimate.
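The bias measure in the note can be written out explicitly. Judging from the reported signs (e.g., LD's ESTEEM bias at 20% is negative although both the incomplete and complete estimates are negative), the denominator appears to be the absolute value of the complete-data estimate; that reading is our inference, not stated in the note.

```python
def percent_bias(incomplete_est, complete_est):
    # Signed percent bias; dividing by |complete estimate| matches the signs
    # reported in Table 3 (our inference, not stated in the note).
    return 100.0 * (incomplete_est - complete_est) / abs(complete_est)

print(round(percent_bias(-0.412, -0.434), 2))   # LD, GENDER, 20%: 5.07
print(round(percent_bias(-0.213, -0.191), 2))   # LD, ESTEEM, 20%: -11.52
```

Both values reproduce the corresponding Table 3 entries from the Table 2 estimates.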
The LD method.
The LD method was implemented as a default in PROC REG. To implement LD, we ran PROC REG without specifying any options regarding missing data method. The SAS system, by default, used cases with complete data to estimate the regression coefficients.
The second step in MI was to fit the regression model in Equation 18 to each imputed data set using PROC REG (see the Appendix for the SAS syntax). At the end of PROC REG, 60 sets of estimates of regression coefficients and their variance-covariance matrices were output to the third and final step in MI, namely, to pool these 60 estimates into one set. PROC MIANALYZE was invoked to combine these estimates and their variances/covariances into one set using the pooling formula in Equations 4 to 7 ( Rubin 1987 ). By default, PROC MIANALYZE uses ν m , defined in Equation 9 , for hypothesis testing. In order to specify the corrected degrees of freedom ν m * (as defined in Equation 10 ) for testing, we specified the “EDF=427” option, because 427 was the degrees of freedom based on the complete data.
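The pooling step that PROC MIANALYZE performs (Equations 4 to 7, with the degrees of freedom of Equation 9) can be sketched for a single scalar parameter as follows; this is a conceptual sketch, not SAS's code.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m completed-data estimates of one parameter with Rubin's rules.

    estimates: the m point estimates; variances: their squared SEs.
    Returns the pooled estimate, its SE, and Rubin's degrees of freedom.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # pooled point estimate
    u_bar = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    t = u_bar + (1 + 1 / m) * b           # total variance
    r = (1 + 1 / m) * b / u_bar           # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2 if r > 0 else np.inf
    return q_bar, np.sqrt(t), df

# Three hypothetical completed-data estimates of one regression coefficient
est, se, df = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
print(est, se, df)
```

The corrected degrees of freedom ν m * of Equation 10 additionally caps these df using the complete-data degrees of freedom (the role of the EDF=427 option above); that adjustment is omitted from this sketch.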
The FIML method was implemented using PROC CALIS which is designed for structural equation modeling. Beginning with SAS 9.22, the CALIS procedure has offered an option to analyze data using FIML in the presence of missing data. The FIML method in the CALIS procedure has a variety of applications in path analyses, regression models, factor analyses, and others, as these modeling techniques are considered special cases of structural equation modeling ( Yung and Zhang 2011 ). For the current study, two models were specified using PROC CALIS: an ordinary least squares regression model without the auxiliary variable EMORISK, and a saturated correlates model that included EMORISK. For the saturated correlates model, EMORISK was specified to be correlated with the four covariates (GENDER, DROPOUT, ESTEEM, and FAMSTR) and the residual for LBEHRISK. Graham ( 2003 ) has shown that by constructing the saturated correlates model this way, one can include an auxiliary variable in the SEM model without affecting parameter estimate(s), or the model fit index for the model of substantive interest, which is Equation 18 in the current study.
The EM method was implemented using both PROC MI and PROC REG. As stated previously, the versatile PROC MI can be used for EM when the EM statement is specified. To include auxiliary variables in EM, one lists the auxiliary variables on the VAR statement of PROC MI (see the Appendix for the SAS syntax). The output data set of PROC MI with the EM specification contains the estimated variance-covariance matrix and the vector of means of all the variables listed on the VAR statement. The variance-covariance matrix and the means vector were subsequently input into PROC REG to fit the regression model in Equation 18 . In order to compute the SE s for the estimated regression coefficients, we specified a nominal sample size that was the average number of available cases among all the variables. We decided on this strategy based on findings by Truxillo ( 2005 ), who compared three strategies for specifying sample sizes for hypothesis testing in discriminant function analysis using EM results. The three strategies were: (a) the minimum column-wise n (i.e., the smallest number of available cases among all variables), (b) the average column-wise n (i.e., the mean number of available cases among all the variables), and (c) the minimum pairwise n (i.e., the smallest number of available cases for any pair of variables in a data set). He found that the average column-wise n approach produced results closest to the complete-data results. It is worth noting that Truxillo’s ( 2005 ) study was limited to discriminant function analysis and three sample size specifications. Additional research is needed to determine the best strategy for specifying a nominal sample size for other statistical procedures.
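Truxillo's three candidate sample sizes are simple to compute from the missingness pattern. A sketch (the demo matrix is hypothetical):

```python
import numpy as np

def candidate_ns(data):
    # Three nominal-sample-size strategies compared by Truxillo (2005):
    # minimum column-wise n, average column-wise n, minimum pairwise n.
    observed = ~np.isnan(data)
    avail = observed.sum(axis=0)                 # available cases per variable
    p = data.shape[1]
    pair_ns = [int((observed[:, i] & observed[:, j]).sum())
               for i in range(p) for j in range(i + 1, p)]
    return int(avail.min()), float(avail.mean()), min(pair_ns)

# Tiny illustration: 4 cases, 2 variables, one hole in each variable
demo = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [4.0, np.nan],
                 [5.0, 6.0]])
print(candidate_ns(demo))  # (3, 3.0, 2)
```

The second value, the average column-wise n, is the strategy adopted in the analysis above.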
Results derived from the 40% missing rate exhibited patterns intermediate between those obtained at the 20% and 60% missing rates; hence, they are presented in the Appendix. Table 2 presents estimates of regression coefficients and SE s derived from LD, MI, FIML, and EM for the 20% and 60% missing data conditions. Table 3 presents the percentage of bias in parameter estimates for the four missing data methods. The percentage of bias was defined as the ratio of the difference between the incomplete data estimate and the complete data estimate, divided by the complete data estimate. Any percentage of bias larger than 10% is considered substantial in subsequent discussions. The complete data results are included in Table 2 as a benchmark against which the missing data results are contrasted. The regression model based on the complete data explained 28.4% of the variance (i.e., R adj 2 ) in LBEHRISK, RMSE = 0.846, and all four predictors were statistically significant at p < .001.
According to Table 2 , at 20% overall missing rate, estimates derived from the four missing data methods were statistically significant at p < .001, the same significance level as the complete data results. LD consistently resulted in larger SE , compared to the three principled methods, or the complete data set. The bias in estimates was mostly under 10%, except for estimates of ESTEEM by all four missing data methods (Table 3 ). The three principled methods exhibited similar biases and estimated FAMSTR accurately.
When the overall missing rate was 60% (Table 2 ), estimates derived from the four missing data methods showed that all four covariates were statistically significant at least at p < .05. LD consistently resulted in larger SE , compared to the three principled methods, or the complete data set. All four methods resulted in substantial bias for three of the four covariates (Table 3 ). The three principled methods once again yielded similar biases, whereas bias from LD was similar to these three only for DROPOUT. Indeed, DROPOUT was least accurately estimated by all four methods. LD estimated ESTEEM most accurately and better than the three principled methods. The three principled methods estimated GENDER most accurately and their estimates for FAMSTR were better than LD’s. Differences in absolute bias due to these four methods for ESTEEM or GENDER were actually quite small.
Compared to the complete data results, the three principled methods slightly overestimated SE s (Table 2 ), but not as badly as LD. Among the three methods, SE s obtained from EM were closer to those based on the complete data than those from MI or FIML. This finding is to be expected because MI incorporates into its SE s the uncertainty associated with the plausible missing data estimates, and the literature has consistently documented the superior power of EM compared to MI ( Collins et al. 2001 ; Graham et al. 2007 ; Schafer and Graham 2002 ).
In general, the SE s and the bias increased as the overall missing rate increased from 20% to 60%. One exception to this trend was the bias in ESTEEM estimated by LD, which decreased instead, although the two estimates differed by a mere .02.
During the last decade, the missing data treatments reported in JEP have shown much improvement in terms of decreased use of ad hoc methods (e.g., LD and PD) and increased use of principled methods (e.g., FIML, EM, and MI). Yet several problematic practices persisted, including failing to acknowledge the presence of missing data explicitly, failing to describe the approach used to deal with missing data, and failing to test the assumptions invoked. In this paper, we promote three principled missing data methods (i.e., MI, FIML, and EM) by discussing their theoretical framework, implementation, assumptions, and computing issues. All three methods were illustrated with an empirical Adolescent data set using SAS 9.3. Their performance was evaluated under three missing data conditions, created from three missing rates (20%, 40%, and 60%). Each incomplete data set was subsequently analyzed by a regression model predicting adolescents’ behavioral risk score, using one of the three principled methods or LD. The performance of the four missing data methods was contrasted with that of the complete data set in terms of bias and SE .
Results showed that the three principled methods yielded similar estimates and SE s under all missing data conditions. In comparison, LD consistently resulted in larger SE s for the regression coefficient estimates. These findings are consistent with those reported in the literature and thus confirm the recommendations of the three principled methods ( Allison 2003 ; Horton and Lipsitz 2001 ; Kenward and Carpenter 2007 ; Peng et al. 2006 ; Peugh and Enders 2004 ; Schafer and Graham 2002 ). These results are also consistent with missing data theory, which argues that MI and ML-based methods (e.g., FIML and EM) are equivalent ( Collins et al. 2001 ; Graham et al. 2007 ; Schafer and Graham 2002 ). In terms of SE , the ML-based methods outperformed MI by providing slightly smaller SE s. This finding is to be expected because ML-based methods do not involve any randomness, whereas MI does. Below we elaborate on features shared by MI and ML-based methods, the choice between these two types of methods, and the extension of these methods to multilevel research contexts.
First of all, these methods are based on the likelihood function of P ( Y obs , θ ) = ∫ P ( Y complete , θ ) dY mis . Because this equation is valid under MAR ( Rubin 1976 ), all three principled methods are valid under the MAR assumption. The two ML-based methods work directly with the likelihood function, whereas MI takes the Bayesian approach by imposing a prior distribution on the likelihood function. As the sample size increases, the impact of the specific prior distribution diminishes. It has been shown that,
If the user of the ML procedure and the imputer use the same set of input data (same set of variables and observational units), if their models apply equivalent distributional assumptions to the variables and the relationships among them, if the sample size is large, and if the number of imputations, M, is sufficiently large, then the results from the ML and MI procedures will be essentially identical. (Collins et al. 2001, p. 336)
In fact, the computational details of EM and MCMC (i.e., data augmentation) are very similar (Schafer 1997).
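The factorization that justifies ignoring the missingness mechanism under MAR can be written out explicitly. The following sketch is the standard derivation (following Rubin 1976), not quoted from this paper; R denotes the missingness indicator and φ the parameters of the missingness mechanism:

```latex
% Observed-data likelihood: integrate the complete-data density over Y_mis
L(\theta \mid Y_{\mathrm{obs}}) \propto \int P(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\, dY_{\mathrm{mis}}

% Full likelihood including the missingness indicator R:
P(Y_{\mathrm{obs}}, R \mid \theta, \phi)
  = \int P(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\,
         P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi)\, dY_{\mathrm{mis}}

% Under MAR, P(R | Y_obs, Y_mis, phi) = P(R | Y_obs, phi), so it factors out
% of the integral and inference on theta can ignore the missingness model:
P(Y_{\mathrm{obs}}, R \mid \theta, \phi)
  = P(R \mid Y_{\mathrm{obs}}, \phi)\, L(\theta \mid Y_{\mathrm{obs}})
```

This factorization is what makes the MAR assumption sufficient for all three principled methods.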
Second, both MI and the ML-based methods allow the estimation/imputation model to differ from the analysis model (the model of substantive interest). Although it is widely known that the imputation model can differ from the analysis model in MI, the fact that ML-based methods can incorporate auxiliary variables (such as EMORISK) is rarely mentioned in the literature, except by Graham (2003). As previously discussed, Graham (2003) suggested using the saturated correlates model to incorporate auxiliary variables into SEM. However, this approach expands the model rapidly with each additional auxiliary variable; consequently, the ML-based methods may fail to converge. In this case, MI is the preferred method, especially when one needs to incorporate a large number of auxiliary variables into the model of substantive interest.
Finally, most statistical packages that offer the EM, FIML, and/or MI methods assume multivariate normality. Theory and experiments suggest that MI is more robust to violation of this distributional assumption than ML-based methods (Schafer 1997). As discussed previously, violation of the multivariate normality assumption may cause convergence problems for ML-based methods, yet MI can still provide satisfactory results in the presence of non-normality (refer to the section titled MI Related Issues). This is because the posterior distribution in MI is approximated by a finite mixture of normal distributions; MI is therefore able to capture non-normal features such as skewness or multiple modes (Schafer 1999). At present, the literature does not offer systematic comparisons of these two methods in terms of their sensitivity to violation of the multivariate normality assumption.
The choice between MI and ML-based methods is not easy. On the one hand, ML-based methods offer the advantage of likelihood ratio tests, so that nested models can be compared. Although Schafer (1997) provided a way to combine likelihood ratio test statistics in MI, no empirical studies have evaluated the performance of this pooled likelihood ratio test under various data conditions (e.g., missing mechanism, missing rate, number of imputations, model complexity), and the test has not been incorporated into popular statistical packages such as SAS or SPSS. ML-based methods, in general, produce slightly smaller SEs than MI (Collins et al. 2001; Schafer and Graham 2002). Finally, ML-based methods have greater power than MI (Graham et al. 2007), unless the number of imputations is sufficiently large (e.g., 100 or more).
On the other hand, MI has a clear advantage over ML-based methods when dealing with categorical variables (Peng and Zhu 2008). Another advantage of MI is its computational simplicity (Sinharay et al. 2001): once missing data have been imputed, fitting multiple models to a single data set does not require repeating the imputation, whereas fitting different models to the same data with ML-based methods requires rerunning the estimation each time. As stated earlier, it is also easier to include auxiliary variables in MI than in ML-based methods. In this sense, MI is the preferred method if one wants to employ an inclusive strategy for selecting auxiliary variables.
The choice also depends on the goal of the study. If the aim is exploratory, or if the data are prepared for a number of users who may analyze the data differently, MI is certainly better than an ML-based method. For these purposes, a data analyst needs to make sure that the imputation model is general enough to capture the meaningful relationships in the data set. If, however, a researcher is clear about the parameters to be estimated, FIML or EM is the better choice, because these methods do not introduce randomness due to imputation into the data and are more efficient than MI.
An even better way to deal with missing data is to apply MI and EM jointly. In fact, the application of MI can be facilitated by using EM estimates as starting values for the data augmentation algorithm (Enders 2010). Furthermore, the number of EM iterations needed for convergence is a conservative estimate of the number of burn-in iterations needed in the data augmentation step of MI, because EM converges more slowly than data augmentation.
Many problems in education and psychology are multilevel in nature, such as students nested within classrooms or teachers nested within school districts. To adequately address these problems, methodologists have recommended multilevel models. For an imputation method to yield valid results, the imputation model must reflect the structure of the data; in other words, the imputation model should itself be multilevel in order to impute missing data in a multilevel context (Carpenter and Goldstein 2004). There are several ways to extend MI to deal with missing data when there are two levels. If missing data occur only at level 1 and the number of level 2 units is small, standard MI can be used with minor adjustments. For example, for a random-intercept model, one can dummy-code the cluster membership variable and include the dummy variables in the imputation model. For a model with random slopes and random intercepts, one needs to perform multiple imputation separately within each cluster (Graham 2009). When the number of level 2 units is large, the procedure just described is cumbersome. In this instance, one may turn to specialized MI programs, such as the PAN library in the S-Plus program (Schafer 2001), the REALCOM-IMPUTE software (Carpenter et al. 2011), and the R package mlmmm (Yucel 2007). Unfortunately, ML-based methods have been extended to multilevel models only when there are missing data on the dependent variable, not when covariates are missing at any level, such as a student's age at level 1 or a school's SES at level 2 (Enders 2010).
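The dummy-coding adjustment for a random-intercept model can be sketched in a few lines. This is an invented simulation, not code from the paper; it uses a simple regression-based fill for brevity, whereas a proper MI would also add residual noise and repeat the imputation several times:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_clusters, n_per = 6, 30
cluster = np.repeat(np.arange(n_clusters), n_per)
u = rng.normal(0, 2, n_clusters)[cluster]          # cluster-specific intercepts
x = rng.normal(size=cluster.size)
y = 1.0 + 0.5 * x + u + rng.normal(0, 1, cluster.size)

df = pd.DataFrame({"cluster": cluster, "x": x, "y": y})
df.loc[rng.random(len(df)) < 0.2, "x"] = np.nan    # ~20% missing at level 1

# Dummy-code cluster membership and include the dummies as predictors in the
# imputation model, so imputed values respect between-cluster differences.
dummies = pd.get_dummies(df["cluster"], prefix="c", drop_first=True).astype(float)
design = pd.concat([pd.Series(1.0, index=df.index, name="const"),
                    df["y"], dummies], axis=1)
obs = df["x"].notna()
beta, *_ = np.linalg.lstsq(design[obs].to_numpy(),
                           df.loc[obs, "x"].to_numpy(), rcond=None)
df.loc[~obs, "x"] = design[~obs].to_numpy() @ beta  # fill the missing x values
```

With many level 2 units this design matrix grows by one column per cluster, which is why the specialized multilevel imputation programs mentioned above become preferable.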
In this paper, we discuss and demonstrate three principled missing data methods that are applicable to a variety of research contexts in educational psychology. Before applying any of the principled methods, one should make every effort to prevent missing data from occurring; toward this end, the missing data rate should be kept to a minimum by designing and implementing data collection carefully. When missing data are inevitable, one needs to examine the missing data mechanism, missing rate, missing pattern, and data distribution closely before deciding on a suitable missing data method. When implementing a missing data method, a researcher should be mindful of issues related to its proper implementation, such as statistical assumptions, the specification of the imputation/estimation model, a suitable number of imputations, and criteria for convergence.
Quality of research will be enhanced if (a) researchers explicitly acknowledge missing data problems and the conditions under which they occurred, (b) principled methods are employed to handle missing data, and (c) the appropriate treatment of missing data is incorporated into review standards of manuscripts submitted for publication.
This research was supported by the Maris M. Proffitt and Mary Higgins Proffitt Endowment Grant awarded to the second author. The opinions contained in this paper are those of the authors and do not necessarily reflect those of the grant administrator, Indiana University School of Education.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
YD did literature review on missing data methods, carried out software demonstration, and drafted the manuscript. CYJP conceived the software demonstration design, provided the empirical data, worked with YD collaboratively to finalize the manuscript. Both authors read and approved of the final manuscript.
Yiran Dong, Email: yidong@indiana.edu.
Chao-Ying Joanne Peng, Email: peng@indiana.edu.
BMC Medical Research Methodology, volume 12, Article number: 96 (2012)
Retaining participants in cohort studies with multiple follow-up waves is difficult. Commonly, researchers are faced with the problem of missing data, which may introduce biased results as well as a loss of statistical power and precision. The STROBE guidelines (von Elm et al., Lancet 370:1453-1457, 2007; Vandenbroucke et al., PLoS Med 4:e297, 2007) and the guidelines proposed by Sterne et al. (BMJ 338:b2393, 2009) recommend that cohort studies report the amount of missing data, the reasons for non-participation and non-response, and the method used to handle missing data in the analyses. We have conducted a review of publications from cohort studies in order to document the reporting of missing data for exposure measures and to describe the statistical methods used to account for the missing data.
A systematic search of English language papers published from January 2000 to December 2009 was carried out in PubMed. Prospective cohort studies with a sample size greater than 1,000 that analysed data using repeated measures of exposure were included.
Among the 82 papers meeting the inclusion criteria, only 35 (43%) reported the amount of missing data according to the suggested guidelines. Sixty-eight papers (83%) described how they dealt with missing data in the analysis. Most of the papers excluded participants with missing data and performed a complete-case analysis (n = 54, 66%). Other papers used more sophisticated methods including multiple imputation (n = 5) or fully Bayesian modeling (n = 1). Methods known to produce biased results were also used, for example, Last Observation Carried Forward (n = 7), the missing indicator method (n = 1), and mean value substitution (n = 3). For the remaining 14 papers, the method used to handle missing data in the analysis was not stated.
This review highlights the inconsistent reporting of missing data in cohort studies and the continuing use of inappropriate methods to handle missing data in the analysis. Epidemiological journals should invoke the STROBE guidelines as a framework for authors so that the amount of missing data and how this was accounted for in the analysis is transparent in the reporting of cohort studies.
A growing number of cohort studies are establishing protocols to re-contact participants at various times during follow-up. These waves of data collection provide researchers with the opportunity to obtain information regarding changes in the participants’ exposure and outcome measures. Incorporating the repeated measures of the exposure in the epidemiological analysis is especially important if the current exposure (or change in exposure) is thought to be more predictive of the outcome than the participants’ baseline measurement [ 1 ] or the researcher is interested in assessing the effect of a cumulative exposure [ 2 ]. The time frames for these follow-up waves of data collection can vary from one to two years up to 20 to 30 years or even longer post-baseline. Repeated ascertainment of exposure and outcome measures over time can lead to missing data for reasons such as participants not being traceable, too sick to participate, withdrawing from the study, refusing to respond to certain questions or death [ 3 , 4 ]. In this paper we focus on missing data in exposure measures that are made repeatedly in a cohort study because studies of this type (in which the outcome is often a single episode of disease or death obtained from a registry and therefore, known for all participants) are common and increasingly important in chronic disease epidemiology. Further research is needed on the consequences of and best methods for handling missing data in such study designs, but simulation and case studies have shown that missing covariate data can lead to biased results and there may be gains in precision of estimation of effects if multiple imputation is used to handle missing covariate data [ 5 – 7 ].
If participants with missing data and complete data differ with respect to exposure and outcome, estimates of association based on fully observed cases (known as a complete-case analysis) might be biased. Further, the estimates from these analyses will have less precision than an analysis of all participants in the absence of missing data. As well as complete-case analysis, there are other methods available for dealing with missing data in the statistical analysis [ 8 , 9 ]. These include ad hoc methods such as Last Observation Carried Forward and the missing indicator method, and more advanced approaches such as multiple imputation and likelihood-based formulations.
The STROBE guidelines for reporting of observational studies, published in 2007, state that the method for handling missing data should be addressed and furthermore, that the number of individuals used for analysis at each stage of the study should be reported accompanied by reasons for non-participation or non-response [ 10 , 11 ]. The guidelines published by Sterne et al. [ 12 ], an extension to the STROBE guidelines, provide general recommendations for the reporting of missing data in any study affected by missing data and specific recommendations for reporting the details of multiple imputation.
In this paper we: 1) give a brief review of the statistical methods that have been proposed for handling missing data and when they may be appropriate; 2) review how missing exposure data have been reported in large cohort studies with one or more waves of follow-up, where the repeated waves of exposures were incorporated in the statistical analyses; and 3) report how the same studies dealt with missing data in the statistical analyses.
Complete-case analysis only includes in the analysis participants with complete data on all waves of data collection, thereby potentially reducing the precision of the estimates of the exposure-outcome associations [ 2 ]. The advantage of using complete-case analysis is that it is easily implemented, with most software packages using this method as the default. The estimates of the associations of interest may be biased if the participants with missing data are not similar to those with complete data. To be valid, complete-case analyses must assume that participants with missing data can be thought of as a random sample of those that were intended to be observed (commonly referred to in the missing data nomenclature as missing completely at random (MCAR) [ 13 ]), or at least that the likelihood of exposure being missing is independent of the outcome given the exposures [ 5 ].
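In code, a complete-case analysis amounts to dropping every participant with a missing value at any wave. A minimal illustration with invented data (pandas assumed):

```python
import numpy as np
import pandas as pd

# Invented example: one row per participant, exposure measured at two waves.
df = pd.DataFrame({
    "exposure_w1": [2.1, np.nan, 3.4, 1.8, 2.9],
    "exposure_w2": [2.3, 2.0, np.nan, 1.9, 3.0],
    "outcome":     [0, 1, 0, 0, 1],
})

# Complete-case analysis: keep only participants observed at every wave.
# Valid only if the dropped rows behave like a random sample (MCAR).
complete = df.dropna()
print(len(df), "->", len(complete))   # 5 -> 3 participants remain
```

This is also what most regression routines do silently by default, which is why so many papers that do not state a method are assumed below to have used a complete-case analysis.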
There are three commonly used ad hoc approaches for handling missing data, all of which can lead to bias [ 3 , 12 , 14 ]. The Last Observation Carried Forward (LOCF) method replaces the missing value in a wave of data collection with the non-missing value from the previous completed wave for the same individual. The assumption behind this approach is that the exposure status of the individual has not changed over time. The mean value substitution method replaces the missing value with the average value calculated over all the values available from the other waves of data collection for the same individual. Both LOCF and mean value substitution falsely increase the stated precision of the estimates by failing to account for the uncertainty due to the missing data and generally give biased results, even when the data are MCAR [ 7 , 15 ]. The Missing Indicator Method is applied to categorical exposures and includes an extra category of the exposure variable for those individuals with missing data. Indicator variables are created for the analysis, including an indicator for the missing data category [ 16 ]. This method is simple to implement, but also produces biased results in many settings, even when the data are MCAR [ 6 , 12 ].
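The three ad hoc methods are mechanically simple, which helps explain their persistence. A hypothetical three-wave example (all data invented) showing each in turn:

```python
import numpy as np
import pandas as pd

# Invented data: one row per participant, one column per wave of an exposure.
waves = pd.DataFrame({
    "w1": [70.0, 82.0, 65.0],
    "w2": [71.0, np.nan, 66.0],
    "w3": [np.nan, np.nan, 68.0],
})

# Last Observation Carried Forward: fill each wave from the previous wave,
# assuming (often wrongly) that exposure has not changed over time.
locf = waves.ffill(axis=1)

# Mean value substitution: replace each gap with the participant's own mean
# over the waves that were observed.
row_means = waves.mean(axis=1)
mean_sub = waves.apply(lambda col: col.fillna(row_means))

# Missing indicator method (categorical exposure): treat "missing" as an
# extra category, which later becomes its own indicator variable.
smoking = pd.Series(["never", np.nan, "current"], dtype="object")
with_indicator = smoking.fillna("missing")
```

None of these propagate the uncertainty due to the missing values, which is why they falsely inflate precision and can bias estimates even under MCAR.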
Multiple Imputation (MI) begins by imputing values for the missing data multiple times by sampling from an imputation model (using either chained equations [ 17 , 18 ] or a multivariate normal model [ 19 ]). The imputation model should contain the variables that are to be included in the statistical model used for the epidemiological analysis, as well as auxiliary variables that may contain information about the missing data; a "proper" imputation procedure incorporates appropriate variability in the imputed values. The imputation process creates multiple 'completed' versions of the dataset. These 'completed datasets' are analysed using the appropriate statistical model for the epidemiological analysis and the estimates obtained from each dataset are averaged to produce one overall MI estimate. The standard error for this overall MI estimate is derived using Rubin's rules, which account for both the within-imputation and between-imputation variability of the estimates obtained from the separate analyses of the 'completed datasets' [ 3 , 13 ]. By accounting for the variability between the completed (imputed) datasets, MI produces a valid estimate of the precision of the final MI estimate. When the imputation is performed using standard methods that are now available in many packages, with appropriate model specifications to reflect the structure of the data, the resulting MI estimate will be valid (unbiased parameter estimates with nominal confidence interval coverage) if the missing data are 'Missing At Random' (MAR) [ 5 ]. MAR describes a situation where the probability of being missing for a particular variable (e.g. waist circumference) can be explained by other observed variables in the dataset, but is (conditionally) independent of the variable itself (that is, waist circumference) [ 13 ].
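The pooling step described above follows directly from Rubin's rules. A sketch with invented per-imputation results (the estimates and variances below are made up for illustration):

```python
import numpy as np

# Hypothetical results from M = 5 'completed' datasets: a regression
# coefficient estimate and its squared standard error from each analysis.
estimates = np.array([0.52, 0.48, 0.55, 0.50, 0.45])
variances = np.array([0.010, 0.012, 0.011, 0.009, 0.013])  # SE^2 per dataset

M = len(estimates)
pooled = estimates.mean()            # overall MI estimate (simple average)
W = variances.mean()                 # within-imputation variance
B = estimates.var(ddof=1)            # between-imputation variance
T = W + (1 + 1 / M) * B              # Rubin's total variance
pooled_se = np.sqrt(T)
```

The between-imputation term B is what the single-imputation and ad hoc methods omit, and it is the reason MI yields honest standard errors.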
On the other hand, MI may produce biased estimates if the data are ‘Missing Not At Random’ (MNAR), which occurs when the study participants with missing data differ from the study participants with complete data in a manner that cannot be explained by the observed data in the study [ 13 ].
MI is now implemented in many major statistical packages (including Stata [ 20 ] and SAS [ 21 ]) making it an easily accessible method. However, it can be a time-intensive process to impute multiple datasets, analyse the ‘completed datasets’ and combine the results; and the imputation model can be complex since it must contain the exposure and outcome variables included in the analysis model, auxiliary variables and any interactions that will be included in the final analysis model [ 22 , 23 ]. Sterne et al. [ 12 ] have described a number of pitfalls that can be encountered in the imputation procedure that might lead to biased results for the epidemiological analysis of interest.
Missing data can also be handled with the following more sophisticated methods: maximum likelihood-based formulations, fully Bayesian models and weighting methods. Likelihood-based methods use all of the available information (i.e. information from participants with both complete and incomplete data) to simultaneously estimate both the missing data model and the data analysis model, eliminating the need to handle the missing data directly [ 3 , 8 , 24 , 25 ], although in many cases the MAR assumption is also invoked to enable the missing data model to be ignored. Bayesian models also rely on a fully specified model that incorporates both the missingness process and the associations of interest [ 12 , 15 , 26 ]. Weighting methods apply weights that correspond to the inverse probability of a data observation being observed, to the observed data to account for the missing data [ 22 , 25 ]. These methods may improve the precision of the estimates compared with complete-case analysis. However, they are also dependent on assumptions about the missingness mechanism and in some cases on specifying the correct missingness model. In general, these methods require tailored programming which can be time consuming and requires specialist expertise [ 15 ].
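The inverse-probability weighting idea can be illustrated with a small simulation. Everything here is invented for illustration: the exposure is set missing with probability depending on the fully observed outcome (an MAR mechanism, matching the registry-outcome designs discussed above), observation probabilities are estimated crudely within outcome bins, and a weighted least-squares fit is run on the complete cases:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)                       # exposure
y = 1.0 + 0.5 * x + rng.normal(size=n)       # outcome, known for everyone

# Exposure is missing with probability depending on the observed outcome.
p_obs = 1.0 / (1.0 + np.exp(-(1.5 - 0.8 * y)))
observed = rng.random(n) < p_obs
df = pd.DataFrame({"x": x, "y": y, "obs": observed})

# Estimate P(observed) within outcome bins, then weight each complete case
# by the inverse of its estimated probability of being observed.
df["y_bin"] = pd.qcut(df["y"], 10, labels=False)
df["w"] = 1.0 / df.groupby("y_bin")["obs"].transform("mean")

cc = df[df["obs"]]
X = np.column_stack([np.ones(len(cc)), cc["x"].to_numpy()])
w = cc["w"].to_numpy()
# Weighted least squares of y on x among the complete cases.
beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (cc["y"].to_numpy() * w))
```

In practice the observation probabilities would come from a fitted missingness model (e.g. logistic regression), and, as noted above, the validity of the weighted estimate depends on that model being correctly specified.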
For this review we selected prospective cohort studies that analysed exposure data collected after initial recruitment during the follow-up period (i.e. studies looking at a change in exposure or at a time varying covariate). We restricted our review to cohort studies with more than 1,000 participants, as we thought it was more likely for there to be more missing data in follow-up measurements of exposures in large cohort studies (typically population based studies) compared to small cohorts (often based on a specific clinical population). For cohort studies reported in multiple papers, we included only the most recent original research article. Studies that only used data collected at baseline or at one of the follow-up waves in the analysis, and studies that newly recruited participants at one of the waves after baseline were excluded. We did not place any restrictions on the types of exposures or outcomes studied or the type of statistical analysis performed.
PubMed was searched for English language papers published between January 2000 and December 2009. We chose January 2000 as a starting date because the first widely available statistical software package for implementing MI, the NORM package [ 27 ], was developed in 1997 and updated in 1999. Search terms included: “Cohort Studies”[MeSh] AND (“attrition” OR “drop out” OR “longitudinal change” OR “missing data” OR “multiple exposure” OR “multiple follow-up” OR “multiple waves” OR “repeated exposure” OR “repeated follow-up” OR “repeated waves” OR “repeated measures” OR “time dependent covariates” OR “time dependent” OR “time varying covariate” OR “cumulative average”).
We carried out a further search of cohort studies listed in the web appendix of the paper by Lewington et al. [ 28 ], to ensure that any known large cohort studies were not missed in the original PubMed search. These cohort studies were established in the 1970s and 1980s, allowing them time to measure repeated waves of exposure on their participants and to publish these results during our study period (i.e. between 2000 and 2009).
AK reviewed all articles; any uncertainties regarding the statistical method used to handle the missing data were resolved by discussion with JAS, and AK extracted the data. Additional tables and methods sections from journal websites were checked if referred to in the article.
Our aim was to assess the reporting of missing data and the methods used to handle the missing data according to the recommendations given by the STROBE guidelines [ 10 , 11 ] and Sterne et al. [ 12 ]. The information extracted is summarised in Tables 1 and 2 and Additional file 1 : Table S1.
We identified 4,277 articles via the keyword search. A total of 3,684 articles were excluded based on their title and abstract, leaving 543 articles for further evaluation. Of these, 471 articles were excluded and 72 articles were found to be appropriate for the review. A further ten studies were identified from the reference list of Lewington et al. [ 28 ] (Figure 1 ), giving 82 studies included in this review. The reasons for excluding studies are outlined in Figure 1 ; the most common were a sample size of less than 1,000 participants (54%), a study design that was not a prospective cohort (19%), and not reporting original research findings (13%).
Figure 1: Search results.
The characteristics of the 82 studies included are summarised in Table 1 and further details can be found in the additional table (see Additional file 1 : Table S1). The studies included ranged from smaller studies that recruited 1,000 to 2,000 participants at baseline to larger studies with more than 20,000 participants, and the number published annually increased steadily from two papers in 2000 to 16 papers in 2009. The majority of studies recruited their participants in the decades 1980 to 1989 (n = 25) and 1990 to 1999 (n = 30). Cox proportional hazards regression was the most common statistical method used for the epidemiological analysis of the repeated measures of exposure (n = 37), with 35 of these papers incorporating the repeated exposure(s) as a time varying covariate and the remaining two papers including a single measure of the covariate derived from repeated assessments. Generalised Estimating Equations with logistic (n = 10) or linear regression (n = 3) and generalised linear mixed-effects models (logistic regression (n = 3) and linear regression (n = 13)) were the next most common epidemiological analyses used.
The methods used by the selected papers for handling missing data are summarised in Table 2 . Sixty-six papers (80%) commented on the amount of missing data at follow-up. Of these, only 35 papers provided information about the proportion of participants lost to follow-up at each wave. The remaining 31 papers provided incomplete details about the amount of missing data at each wave: 22 papers made a general comment about the amount of missing data; six papers reported the amount of missing data for the final wave but gave no detail regarding the number of participants available at previous waves of data collection (including baseline); and three papers only reported the amount of missing data for a few of the variables.
Of the 29 papers published after 2007, nine papers did not state the proportion of missing data at each follow up wave, three papers provided a comment as to why the data were missing and eight papers compared the baseline covariates for those with and without missing covariate data at the repeated waves of follow up.
Among those papers that provided information on missing data, the proportion of covariate data missing at any follow-up wave ranged from 2% to 65%. Twenty-six papers (32%) compared the key variables of interest for those who did and did not have data from post-baseline waves, but only six of these presented the results in detail while the rest commented briefly in the text on whether or not there was a difference.
The most common methods used to deal with missing data were complete-case analysis (n = 54), LOCF (n = 7) and MI (n = 5). Of the 54 papers that used complete-case analysis: 38 excluded participants who were missing exposure data at any of the waves of data collection from the analysis; one paper also excluded participants with any missing exposure data but used a weighted analysis to deal with the missing data; and the remaining 15 papers, where both the exposure and outcome measures were assessed repeatedly at each wave of data collection, excluded participant data records for waves where the exposure data were missing. Fourteen papers did not state the method used to deal with the missing data, although nine of these papers performed a Cox regression model using SAS [ 21 ] or Stata [ 20 ] and we therefore assumed that they used a complete-case analysis (Table 2 ). Both papers published in 2000 used complete-case analysis. From 2001 to 2009, the proportion of papers using complete-case analysis ranged from 25% to 65%. Methods known to produce biased results (i.e. LOCF, the missing indicator method and mean value substitution) continue to be used, with four papers using these methods in 2009.
Of the five papers that used MI [ 29 – 33 ], two papers [ 29 , 30 ] compared the characteristics of the participants with and without missing data. For the MI, three of the five papers [ 30 , 31 , 33 ] provided details of the imputation process including the number of imputations performed and the variables included in the imputation model, and compared the results from the MI analysis to results from complete-case analysis. The other two papers [ 29 , 32 ] provided details about the number of imputations performed but did not describe the variables included in their imputation model and did not compare the MI results to the complete-case analysis.
We identified 82 cohort studies of 1,000 or more participants that were published from 2000 to 2009 and which analysed exposure data collected from repeated follow-up waves. The reporting of missing data in these studies was found to be inconsistent and generally did not follow the recommendations set out by the STROBE guidelines [ 10 , 11 ] or the guidelines set out by Sterne et al. [ 12 ]. The STROBE guidelines recommend that authors report the number of participants who take part in each wave of the study and give reasons why participants did not attend a wave. Only three papers [ 30 , 34 , 35 ] followed the STROBE guidelines fully. The majority of papers did not provide a reason or comment for why study participants did not attend each wave of follow-up. Sterne et al. [ 12 ] recommend that the reasons for missing data be described with respect to other variables and that authors investigate potentially important differences between participants and non-participants.
The STROBE guidelines were published in 2007. Of the nine papers published after 2007, only one followed the STROBE guidelines fully. This suggests that either journal editors are not using these guidelines or authors are not considering the impact of missing covariate data in their research.
A review of missing data in cancer prognostic studies published in 2004 by Burton et al. [ 36 ] and a review of developmental psychology studies published in 2009 by Jelicic et al. [ 3 ] reported similar findings to ours. Burton et al. [ 36 ] found a deficiency in the reporting of missing covariate data in cancer prognostic studies. After reviewing 100 articles, they found that only 40% of articles provided information about the method used to handle missing covariate data and only 12 articles would have satisfied their proposed guidelines for the reporting of missing data. We observed in our review, of articles published from 2000 to 2009, that a larger proportion of articles reported the method used to handle the missing data in the analysis but that many articles were still not reporting the amount of missing data and the reasons for missingness.
The cohort studies we identified used numerous methods to handle missing data in the exposure-outcome analyses. Although some studies used advanced statistical modelling procedures (e.g. MI and Bayesian), the majority removed individuals with missing data and performed a complete-case analysis; a method that may produce biased results if the missing data are not MCAR. Jelicic et al. also found in their review that a large proportion of studies used complete-case analysis to handle their missing data [ 3 ]. For studies with a large proportion of missing data, excluding participants with missing data may also reduce the precision of the analysis substantially. Ad hoc methods (e.g. LOCF, the missing indicator method and mean value substitution), which are generally not recommended [ 16 , 25 ] because they fail to account for the uncertainty in the data and may produce biased estimates [ 12 ], continue to be used. Although MI is becoming more accessible, only five studies used this method. The reporting of the imputation procedure was inconsistent and often incomplete. This was also observed by two independent reviews of the reporting of MI in the medical journals BMJ, JAMA, Lancet, and the New England Journal of Medicine [ 12 , 37 ]. Future studies should follow the recommendations outlined by Sterne et al. [ 12 ] to ensure that enough details are provided about the MI procedure, especially the implementation and details of the imputation modelling process.
We aimed to complete a comprehensive review of all papers published that analysed exposure variables measured at multiple follow-up waves. Several keywords were used in order to obtain as many articles as possible. The keyword search was then supplemented with cohort studies identified from a pooled analysis of 61 cohort studies. Although a large number of abstracts and studies were identified, some cohort studies might have been missed. If multiple papers were identified from one study, the most recent article was included in the review, which might have led us to omit papers from the same study that used a more appropriate missing data method. Our search criteria only included papers written in English and only PubMed was searched. Our search strategy was limited to articles published between 2000 and 2009. On average three papers of the type we focussed on were published each year from 2000 to 2002 and the number has increased since then, so it seems unlikely that many papers were published before this time. Also, MI was not as accessible prior to 1997, so papers published before 2000 were more likely to have used complete case analysis or other ad hoc methods.
With the increase in the number of cohort studies analysing data with multiple follow-up waves, it is essential that authors follow the STROBE guidelines [10, 11], in conjunction with the guidelines proposed by Sterne et al. [12], to report the amount of missing data in the study and the methods used to handle the missing data in the analyses. This will ensure that missing data are reported in enough detail to allow readers to assess the validity of the results. Incomplete data, and the statistical methods used to deal with them, can lead to biased or inefficient estimates, so authors should be encouraged to use online supplements (if necessary) to publish both the details of the missing data in their study and the methods used to handle them.
Cupples LA, D’Agostino RB, Anderson K, Kannel WB: Comparison of baseline and repeated measure covariate techniques in the Framingham Heart Study. Stat Med. 1988, 7: 205-222. 10.1002/sim.4780070122.
Shortreed SM, Forbes AB: Missing data in the exposure of interest and marginal structural models: a simulation study based on the Framingham Heart Study. Stat Med. 2010, 29: 431-443.
Jelicic H, Phelps E, Lerner RM: Use of missing data methods in longitudinal studies: the persistence of bad practices in developmental psychology. Dev Psychol. 2009, 45: 1195-1199.
Kurland BF, Johnson LL, Egleston BL, Diehr PH: Longitudinal data with follow-up truncated by death: match the analysis method to research aims. Stat Sci. 2009, 24: 211. 10.1214/09-STS293.
White IR, Carlin JB: Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat Med. 2010
Knol MJ, Janssen KJ, Donders AR, Egberts AC, Heerdink ER, Grobbee DE, Moons KG, Geerlings MI: Unpredictable bias when using the missing indicator method or complete case analysis for missing confounder values: an empirical example. J Clin Epidemiol. 2010, 63: 728-736. 10.1016/j.jclinepi.2009.08.028.
Molenberghs G, Thijs H, Jansen I, Beunckens C, Kenward MG, Mallinckrodt C, Carroll RJ: Analyzing incomplete longitudinal clinical trial data. Biostatistics. 2004, 5: 445-464. 10.1093/biostatistics/kxh001.
Schafer JL, Graham JW: Missing data: our view of the state of the art. Psychol Methods. 2002, 7: 147-177.
Carpenter J, Kenward MG: A critique of common approaches to missing data. 2007, National Institute for Health Research, Birmingham, AL
von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, Vandenbroucke JP: The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet. 2007, 370: 1453-1457. 10.1016/S0140-6736(07)61602-X.
Vandenbroucke JP, von Elm E, Altman DG, Gotzsche PC, Mulrow CD, Pocock SJ, Poole C, Schlesselman JJ, Egger M: Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. PLoS Med. 2007, 4: e297. 10.1371/journal.pmed.0040297.
Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009, 338: b2393. 10.1136/bmj.b2393.
Little RJA, Rubin DB: Statistical analysis with missing data. 2002, John Wiley & Sons, Inc, Hoboken, New Jersey
Rubin DB: Multiple imputation for nonresponse in surveys. 1987, John Wiley & Sons, New York
Buhi ER, Goodson P, Neilands TB: Out of sight, not out of mind: strategies for handling missing data. Am J Heal Behav. 2008, 32: 83-92.
Greenland S, Finkle WD: A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. 1995, 142: 1255-1264.
Azur MJ, Stuart EA, Frangakis C, Leaf PJ: Multiple imputation by chained equations: what is it and how does it work?. Int J Methods Psychiatr Res. 2011, 20: 40-49. 10.1002/mpr.329.
White IR, Royston P, Wood AM: Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011, 30: 377-399. 10.1002/sim.4067.
Lee KJ, Carlin JB: Multiple imputation for missing data: fully conditional specification versus multivariate normal imputation. Am J Epidemiol. 2010, 171: 624-632. 10.1093/aje/kwp425.
StataCorp: Stata Statistical Software: Release 11. 2009, StataCorp LP, College Station, TX
SAS Institute Inc: SAS OnlineDoc, Version 8. 2000, SAS Institute, Inc., Cary, NC
Carpenter JR, Kenward MG, Vansteelandt S: A comparison of multiple imputation and doubly robust estimation for analyses with missing data. J R Stat Soc: Series A (Stat Soc). 2006, 169: 571-584.
Graham JW: Missing data analysis: making it work in the real world. Annu Rev Psychol. 2009, 60: 549-576. 10.1146/annurev.psych.58.110405.085530.
Enders CK: A primer on maximum likelihood algorithms available for use with missing data. Struct Equ Model. 2001, 8: 128-141. 10.1207/S15328007SEM0801_7.
Horton NJ, Kleinman KP: Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat. 2007, 61: 79-90. 10.1198/000313007X172556.
Baraldi AN, Enders CK: An introduction to modern missing data analyses. J Sch Psychol. 2010, 48: 5-37. 10.1016/j.jsp.2009.10.001.
Schafer J, Yucel R: PAN: Multiple imputation for multivariate panel data (Software). 1999, 86: 949-955.
Lewington S, Clarke R, Qizilbash N, Peto R, Collins R: Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. Lancet. 2002, 360: 1903-1913.
Bond GE, Burr RL, McCurry SM, Rice MM, Borenstein AR, Larson EB: Alcohol and cognitive performance: a longitudinal study of older Japanese Americans. The Kame Project. Int Psychogeriatr. 2005, 17: 653-668. 10.1017/S1041610205001651.
Kivimaki M, Lawlor DA, Singh-Manoux A, Batty GD, Ferrie JE, Shipley MJ, Nabi H, Sabia S, Marmot MG, Jokela M: Common mental disorder and obesity: insight from four repeat measures over 19 years: prospective Whitehall II cohort study. BMJ. 2009, 339: b3765. 10.1136/bmj.b3765.
McCormack VA, Dos Santos Silva I, De Stavola BL, Perry N, Vinnicombe S, Swerdlow AJ, Hardy R, Kuh D: Life-course body size and perimenopausal mammographic parenchymal patterns in the MRC 1946 British birth cohort. Br J Cancer. 2003, 89: 852-859. 10.1038/sj.bjc.6601207.
Sugihara Y, Sugisawa H, Shibata H, Harada K: Productive roles, gender, and depressive symptoms: evidence from a national longitudinal study of late-middle-aged Japanese. J Gerontol B Psychol Sci Soc Sci. 2008, 63: P227-P234. 10.1093/geronb/63.4.P227.
Wiles NJ, Haase AM, Gallacher J, Lawlor DA, Lewis G: Physical activity and common mental disorder: results from the Caerphilly study. Am J Epidemiol. 2007, 165: 946-954. 10.1093/aje/kwk070.
Fuhrer R, Dufouil C, Dartigues JF: Exploring sex differences in the relationship between depressive symptoms and dementia incidence: prospective results from the PAQUID Study. J Am Geriatr Soc. 2003, 51: 1055-1063. 10.1046/j.1532-5415.2003.51352.x.
Sogaard AJ, Meyer HE, Tonstad S, Haheim LL, Holme I: Weight cycling and risk of forearm fractures: a 28-year follow-up of men in the Oslo Study. Am J Epidemiol. 2008, 167: 1005-1013. 10.1093/aje/kwm384.
Burton A, Altman DG: Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. Br J Cancer. 2004, 91: 4-8. 10.1038/sj.bjc.6601907.
Mackinnon A: The use and reporting of multiple imputation in medical research - a review. J Intern Med. 2010, 268: 586-593. 10.1111/j.1365-2796.2010.02274.x.
Agrawal A, Grant JD, Waldron M, Duncan AE, Scherrer JF, Lynskey MT, Madden PA, Bucholz KK, Heath AC: Risk for initiation of substance use as a function of age of onset of cigarette, alcohol and cannabis use: findings in a Midwestern female twin cohort. Prev Med. 2006, 43: 125-128. 10.1016/j.ypmed.2006.03.022.
Anstey KJ, Hofer SM, Luszcz MA: Cross-sectional and longitudinal patterns of dedifferentiation in late-life cognitive and sensory function: the effects of age, ability, attrition, and occasion of measurement. J Exp Psychol Gen. 2003, 132: 470-487.
Arifeen S, Black RE, Antelman G, Baqui A, Caulfield L, Becker S: Exclusive breastfeeding reduces acute respiratory infection and diarrhea deaths among infants in Dhaka slums. Pediatrics. 2001, 108: E67. 10.1542/peds.108.4.e67.
Bada HS, Das A, Bauer CR, Shankaran S, Lester B, LaGasse L, Hammond J, Wright LL, Higgins R: Impact of prenatal cocaine exposure on child behavior problems through school age. Pediatrics. 2007, 119: e348-359. 10.1542/peds.2006-1404.
Beesdo K, Bittner A, Pine DS, Stein MB, Hofler M, Lieb R, Wittchen HU: Incidence of social anxiety disorder and the consistent risk for secondary depression in the first three decades of life. Arch Gen Psychiatry. 2007, 64: 903-912. 10.1001/archpsyc.64.8.903.
Berecki-Gisolf J, Begum N, Dobson AJ: Symptoms reported by women in midlife: menopausal transition or aging?. Menopause. 2009, 16: 1021-1029. 10.1097/gme.0b013e3181a8c49f.
Blazer DG, Sachs-Ericsson N, Hybels CF: Perception of unmet basic needs as a predictor of depressive symptoms among community-dwelling older adults. J Gerontol A Biol Sci Med Sci. 2007, 62: 191-195. 10.1093/gerona/62.2.191.
Bray JW, Zarkin GA, Ringwalt C, Qi J: The relationship between marijuana initiation and dropping out of high school. Health Econ. 2000, 9: 9-18. 10.1002/(SICI)1099-1050(200001)9:1<9::AID-HEC471>3.0.CO;2-Z.
Breslau N, Schultz LR, Johnson EO, Peterson EL, Davis GC: Smoking and the risk of suicidal behavior: a prospective study of a community sample. Arch Gen Psychiatry. 2005, 62: 328-334. 10.1001/archpsyc.62.3.328.
Brown JW, Liang J, Krause N, Akiyama H, Sugisawa H, Fukaya T: Transitions in living arrangements among elders in Japan: does health make a difference?. J Gerontol B Psychol Sci Soc Sci. 2002, 57: S209-220. 10.1093/geronb/57.4.S209.
Bruckl TM, Wittchen HU, Hofler M, Pfister H, Schneider S, Lieb R: Childhood separation anxiety and the risk of subsequent psychopathology: Results from a community study. Psychother Psychosom. 2007, 76: 47-56. 10.1159/000096364.
Cauley JA, Lui LY, Barnes D, Ensrud KE, Zmuda JM, Hillier TA, Hochberg MC, Schwartz AV, Yaffe K, Cummings SR, Newman AB: Successful skeletal aging: a marker of low fracture risk and longevity. The Study of Osteoporotic Fractures (SOF). J Bone Miner Res. 2009, 24: 134-143. 10.1359/jbmr.080813.
Celentano DD, Munoz A, Cohn S, Vlahov D: Dynamics of behavioral risk factors for HIV/AIDS: a 6-year prospective study of injection drug users. Drug Alcohol Depend. 2001, 61: 315-322. 10.1016/S0376-8716(00)00154-X.
Chao C, Jacobson LP, Tashkin D, Martinez-Maza O, Roth MD, Margolick JB, Chmiel JS, Holloway MN, Zhang ZF, Detels R: Recreational amphetamine use and risk of HIV-related non-Hodgkin lymphoma. Cancer Causes Control. 2009, 20: 509-516. 10.1007/s10552-008-9258-y.
Cheung YB, Khoo KS, Karlberg J, Machin D: Association between psychological symptoms in adults and growth in early life: longitudinal follow up study. BMJ. 2002, 325: 749-10.1136/bmj.325.7367.749.
Chien KL, Hsu HC, Sung FC, Su TC, Chen MF, Lee YT: Hyperuricemia as a risk factor on cardiovascular events in Taiwan: the Chin-Shan Community Cardiovascular Cohort Study. Atherosclerosis. 2005, 183: 147-155. 10.1016/j.atherosclerosis.2005.01.018.
Clays E, De Bacquer D, Leynen F, Kornitzer M, Kittel F, De Backer G: Job stress and depression symptoms in middle-aged workers–prospective results from the Belstress study. Scand J Work Environ Health. 2007, 33: 252-259. 10.5271/sjweh.1140.
Conron KJ, Beardslee W, Koenen KC, Buka SL, Gortmaker SL: A longitudinal study of maternal depression and child maltreatment in a national sample of families investigated by child protective services. Arch Pediatr Adolesc Med. 2009, 163: 922-930. 10.1001/archpediatrics.2009.176.
Cuddy TE, Tate RB: Sudden unexpected cardiac death as a function of time since the detection of electrocardiographic and clinical risk factors in apparently healthy men: the Manitoba Follow-Up Study, 1948 to 2004. Can J Cardiol. 2006, 22: 205-211. 10.1016/S0828-282X(06)70897-2.
Daniels MC, Adair LS: Growth in young Filipino children predicts schooling trajectories through high school. J Nutr. 2004, 134: 1439-1446.
de Mutsert R, Grootendorst DC, Boeschoten EW, Brandts H, van Manen JG, Krediet RT, Dekker FW: Subjective global assessment of nutritional status is strongly associated with mortality in chronic dialysis patients. Am J Clin Nutr. 2009, 89: 787-793. 10.3945/ajcn.2008.26970.
De Stavola BL, Meade TW: Long-term effects of hemostatic variables on fatal coronary heart disease: 30-year results from the first prospective Northwick Park Heart Study (NPHS-I). J Thromb Haemost. 2007, 5: 461-471. 10.1111/j.1538-7836.2007.02330.x.
Di Nisio M, Barbui T, Di Gennaro L, Borrelli G, Finazzi G, Landolfi R, Leone G, Marfisi R, Porreca E, Ruggeri M, et al: The haematocrit and platelet target in polycythemia vera. Br J Haematol. 2007, 136: 249-259. 10.1111/j.1365-2141.2006.06430.x.
Engberg J, Morral AR: Reducing substance use improves adolescents’ school attendance. Addiction. 2006, 101: 1741-1751. 10.1111/j.1360-0443.2006.01544.x.
Fergusson DM, Boden JM, Horwood LJ: The developmental antecedents of illicit drug use: evidence from a 25-year longitudinal study. Drug Alcohol Depend. 2008, 96: 165-177. 10.1016/j.drugalcdep.2008.03.003.
Fung TT, Malik V, Rexrode KM, Manson JE, Willett WC, Hu FB: Sweetened beverage consumption and risk of coronary heart disease in women. Am J Clin Nutr. 2009, 89: 1037-1042. 10.3945/ajcn.2008.27140.
Gallo WT, Bradley EH, Dubin JA, Jones RN, Falba TA, Teng HM, Kasl SV: The persistence of depressive symptoms in older workers who experience involuntary job loss: results from the health and retirement survey. J Gerontol B Psychol Sci Soc Sci. 2006, 61: S221-228. 10.1093/geronb/61.4.S221.
Gauderman WJ, Avol E, Gilliland F, Vora H, Thomas D, Berhane K, McConnell R, Kuenzli N, Lurmann F, Rappaport E, et al: The effect of air pollution on lung development from 10 to 18 years of age. N Engl J Med. 2004, 351: 1057-1067. 10.1056/NEJMoa040610.
Glotzer TV, Daoud EG, Wyse DG, Singer DE, Ezekowitz MD, Hilker C, Miller C, Qi D, Ziegler PD: The relationship between daily atrial tachyarrhythmia burden from implantable device diagnostics and stroke risk: the TRENDS study. Circ Arrhythm Electrophysiol. 2009, 2: 474-480. 10.1161/CIRCEP.109.849638.
Gunderson EP, Jacobs DR, Chiang V, Lewis CE, Tsai A, Quesenberry CP, Sidney S: Childbearing is associated with higher incidence of the metabolic syndrome among women of reproductive age controlling for measurements before pregnancy: the CARDIA study. Am J Obstet Gynecol. 2009, 201 (177): e171-179.
Haag MD, Bos MJ, Hofman A, Koudstaal PJ, Breteler MM, Stricker BH: Cyclooxygenase selectivity of nonsteroidal anti-inflammatory drugs and risk of stroke. Arch Intern Med. 2008, 168: 1219-1224. 10.1001/archinte.168.11.1219.
Hart CL, Hole DJ, Davey Smith G: Are two really better than one? Empirical examination of repeat blood pressure measurements and stroke risk in the Renfrew/Paisley and collaborative studies. Stroke. 2001, 32: 2697-2699. 10.1161/hs1101.098637.
Hogg RS, Bangsberg DR, Lima VD, Alexander C, Bonner S, Yip B, Wood E, Dong WW, Montaner JS, Harrigan PR: Emergence of drug resistance is associated with an increased risk of death among patients first starting HAART. PLoS Med. 2006, 3: e356. 10.1371/journal.pmed.0030356.
Jacobs EJ, Thun MJ, Connell CJ, Rodriguez C, Henley SJ, Feigelson HS, Patel AV, Flanders WD, Calle EE: Aspirin and other nonsteroidal anti-inflammatory drugs and breast cancer incidence in a large U.S. cohort. Cancer Epidemiol Biomarkers Prev. 2005, 14: 261-264.
Jamrozik E, Knuiman MW, James A, Divitini M, Musk AW: Risk factors for adult-onset asthma: a 14-year longitudinal study. Respirology. 2009, 14: 814-821. 10.1111/j.1440-1843.2009.01562.x.
Jimenez M, Krall EA, Garcia RI, Vokonas PS, Dietrich T: Periodontitis and incidence of cerebrovascular disease in men. Ann Neurol. 2009, 66: 505-512. 10.1002/ana.21742.
Juhaeri, Stevens J, Chambless LE, Nieto FJ, Jones D, Schreiner P, Arnett D, Cai J: Associations of weight loss and changes in fat distribution with the remission of hypertension in a bi-ethnic cohort: the Atherosclerosis Risk in Communities Study. Prev Med. 2003, 36: 330-339. 10.1016/S0091-7435(02)00063-4.
Karlamangla A, Zhou K, Reuben D, Greendale G, Moore A: Longitudinal trajectories of heavy drinking in adults in the United States of America. Addiction. 2006, 101: 91-99. 10.1111/j.1360-0443.2005.01299.x.
Keller MC, Neale MC, Kendler KS: Association of different adverse life events with distinct patterns of depressive symptoms. Am J Psychiatry. 2007, 164: 1521-1529. 10.1176/appi.ajp.2007.06091564. quiz 1622
Kersting RC: Impact of social support, diversity, and poverty on nursing home utilization in a nationally representative sample of older Americans. Soc Work Health Care. 2001, 33: 67-87. 10.1300/J010v33n02_05.
Lacson E, Wang W, Lazarus JM, Hakim RM: Change in vascular access and mortality in maintenance hemodialysis patients. Am J Kidney Dis. 2009, 54: 912-921. 10.1053/j.ajkd.2009.07.008.
Lamarca R, Ferrer M, Andersen PK, Liestol K, Keiding N, Alonso J: A changing relationship between disability and survival in the elderly population: differences by age. J Clin Epidemiol. 2003, 56: 1192-1201. 10.1016/S0895-4356(03)00201-4.
Lawson DW, Mace R: Sibling configuration and childhood growth in contemporary British families. Int J Epidemiol. 2008, 37: 1408-1421. 10.1093/ije/dyn116.
Lee DH, Ha MH, Kam S, Chun B, Lee J, Song K, Boo Y, Steffen L, Jacobs DR: A strong secular trend in serum gamma-glutamyltransferase from 1996 to 2003 among South Korean men. Am J Epidemiol. 2006, 163: 57-65.
Lee DS, Evans JC, Robins SJ, Wilson PW, Albano I, Fox CS, Wang TJ, Benjamin EJ, D’Agostino RB, Vasan RS: Gamma glutamyl transferase and metabolic syndrome, cardiovascular disease, and mortality risk: the Framingham Heart Study. Arterioscler Thromb Vasc Biol. 2007, 27: 127-133. 10.1161/01.ATV.0000251993.20372.40.
Li G, Higdon R, Kukull WA, Peskind E, Van Valen Moore K, Tsuang D, van Belle G, McCormick W, Bowen JD, Teri L, et al: Statin therapy and risk of dementia in the elderly: a community-based prospective cohort study. Neurology. 2004, 63: 1624-1628. 10.1212/01.WNL.0000142963.90204.58.
Li LW, Conwell Y: Effects of changes in depressive symptoms and cognitive functioning on physical disability in home care elders. J Gerontol A Biol Sci Med Sci. 2009, 64: 230-236.
Limburg PJ, Anderson KE, Johnson TW, Jacobs DR, Lazovich D, Hong CP, Nicodemus KK, Folsom AR: Diabetes mellitus and subsite-specific colorectal cancer risks in the Iowa Women’s Health Study. Cancer Epidemiol Biomarkers Prev. 2005, 14: 133-137.
Luchenski S, Quesnel-Vallee A, Lynch J: Differences between women’s and men’s socioeconomic inequalities in health: longitudinal analysis of the Canadian population, 1994–2003. J Epidemiol Community Health. 2008, 62: 1036-1044. 10.1136/jech.2007.068908.
Melamed ML, Eustace JA, Plantinga L, Jaar BG, Fink NE, Coresh J, Klag MJ, Powe NR: Changes in serum calcium, phosphate, and PTH and the risk of death in incident dialysis patients: a longitudinal study. Kidney Int. 2006, 70: 351-357. 10.1038/sj.ki.5001542.
Menotti A, Lanti M, Kromhout D, Blackburn H, Jacobs D, Nissinen A, Dontas A, Kafatos A, Nedeljkovic S, Adachi H: Homogeneity in the relationship of serum cholesterol to coronary deaths across different cultures: 40-year follow-up of the Seven Countries Study. Eur J Cardiovasc Prev Rehabil. 2008, 15: 719-725. 10.1097/HJR.0b013e328315789c.
Michaelsson K, Olofsson H, Jensevik K, Larsson S, Mallmin H, Berglund L, Vessby B, Melhus H: Leisure physical activity and the risk of fracture in men. PLoS Med. 2007, 4: e199. 10.1371/journal.pmed.0040199.
Michaud DS, Liu Y, Meyer M, Giovannucci E, Joshipura K: Periodontal disease, tooth loss, and cancer risk in male health professionals: a prospective cohort study. Lancet Oncol. 2008, 9: 550-558. 10.1016/S1470-2045(08)70106-2.
Mirzaei M, Taylor R, Morrell S, Leeder SR: Predictors of blood pressure in a cohort of school-aged children. Eur J Cardiovasc Prev Rehabil. 2007, 14: 624-629. 10.1097/HJR.0b013e32828621c6.
Mishra GD, McNaughton SA, Bramwell GD, Wadsworth ME: Longitudinal changes in dietary patterns during adult life. Br J Nutr. 2006, 96: 735-744.
Monda KL, Adair LS, Zhai F, Popkin BM: Longitudinal relationships between occupational and domestic physical activity patterns and body weight in China. Eur J Clin Nutr. 2008, 62: 1318-1325. 10.1038/sj.ejcn.1602849.
Moss SE, Klein R, Klein BE: Long-term incidence of dry eye in an older population. Optom Vis Sci. 2008, 85: 668-674. 10.1097/OPX.0b013e318181a947.
Nabi H, Consoli SM, Chastang JF, Chiron M, Lafont S, Lagarde E: Type A behavior pattern, risky driving behaviors, and serious road traffic accidents: a prospective study of the GAZEL cohort. Am J Epidemiol. 2005, 161: 864-870. 10.1093/aje/kwi110.
Nakano T, Tatemichi M, Miura Y, Sugita M, Kitahara K: Long-term physiologic changes of intraocular pressure: a 10-year longitudinal analysis in young and middle-aged Japanese men. Ophthalmology. 2005, 112: 609-616. 10.1016/j.ophtha.2004.10.046.
Nowicki MJ, Vigen C, Mack WJ, Seaberg E, Landay A, Anastos K, Young M, Minkoff H, Greenblatt R, Levine AM: Association of cells with natural killer (NK) and NKT immunophenotype with incident cancers in HIV-infected women. AIDS Res Hum Retrovir. 2008, 24: 163-168.
Ormel J, Oldehinkel AJ, Vollebergh W: Vulnerability before, during, and after a major depressive episode: a 3-wave population-based study. Arch Gen Psychiatry. 2004, 61: 990-996. 10.1001/archpsyc.61.10.990.
Rabbitt P, Lunn M, Wong D, Cobain M: Sudden declines in intelligence in old age predict death and dropout from longitudinal studies. J Gerontol B Psychol Sci Soc Sci. 2008, 63: P205-P211. 10.1093/geronb/63.4.P205.
Randolph JF, Sowers M, Bondarenko I, Gold EB, Greendale GA, Bromberger JT, Brockwell SE, Matthews KA: The relationship of longitudinal change in reproductive hormones and vasomotor symptoms during the menopausal transition. J Clin Endocrinol Metab. 2005, 90: 6106-6112. 10.1210/jc.2005-1374.
Rousseau MC, Abrahamowicz M, Villa LL, Costa MC, Rohan TE, Franco EL: Predictors of cervical coinfection with multiple human papillomavirus types. Cancer Epidemiol Biomarkers Prev. 2003, 12: 1029-1037.
Ryu S, Chang Y, Woo HY, Lee KB, Kim SG, Kim DI, Kim WS, Suh BS, Jeong C, Yoon K: Time-dependent association between metabolic syndrome and risk of CKD in Korean men without hypertension or diabetes. Am J Kidney Dis. 2009, 53: 59-69. 10.1053/j.ajkd.2008.07.027.
Seid M, Varni JW, Cummings L, Schonlau M: The impact of realized access to care on health-related quality of life: a two-year prospective cohort study of children in the California State Children’s Health Insurance Program. J Pediatr. 2006, 149: 354-361. 10.1016/j.jpeds.2006.04.024.
Silfverdal SA, Ehlin A, Montgomery SM: Protection against clinical pertussis induced by whole-cell pertussis vaccination is related to primo-immunisation intervals. Vaccine. 2007, 25: 7510-7515. 10.1016/j.vaccine.2007.08.046.
Spence SH, Najman JM, Bor W, O’Callaghan MJ, Williams GM: Maternal anxiety and depression, poverty and marital relationship factors during early childhood as predictors of anxiety and depressive symptoms in adolescence. J Child Psychol Psychiatry. 2002, 43: 457-469. 10.1111/1469-7610.00037.
Stewart R, Xue QL, Masaki K, Petrovitch H, Ross GW, White LR, Launer LJ: Change in blood pressure and incident dementia: a 32-year prospective study. Hypertension. 2009, 54: 233-240. 10.1161/HYPERTENSIONAHA.109.128744.
Strasak AM, Kelleher CC, Klenk J, Brant LJ, Ruttmann E, Rapp K, Concin H, Diem G, Pfeiffer KP, Ulmer H: Longitudinal change in serum gamma-glutamyltransferase and cardiovascular disease mortality: a prospective population-based study in 76,113 Austrian adults. Arterioscler Thromb Vasc Biol. 2008, 28: 1857-1865. 10.1161/ATVBAHA.108.170597.
Strawbridge WJ, Cohen RD, Shema SJ: Comparative strength of association between religious attendance and survival. Int J Psychiatry Med. 2000, 30: 299-308.
Sung M, Erkanli A, Angold A, Costello EJ: Effects of age at first substance use and psychiatric comorbidity on the development of substance use disorders. Drug Alcohol Depend. 2004, 75: 287-299. 10.1016/j.drugalcdep.2004.03.013.
Tehard B, Lahmann PH, Riboli E, Clavel-Chapelon F: Anthropometry, breast cancer and menopausal status: use of repeated measurements over 10 years of follow-up-results of the French E3N women’s cohort study. Int J Cancer. 2004, 111: 264-269. 10.1002/ijc.20213.
Vikan T, Johnsen SH, Schirmer H, Njolstad I, Svartberg J: Endogenous testosterone and the prospective association with carotid atherosclerosis in men: the Tromso study. Eur J Epidemiol. 2009, 24: 289-295. 10.1007/s10654-009-9322-2.
Wang NY, Young JH, Meoni LA, Ford DE, Erlinger TP, Klag MJ: Blood pressure change and risk of hypertension associated with parental hypertension: the Johns Hopkins Precursors Study. Arch Intern Med. 2008, 168: 643-648. 10.1001/archinte.168.6.643.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/12/96/prepub
This work was supported by the National Health & Medical Research Council Grants Number 60740 and Number 251533.
Authors and affiliations.
Cancer Epidemiology Centre, Cancer Council Victoria, Carlton, VIC, Australia
Amalia Karahalios, Laura Baglietto, Dallas R English & Julie A Simpson
Centre for Molecular, Environmental, Genetic, and Analytic Epidemiology, School of Population Health, The University of Melbourne, Parkville, VIC, Australia
Amalia Karahalios, Laura Baglietto, John B Carlin, Dallas R English & Julie A Simpson
Clinical Epidemiology and Biostatistics Unit, Murdoch Children’s Research Institute, Parkville, VIC, Australia
John B Carlin
Correspondence to Julie A Simpson.
Competing interests.
The authors declare that they have no competing interests.
AK drafted the protocol for the review, reviewed the articles and drafted the manuscript. JAS conceived of the review, resolved any discrepancies encountered by AK when reviewing the articles and helped with drafting the manuscript. LB, JBC and DRE provided feedback on the design of the protocol and drafts of the manuscript. All authors read and approved the final manuscript.
Additional file 1: Table S1. Detailed characteristics of the studies included in the systematic review. Details of studies included in the systematic review and the corresponding reference list [29-35, 38-112]. (DOC 212 KB)
Rights and permissions.
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite this article.
Karahalios, A., Baglietto, L., Carlin, J.B. et al. A review of the reporting and handling of missing data in cohort studies with repeated assessment of exposure measures. BMC Med Res Methodol 12, 96 (2012). https://doi.org/10.1186/1471-2288-12-96
Received: 12 December 2011
Accepted: 11 July 2012
Published: 11 July 2012
DOI: https://doi.org/10.1186/1471-2288-12-96
ISSN: 1471-2288
Chapter 1 provides an introduction to the problem of missing data and how they may arise and a description of the Gateshead Millennium Study data, to which all the missing data methods will be applied. It concludes by giving the aims of this thesis. Chapter 2 provides an in depth review of various missing data approaches and indicates which ...
Ignoring the missing data mechanism. The likelihood function that ignores the missing data mechanism is

L_ign(θ | y_obs) ∝ f(y_obs | θ) = ∫ f(y_obs, y_mis | θ) dy_mis.

When is L ∝ L_ign, so that the missing data mechanism can be ignored in further analysis? This holds if (1) the data are MAR, and (2) the parameters η governing the missingness are distinct from the parameters θ of the data model.
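Under these ignorability conditions, the EM algorithm can be used to obtain maximum likelihood estimates from the incomplete data directly. The following is a minimal sketch, not the procedure applied to the Gateshead Millennium Study data: it assumes the simplest possible setting, a bivariate normal sample with values missing in one of the two variables under MAR, and the function name and simulated data below are illustrative only.

```python
import numpy as np

def em_bivariate_normal(X, n_iter=100):
    """EM estimation of the mean and covariance of a bivariate normal
    sample when some values in column 1 are missing (coded as NaN).

    E-step: replace each missing x2 by its conditional expectation
            E[x2 | x1] under the current parameter estimates.
    M-step: re-estimate the mean and covariance from the completed
            data, adding back the conditional variance carried by
            the imputed entries.
    """
    X = X.astype(float).copy()
    miss = np.isnan(X[:, 1])
    n = len(X)
    # start from mean-filled data
    X[miss, 1] = np.nanmean(X[:, 1])
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    for _ in range(n_iter):
        # E-step: regression of x2 on x1 under the current parameters
        beta = S[0, 1] / S[0, 0]
        cond_var = S[1, 1] - beta * S[0, 1]
        X[miss, 1] = mu[1] + beta * (X[miss, 0] - mu[0])
        # M-step: update parameters from the completed data; the
        # imputed cells contribute their conditional variance
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False, bias=True)
        S[1, 1] += cond_var * miss.sum() / n
    return mu, S, X
```

The point of the sketch is that, unlike complete case analysis, EM still uses the observed x1 in the incomplete rows, so under MAR it recovers approximately unbiased parameter estimates even when the missingness depends on x1.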