Doubly Robust Imputation of Incomplete Binary Longitudinal Data

Estimation in binary longitudinal data by using generalized estimating equation (GEE) becomes complicated in the presence of missing data because standard GEEs are only valid under the restrictive missing completely at random assumption. Weighted GEE has therefore been proposed to allow the validity of GEE's under the weaker missing at random assumption. Multiple imputation offers an attractive alternative, by which the incomplete data are pre-processed, and afterwards the standard GEE can be applied to the imputed data. Nevertheless, the imputation methodology requires correct specification of the imputation model. Dual imputation method provides a new way to increase the robustness of imputations with respect to model misspecification. The method involves integrating the so-called doubly robust ideas into the imputation model. Focusing on incomplete binary longitudinal data, we combine DIM and GEE (DIM-GEE) and study the relative performance of the new method in a case study of obesity among children, as well as a simulation study. Journal of Biometrics & Biostatistics J o u rn al of Bio metrics & Bistatis t i c s


Introduction
In a longitudinal study, each subject or unit is measured at several time points, thereby allowing the direct study of change over time. In many biomedical applications, the longitudinal response is binary, or in general non-Gaussian, for instance, the presence or absence of illness in an intervention study for a period of seven weeks. The generalized linear mixed model is a widely used approach with binary longitudinal responses [1,2]. In many practical settings with moderate to large length, however, these models imply complex and hard to manipulate likelihoods, for example, in the presence of missing data. An alternative modeling approach is generalized estimating equations [3]. This method essentially allows confining attention to the mean structure provided that one is willing to adopt working' assumptions about the association structure.
When data are incomplete, GEE suffers from its frequentist nature and is only valid under the restrictive missing completely at random (MCAR) assumption, where the missingness is independent of both unobserved and observed data [4]. For this reason, Robins et al. [5] have developed a class of GEE methods, the so-called weighted GEE (WGEE), that allows for the weaker missing at random (MAR) assumption, where the missingness is independent of the unobserved data given the observed data [4,6]. WGEE methods use the inverse of the subject's probability of being observed as a weight contributed in the estimating equation to reduce possible bias in the regression parameter estimates.
More recently, WGEE methods have been extended to the socalled doubly robust (DR) estimating equations, where the weighting idea is integrated with the use of a predictive model for the missing data given the observed data. The DR methods provide consistent estimates of the parameters given correct specification of either the weights or the predictive model, but not necessarily both. Excellent reviews can be found in Bang and Robins [7] and Rotnitzky [8].
The idea of doubly protection (or doubly robustness) is advantageous because it provides the analyst two routes to valid inferences, rather than just one. Nevertheless, the DR methods can be unstable in practice when both models are misspecified [9], or they can be disastrous when the propensity scores (i.e., the probabilities of being observed) are close to zero [9,10]. Moreover, these methods lack generalization to intermittent missing data, where the subjects return to the study after skipping one or more visits.
A viable alternative approach is multiple imputation [11,12]. Standard MI requires MAR to hold, even though extensions exist. Missing values are imputed several times, and then the resulting completed data sets are analyzed using a standard method like GEE. Beunckens et al. [13], among others, combined MI and GEE such that the missing data are multiply imputed, and then inferences are obtained by GEE, and combined into a single summary using Rubin's pooling rules (MI-GEE). However, this method, like the other imputation approaches, needs correct specification of the imputation model.
Jolani et al. [14] combined DR ideas with MI and constructed an imputation model with a doubly protected property, the so-called dual imputation method (DIM). This method makes use of the weighting idea within the imputation model. More specifically, a function of the propensity scores (e.g., the inverse of the propensity score) is included into the imputation model with the aim of increasing robustness of imputations against misspecification of the imputation model. Also, DIM can handle the problem of intermittent as well as monotone (or dropout) incomplete longitudinal data.
Until now, DIM has only been tried for continuous data. In this paper, we extend the methodology to binary data. Our focus is thus on the combination of DIM and GEE (DIM-GEE) for incomplete longitudinal binary data when the pattern of missing data is general. This involves multiply imputing binary responses by means of DIM and then applying GEE to the completed data sets. DIM-GEE is a new imputation method that makes it possible to model incomplete longitudinal binary data under the MAR assumption. This paper is organized as follows. In GEE for binary longitudinal data, an overview of GEE for analyzing longitudinal binary data is given. MI is briefly outlined in multiple imputation. The new imputation method is presented in dual imputation based GEE. In case study, DIM-GEE is used to analyze a case study of obesity among children and to compare with MI-GEE. A simulation study comparing DIM-GEE with MI-GEE was conducted and results are presented in simulation study.

GEE for Binary Longitudinal Data
Suppose the random variable Y ij denotes a sequence of binary measurements at time j, j = 1,…,N for subject i, i = 1,…,n. The observed value y ij is a realization of the binary response variable Y ij , and we assume independence across subjects. The focus of this study is on the marginal models that describe the binary outcome vector, given a set of predictor variables. The association structure (correlation among the components) is captured by an assumed model. Let π ij denote the marginal probability of observing a 'success' for subject i at time j, i.e., be standardized deviation between the data and the model predictor for subject i at time j, and 1 2 be associations among responses. Following Bahadur [15], the model can be represented by where y i = (y i1 ,…,y iN ) is a vector of measurements for subject i, and The joint probability mass function is thus the product of individual mass functions and the correlation factor c(y i ). The later can be viewed as a model for overdispersion.
The use of full likelihood-based methods for the above marginal model can be unattractive due to prohibitive computational requirements. Therefore, alternative methods such as GEE have been proposed. GEE is very useful in marginal models because by adopting working assumptions about the association structure, one only needs correctly specifying the univariate marginal distributions.
For a binary response Y ij , suppose x ij is a p-dimensional vector of complete covariates. Assuming the logit link function, the mean structure of the binary model can be expressed as where β is the vector of model parameters. The classical GEE can thus take the form diagonal matrix with the marginal variances, and C i is the marginal correlation matrix for the repeated measures. The correlation matrix C i is typically expressed in terms of a vector of nuisance parameters that needs to be replaced by a consistent estimate, e.g., a momentbased estimator [3]. Given a correct specified marginal mean π i , it can be shown, under mild regularity conditions, the estimate of β is asymptotically normal with mean vector β and covariance matrix .
When the working correlation structure is misspecified there is no price to pay in terms of consistency of the asymptotic normality of β . However, this misspecification may result in loss of efficiency. Because GEE is not a likelihood based approach, it suffers from its frequentist nature in the presence of missing data. Therefore, GEE is only valid under MCAR.

Multiple Imputation
The idea of multiple imputation is to replace each missing value with a set of M plausible values drawn from the conditional distribution of the missing values given the observed data. M imputed data sets are then analyzed using standard methods. The final step is to combine the results into a single summary using Rubin's rule [11].
A popular approach to create imputed datasets is multiple imputation by chained equations [16][17][18]. The basic idea is to specify a set of imputation models, one model for each variable with missing values, and then impute data on a variable-by-variable basis. We briefly outline the MICE algorithm for the case of binary longitudinal responses. Suppose, for each subject i, the vector of measurements Y i =(Y i1 ,…,Y iN ) has missing values in an arbitrary pattern. We drop i for notation convenience. All missing values are initially filled in at random. The first incomplete measurement, say Y 1 , is regressed (here a logistic regression) on the other measurements Y 2 ,…,Y N and possibly covariates x restricted to subjects with observed Y 1 . Missing values in Y 1 are then imputed using the posterior predictive distribution of Y 1 given Y 2 ,…,Y N and x. The missing values in the second incomplete measurement, say Y 2 , are then imputed by measurements Y 1 , Y 3 ,…,Y N and x. The process is repeated for all other measurements with missing values in turn. The cycle is repeated for several times (say 10 or 20) to produce a single imputed dataset. The whole procedure is repeated M times from different seeds, thus producing M completed data sets. The resulting completed data sets are finally used to estimate β using standard methods.

Dual Imputation Based GEE
In this section we show how to impute the missing measurements in binary longitudinal data using DIM methodology, when the missingness mechanism is MAR. The key idea is to incorporate a function of the propensity scores into the imputation model [14]. The aim of including this function (the inverse of the propensity scores) into the model is to reduce the effect of a possible misspecified imputation model.
Suppose Y ij can be observed or missing. Let R ij denote a binary response indictor for subject i at time j; that is, R ij =1 if Y ij is observed and R ij =0 otherwise. For each subject i, r ij is a realisation of R ij . Suppose the probability of being observed (i.e., the propensity score) for subject i at time j follows the logistic model where w ij is a q-dimensional vector of covariates associated with the unknown parameters α j , typically including the other outcomes y is (s ≠ j) and covariates x ij . We further allow that the missingness can happen in an arbitrary pattern (e.g., an intermittent pattern).
We first consider an unrealistic but pedagogical case where all propensity scores τ ij = P(R ij = 1|w ij ,α j ) are assumed to be known. Then, inclusion of 1 ij τ − in the imputation model is a sufficient condition to obtain a DR estimator of β [19]. In what follows it is convenient to define v ij = (x ij ,y i(-j) )', where y i(-j) includes all outcome variables excluding y ij .
Therefore, for each incomplete variable Y ij , the dual imputation model fits the following model restricted to its observed part where γ j is a vector of parameters in the imputation model corresponding to v ij , and δ j is a regression coefficient for the new predictor 1 ij τ − . A random draw ( ) * * , j j γ δ is generated from its posterior distribution, and then the missing values of the j th incomplete variable are imputed using the drawn values of the parameters. After all incomplete variables are imputed in turn, and the cycle is repeated for an adequate number, a completed data set will be produced. Each completed data set then is analyzed using the conventional GEE, and the results are pooled by Rubin's rule into a single inference.
The propensity scores often are unknown so need to be estimated. Estimation, however, is not straightforward when the pattern of missing data is intermittent. Because estimation of the propensity scores in a particular time depends on the other time points that might be incomplete. For the continuous case, Jolani et al. [14] have developed an extension of MICE algorithm that successively estimates the propensity scores and imputes the missing values for each incomplete variable.
Here we outline the algorithm in detail. Initially, all missing values are filled in at random. Suppressing i form the notation, for each incomplete variable Y j , j=1,…,N, the propensity score model 2 is used to draw a random value of α j and to estimate the propensity score 1 j − τ based on the drawn value. The imputation model 3 then generates imputations for the missing part of Y j . Cycling through all the models, posterior draws of the parameters are made given current values of the other variables. More specifically, steps of the DIM are: 1. Impute initially missing data by taking a random draw from the observed data. (f) Impute impute the missing values in the j th incomplete variable using the drawn values in the previous step 3. Return to step 2 to repeat the algorithm a small number of times, say 10 or 20.
The algorithm is a possibly incompatible Gibbs sampler. Although there is no guarantee for the existence of the joint distribution from which the values are drawn, experience has shown that it often leads to valid statistical inferences in a variety of cases Van Buuren et al. [16]; Gelman and Raghunathan [20]; Van Buuren [21]; Lee and Carlin [22]; White et al. [23].

Case Study
The data used in this paper were obtained from the Muscatine Coronary Risk Factor study [24], a longitudinal survey of school-age children in Muscatine, Iowa. The aim of the study was to examine the development and persistence of risk factors for coronary disease in children. In total, 4856 children (boys and girls) were followed biennially from 1977 to 1981, resulting in 3 measurements per child. The outcome of interest was the status of obesity, coded as 0 (nonobese) or 1 (obese), which was obtained on the basis of a comparison of their weight to age-gender specific norms. One objective was whether the risk of obesity increases with age and whether patterns of change in obesity are the same for boys and girls.
Due to many reasons, the child's obesity status could not be measured on all scheduled time points. Fewer than 40% of the children provided complete data at all three measurements. The patterns of missingness were displayed in Table 1 along with their corresponding frequency and percent of missing data for boys and girls separately. We see that the occurrence of missingness is similar in both groups.
The rate of children classified as obese at each of the three measurement occasions is also depicted in Figure 1. These percentages were calculated based on the complete case analysis at each occasion for both boys and girls. The graph indicates that the rates of obesity were increased for boys over time. For girls, the rates of obesity were increased first, but declined thereafter. The graph also shows that the rates of obesity were higher for girls at all occasions.
The marginal probability of obesity is modeled as a logistic function with time, sex and their interaction as covariates: where Y ij =1 if the i th child at the j th occasion is classified as obese, and Y ij =0 otherwise; sex i =1 if the i th child is girl, and sex i =0 if the i th child is boy; time ij =j, j=1, 2, 3, represents time at each occasion; β=(β 0 , β 1 , β 2 , β 3 )' is a vector of parameters 'in which we are interested.
Missing values pose a problem in this study, and, unfortunately, an intermittent pattern of missingness makes the analysis of the data even more complicated. Standard GEE may produce biased results because   rate of obesity was. Nevertheless, the estimated time effect by standard GEE where complete cases were used only was more than twice as large as those of the imputation methods (0.097 versus 0.042 and 0.039) showing a possible overestimation of this effect in standard GEE.
An interaction between time and sex was not significant. This implies that although the rate of obesity was increased with age, this increment did not statistically differ among boys and girls.
Apart from standard GEE, the parameter estimates were very similar based on MI-GEE and DIM-GEE. However, standard errors in DIM-GEE were marginally lower than MI-GEE. Thus, while the substantive conclusion did not differ, imputation using DIM-GEE provided the most efficient estimates. To further investigate the relative merits of both imputation methods, we conducted a limited simulation study aiming at a comparison between DIM-GEE and MI-GEE in the next section.

Simulation Study
This section reports the results of a simulation study comparing DIM-GEE and MI-GEE. We consider a situation where the imputation model for both methods is correctly specified. For DIM-GEE, we also consider a correct propensity score model. Simulation studies comparing misspecified models (in a slightly different setting with continuous outcomes) were reported in Jolani et al. [14]. Since complete case analysis is known to be biased, we concentrated on the comparison between DIM-GEE and MI-GEE.
For the simulation study, we generated data by mimicking the obesity case study. A total of n = 4856 subjects was initially divided into two groups of equal size representing their sex. Then, the binary outcome at three time points was generated based on the Bahadur model formulation 1 with ij j j ρ = respectively. The latter defines an exchangeable correlation structure. The outcome Y ij represents the obesity status (1 coded as obese) for subject i, i=1,…,n, at time j, j = 1,2,3; time ij =j represents time at each occasion; and sex i = 1 if the subject is a girl and zero otherwise.
We assumed the missing data process is MAR and adopted the general methodology proposed by Van Buuren et al. [18] for creating intermittent missing data under MAR. Similar to the obesity data, we specified six missing data patterns (Table 1). Then, the missing data were created such that they formed an approximation of the missing data percentages presented in Table 1. We created missing values in each pattern conditional on the observed data. For instance, the missing data in pattern {OOM} were conditioned on Y 1 ,Y 2 . A full description of this procedure can be found in Van Buuren et al. [18].
The incomplete data sets afterwards were multiply imputed and analyzed by DIM-GEE and MI-GEE methods respectively. For MI-GEE, the imputation model included sex and obesity status at other time points as covariates. For DIM-GEE, we first obtained the propensity scores by fitting the logistic model 2, and then included the inverse of the propensities as an additional covariate into the imputation model. The number of imputations was set to 10 with 1,000 Monte Carlo simulations. All calculations were done in R 3.0.2 using MICE [25].
Several measures were computed to investigate the performance of both methods. First, we defined the relative bias (RB) it is very hard to verify an MCAR assumption. Moreover, complete case analysis is wasteful due to a large fraction of missing data. WGEE cannot also be per-formed because of an intermittent pattern of missingness. Performing an imputation strategy is therefore a reasonable solution to estimate the parameters of interest.
We applied our proposed method (DIM-GEE) as follows. First, a propensity score (i.e., the probability of being observed) was estimated from model 2 for every child at each occasion. Background variables sex and age (mid-point of age group) were considered as covariates in the propensity score model. Second, we included sex, time and their interaction into the imputation model plus the inverse of the propensity scores. The latter variable aims at correcting for possible biases in the imputation model. We created 20 multiply imputed data sets. Each imputed data set was then analyzed using standard GEE. The final results were pooled to obtain a single inference. Table 2 shows results based on three approaches: Standard GEE, MI-GEE, and DIM-GEE. Standard GEE uses the complete case only. MI-GEE multiply imputes the missing values using the standard MI procedure and then performs GEE for each imputed data set. For this the number of imputations was also 20.
In line with research questions, the effect of time was significant in all methods indicating that the risk of obesity was increased with age. This implies that the older the age of the children was, the higher the  Moreover, a 95% confidence interval width (CIW), as well as the coverage of a 95% confidence interval (COV) were computed. The results for the parameter estimates based on these methods are presented in Table 3.
The relative bias was negligible for both methods showing asymptotically unbiased parameter estimates. However, the RMSE based on DIM-GEE was marginally smaller than that of MI-GEE, pointing a greater efficiency of the estimators by the former method. In addition, the confidence interval width was always shorter for DIM-GEE, and the empirical coverage rates were very close to the nominal level. In contrast, the 95% coverage rates of MI-GEE were lower. In sum, although both methods were performed equally well in terms of bias, the newly developed method provided more efficient parameter estimates.

Concluding Remarks
We have presented a version of generalized estimating equations for in-complete binary longitudinal data under MAR. This extension is based on the principals of multiple imputation, inverse probability weighting and its doubly robustness counterpart, and GEE. Our particular attention was on the extension of dual imputation method [14] to incomplete binary measurements. The proposed method facilitates computational intricacy ofWGEEs in complex patterns of missing data, and is easy to implement in existing software.
In view of previous work on the comparison between WGEE and MI, Clayton et al. [26] and Beunckens et al. [13], among others, provided evidence on preference of MI over WGEE in longitudinal binary data. The simulation studies by Beunckens et al. [13] provided insight about the efficiency of MI based GEE, the so-called MI-GEE, over WGEE particularly in small samples. Nevertheless, misspecification of imputation model can-not be disregarded in practice, and biased results can be expected when the imputation model is incorrect [23,27].
For incomplete binary longitudinal data, the new imputation method (DIM-GEE) was particularly designed to increase the robustness of imputations. By adopting the doubly robust property into the imputation model, one might expect improvement under doubly protected imputation methods.
In this paper, we have compared versions of generalized estimating equations in a real life example with missing data, as well as a simulation study. The results revealed that DIM-GEE produced parameter estimates with smaller estimated variances than the other methods. Note: MI-GEE is standard multiple imputation based GEE, and DIM-GEE is dual imputation based GEE. RB% is the relative bias percent; RMSE is the root mean squared error; COV is the 95% confidence interval coverage; CIW is the 95% confidence interval width.