From: "Paul S. Collier" <p.collier@qub.ac.uk>
Subject: missing data items
Date: Mon, 11 Sep 2000 15:11:25 +0100

Is there a way in which I can code for missing data relating to a possible covariate? An example would be where I have recorded serum K for the majority of my subjects and wish to investigate whether or not this influences a pharmacokinetic parameter. How do I code for the fact that the data item is missing in a few individuals? If I use "0" or "." I assume that NONMEM reads this as a zero value rather than as a missing value.

Paul

**********************************************
Dr P.S. Collier
School of Pharmacy
The Queen's University of Belfast
97 Lisburn Road
Belfast BT9 7BL
N. Ireland, U.K

Tel: +44 (0)28 90 272009
FAX: +44 (0)28 90 247794
Email: p.collier@qub.ac.uk
http://www.qub.ac.uk/pha/index.html
**********************************************


*****


From: LSheiner <lewis@c255.ucsf.edu>
Subject: Re: missing data items
Date: Mon, 11 Sep 2000 08:48:18 -0700

The problem is unfortunately much more complex than finding a code ... It involves the answer to the following question: What should the estimation scheme do with someone who has a missing value?

This is one version of a classical statistical issue ("missing data"), for which there are many proposed solutions. For example:

1. delete all cases with missing data
2. integrate the likelihood across the missing data to use
   a marginal likelihood for those individuals with missing data
3. simply impute the value (e.g., set it to the population mean)
4. more sophisticatedly impute the value (from all other data values,
   using a regression formula derived from the complete cases)
5. multiply impute the value as in 4.

Only numbers 2 and 5 are completely satisfactory, and even those require, in general, that the missingness mechanism be "ignorable."

None of these approaches is implemented by default in NONMEM, so a missing-value code by itself would accomplish nothing.

Once you decide how you want to deal with the missing data, it will be possible to write your NONMEM control file to implement that decision. When you do, you can of course choose any coding you want for the missing values, since it will be your own computer code that recognizes and deals with them.

LBS.

--
_/ _/ _/_/ _/_/_/ _/_/_/ Lewis B Sheiner, MD (lewis@c255.ucsf.edu)
_/ _/ _/ _/_ _/_/ Professor: Lab. Med., Bioph. Sci., Med.
_/ _/ _/ _/ _/ Box 0626, UCSF, SF, CA, 94143-0626
_/_/ _/_/ _/_/_/ _/ 415-476-1965 (v), 415-476-2796 (fax)


*****


From: Mats Karlsson <Mats.Karlsson@biof.uu.se>
Subject: Re: missing data items
Date: Mon, 11 Sep 2000 20:47:48 +0200

Dear Paul and Lewis,

In addition to Lewis's 5 options, I think one more can be added: we can create a model for the covariate. To do this we should have repeated measures of the covariate, or knowledge of its inter- or intraindividual variability. If there are systematic changes in the covariate with time, a model for these would also be required.

If we think of PK/PD modelling, the drug concentration is only a covariate for the pharmacodynamic response, and we are seldom bothered by missing PK data when we perform a PK/PD analysis in which concentrations and effects are modelled simultaneously. Similarly, drug concentration is a covariate for metabolite concentration, plasma concentration is a covariate for urine output, etc. So we have extensive experience of handling missing covariate values when we have a model for the covariate.

It is often not difficult to create a model for many of the covariates we use: we can either set the intraindividual error to something minimal (AGE, HT, WT), or we know the interindividual distribution characteristics well (SEX, GENO). The problem lies mainly in that there are so many covariates that doing this is a pain. An additional problem is that NONMEM has not easily handled the combination of continuous (PK or PD) data with categorical covariate (SEX, GENO) data.

Best regards,
Mats

--
Mats Karlsson, PhD
Professor of Biopharmaceutics and Pharmacokinetics
Div. of Biopharmaceutics and Pharmacokinetics
Dept of Pharmacy
Faculty of Pharmacy
Uppsala University
Box 580
SE-751 23 Uppsala
Sweden
phone +46 18 471 4105
fax +46 18 471 4003
mats.karlsson@biof.uu.se


*****


From: Nick Holford <n.holford@auckland.ac.nz>
Subject: Missing data values
Date: Tue, 12 Sep 2000 09:21:18 +1200

Paul,

There are 2 main kinds of approach available to you for dealing with missing covariates. The most commonly used is to replace the missing values with the median (or similar) of the non-missing values. You can do this in the data file, or put a missing-value code in the data file and substitute the median with NM-TRAN code, e.g.

IF (K.EQ.-1) THEN ; -1 is the missing-value code in the data file
MYK=4.0 ; the median of the non-missing K values
ELSE
MYK=K
ENDIF

The other, more sophisticated, method is to impute the missing K value. Do this by putting all the non-missing K values in the DV column and adding an extra column, DVID, which distinguishes the original DV from a DV which is a K value, e.g. 1=original DV, 2=K.

Then you can do this:

MYK=THETA(1)+ETA(1) ; THETA(1) is the pop median value for K; ETA(1) is its variability
; use MYK as you wish as a covariate
IF (DVID.EQ.1) THEN
Y=F+EPS(1) ; or whatever you want for your original DV
ELSE
Y=MYK+EPS(2) ; FIX EPS(2) to a small value e.g. 0.0001
ENDIF
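
For concreteness, the initial-estimate records that would accompany this fragment might look something like the sketch below (the values are placeholders, not recommendations, and any THETAs/ETAs/EPSs belonging to the structural model itself are omitted); the essential point is that EPS(2) is fixed to a tiny variance so the K "observations" are fitted almost exactly:

$THETA (0, 4.0) ; THETA(1): population value for K
$OMEGA 0.04 ; ETA(1): interindividual variability of K
$SIGMA 0.1 ; EPS(1): residual error for the original DV
$SIGMA 0.0001 FIX ; EPS(2): fixed to a small value, as per the comment above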

The model for MYK can be as complex as you wish, e.g. if you think Na is a predictor of K you can put measured Na as a covariate in the MYK expression.

See : Karlsson M, Jonsson E, Wiltse C, Wade J. Assumption testing in population pharmacokinetic models: illustrated with an analysis of moxonidine data from congestive heart failure patients. J Pharmacokinet Biopharm 1998;26(2):207-46.
--
Nick Holford, Division of Pharmacology & Clinical Pharmacology
University of Auckland, Private Bag 92019, 85 Park Road, Auckland, NZ
email: n.holford@auckland.ac.nz tel:+64(9)373-7599x6730 fax:373-7556
http://www.phm.auckland.ac.nz/Staff/NHolford/nholford.htm


*****


From: LSheiner <lewis@c255.ucsf.edu>
Subject: Re: missing data items
Date: Mon, 11 Sep 2000 16:49:00 -0700

All,

Mats' suggestion is quite close to my #2 (integrate out the missing data), which is easier to implement in NONMEM (modulo the difficulties Mats notes) than in more restricted environments, where it requires a lot of math to figure out what the marginal likelihood is.

Taking a very simple example where the primary response is a linear function of a covariate, and the two were not measured at the same times (as is sometimes the case in PK/PD), the code for Mats' suggestion might look like the following (the data item TYPE identifies the DV as being the primary response (TYPE=1) or the covariate (TYPE=2)):

$INPUT ID TIME DV TYPE
;.............DV = Y WHEN TYPE=1; DV = THE COVARIATE X WHEN TYPE=2
$PRED
; "MODEL" THE COVARIATE
XX = THETA(1) + ETA(1)
; LINEAR MODEL FOR THE PRIMARY RESPONSE
YY = THETA(2) + THETA(3)*XX + ETA(2)
IF(TYPE.EQ.1) THEN
Y = YY + EPS(1)
ELSE
Y = XX + EPS(2)
ENDIF

The above is equivalent to what Nick offered, and, contrary to his note, is not equivalent to imputing the data. It is joint modeling.

Imputing means substituting some prediction of the covariate for the covariate wherever it is missing, and then using the prediction as though it had been measured. The control file would then look just as it would if the covariate had never been missing.

Although popular, this method can produce bias (as can any missing-data method if the missingness mechanism is not ignorable), and will always produce somewhat wrong standard errors of the parameter estimates, as one is making up some data but not acknowledging it.

There is a way to use this method that avoids the latter problem, called multiple imputation. It involves two differences. First, the expected value of the covariate (e.g., the median of the observed values, or the mean, or a regression prediction) is not what is filled in, but rather that value *plus a random error* with variance equal to the residual error variance of the model used to predict the covariate. Second, one does the analysis K=5 (or so) times, each with a different set of imputed values for the missing data, and computes the final parameter estimates from the set of K estimates, and standard errors from the K analyses. A reference is Rubin, D. B. (1996). Multiple imputation after 18+ years. J Amer Stat Assoc 91: 473-489.

The method I suggested as my #2 differs from the above slightly in that it attempts to condition on the observed values where they exist, and only integrate where necessary. For the same example as above, the code to do it that way might look like this:

$INPUT ID TIME RX DV TYPE
;.............RX IS THE VALUE OF X IF MEASURED, OR 999 IF NOT
;................DV = Y WHEN TYPE=1; DV = RX WHEN TYPE=2
;*** NOTE: AN RX ENTRY OF SOME KIND APPEARS ON ALL RECORDS, WHILE
;*** DV = RX APPEARS ONLY ON SOME RECORDS (ONE FOR EACH TIME X IS MEASURED)

$PRED
; INTEGRATE OUT THE MISSING COVARIATE
IF(RX.EQ.999.OR.TYPE.EQ.2) THEN
XX = THETA(1) + ETA(1)
ELSE
XX = RX
ENDIF
; LINEAR MODEL FOR THE PRIMARY RESPONSE
YY = THETA(2) + THETA(3)*XX + ETA(2)
IF(TYPE.EQ.1) THEN
Y = YY + EPS(1)
ELSE
Y = XX + EPS(2)
ENDIF
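
To make the data layout concrete, records for two hypothetical subjects (made-up numbers; 999 flags a missing X, and subject 2 never had X measured) might look like this:

ID TIME RX   DV    TYPE
1   0   5.2   5.2   2    <- covariate record: DV = RX
1   1   5.2  10.3   1    <- response record
1   2   5.2  12.1   1    <- response record
2   1   999   8.7   1    <- response record, X missing
2   2   999   9.9   1    <- response record, X missing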

LBS.
--
_/ _/ _/_/ _/_/_/ _/_/_/ Lewis B Sheiner, MD (lewis@c255.ucsf.edu)
_/ _/ _/ _/_ _/_/ Professor: Lab. Med., Bioph. Sci., Med.
_/ _/ _/ _/ _/ Box 0626, UCSF, SF, CA, 94143-0626
_/_/ _/_/ _/_/_/ _/ 415-476-1965 (v), 415-476-2796 (fax)


*****


From: "Piotrovskij, Vladimir [JanBe]" <VPIOTROV@janbe.jnj.com>
Subject: An approach for imputing missing independent variable (covariate)
Date: Wed, 20 Sep 2000 15:10:22 +0200

Dear Colleagues,
The recent discussion regarding missing covariate values was stimulating, and catalyzed my thinking on how to solve the problem without assuming any explicit model for the covariate. I did some simulation/fitting exercises, and you may find the results attached. Any comments are appreciated.

Best regards,
Vladimir
----------------------------------------------------------------------
Vladimir Piotrovsky, Ph.D.
Janssen Research Foundation
Clinical Pharmacokinetics (ext. 5463)
B-2340 Beerse
Belgium
Email: vpiotrov@janbe.jnj.com

An approach for imputation of missing independent variable.doc


*****


From: "Gibiansky, Leonid" <gibianskyl@globomax.com>
Subject: RE: An approach for imputing missing independent variable (covariate)
Date: Wed, 20 Sep 2000 09:41:19 -0400

Vladimir,

Essentially, you allow your covariate to "float" so that the imputed missing value would not "disturb" your model. My impression is that it is the same as using NEWCOV instead of COV, where

NEWCOV = COV (if COV is not missing)
NEWCOV = THETA(10)+ETA(10)

$THETA
...
0 ; or any reasonable initial value and range

$OMEGA
....
HUGE FIXED; to allow any value that is convenient for the model

Is there any difference from your approach? The POSTHOC value for NEWCOV should be equal to the result of your iteration scheme. Alternatively, you may first model the covariate distribution independently (approximate it by the normal distribution, if possible, and find its mean and variance), and then fix theta(10) and omega(10) at those values. In this case, you place some restrictions on the missing covariate value by using the distribution of the non-missing covariate values.
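
Assuming the covariate column is labelled COV and -99 is used as its missing-value code (both labels are just for illustration), a runnable NM-TRAN rendering of the fragment above might be:

IF (COV.NE.-99) THEN
NEWCOV = COV ; observed value used as-is
ELSE
NEWCOV = THETA(10) + ETA(10) ; missing value is free to "float"
ENDIF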

Leonid


*****


From: LSheiner <lewis@c255.ucsf.edu>
Subject: Re: An approach for imputing missing independent variable (covariate)
Date: Wed, 20 Sep 2000 09:45:11 -0700

First, let me ask Vladimir why he says his method operates "without assuming any explicit model for a covariate". The inverted model for the DV is an explicit model for the covariate, is it not?

More importantly, however, Vladimir's approach has at least two problems: (i) it is non-convergent: each data imputation at step 4 generates a different data set, which will yield a different estimate at step 5. This will never stop. (ii) Even if it converges "well enough" to a "region", it will not yield correct standard errors.

To see (ii), imagine the (absurd) situation that all but two data points from one individual were missing: the algorithm would wind up filling in all missing data points from the line defined by the two actual observations (without any error) and would eventually report perfect precision for the estimate of the slope and intercept defined by the two observations. This is not to say that anyone would try such an analysis; it merely points out that the method fails as it approaches a limit, which should make one suspect that it will have problems, perhaps of lesser severity, away from that limit. The reason for the problem is that uncertainty in the (posthoc) parameter estimates is ignored (more on this below).

The more difficult issue, though, is how to compute the standard errors. The standard errors from the last step of Vladimir's last iteration can't be right, as these are conditional on the imputed data, treating them as known, when in fact they are unknown.

A simpler method, which doesn't require an invertible function such as Vladimir's, and which is theoretically sound (i.e., gives unbiased estimates and correct standard errors), is multiple imputation.

This method requires the ability to draw samples of the missing data from their posterior distribution. Fitting a population model with the IDV as the DV and then proceeding to get post-hoc parameters and simulating as Vladimir does in his steps 3 and 4 is close, although, as noted, this procedure ignores parameter uncertainty. But again, perhaps doing so won't do too much harm to the eventual standard errors (this depends on the relative magnitude of posterior parameter uncertainty to residual error).

Multiple imputation is so simple it can be described easily:

1. Estimate a distribution from which the missing data can be drawn. This can be entirely empirical and should use all the observed data (DV as well as IDV in Vladimir's example).

2. For m = 1,...,K (with K = 5 or so):
   Impute the missing data using the distribution in (1). Analyze the
   completed data as usual to estimate parameters Pm and the covariance
   matrix of the estimates Cm.
   End loop

3. P-hat = average(Pm)

4. Covariance(P-hat) = Covariance(Pm) + average(Cm), i.e., the between-imputation covariance of the K estimates plus the average within-imputation covariance (see the exact rules written out below).
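
For the record, the exact combining rules from Rubin's book, with B the between-imputation covariance and C-bar the average within-imputation covariance (note the extra factor (1 + 1/K), which the shorthand in step 4 omits), are:

\[
\hat{P} = \frac{1}{K}\sum_{m=1}^{K} P_m, \qquad
B = \frac{1}{K-1}\sum_{m=1}^{K}\bigl(P_m-\hat{P}\bigr)\bigl(P_m-\hat{P}\bigr)^{\top}, \qquad
\widehat{\mathrm{Cov}}(\hat{P}) = \bar{C} + \Bigl(1+\frac{1}{K}\Bigr)B,
\]

where \(\bar{C} = \frac{1}{K}\sum_{m=1}^{K} C_m\).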

This area is one that has received a great deal of attention from some first-rate statisticians, and we REALLY should follow their lead, or present a very compelling reason not to do so ...

For anyone wishing to pursue these matters further, I strongly recommend starting with:

1. Rubin "Multiple Imputation for Non-response in Surveys", Wiley, NY, 1987.
2. Tanner "Tools for Statistical Inference" Springer-Verlag, NY, 1993.

Some more recent references to multiple imputation are:
1: Barnard J, Meng XL. Applications of multiple imputation in medical studies: from AIDS to NHANES. Stat Methods Med Res. 1999 Mar;8(1):17-36.
2. Schafer JL. Multiple imputation: a primer. Stat Methods Med Res. 1999 Mar;8(1):3-15. Review.

--
_/ _/ _/_/ _/_/_/ _/_/_/ Lewis B Sheiner, MD (lewis@c255.ucsf.edu)
_/ _/ _/ _/_ _/_/ Professor: Lab. Med., Bioph. Sci., Med.
_/ _/ _/ _/ _/ Box 0626, UCSF, SF, CA, 94143-0626
_/_/ _/_/ _/_/_/ _/ 415-476-1965 (v), 415-476-2796 (fax)


*****


From: LSheiner <lewis@c255.ucsf.edu>
Subject: Re: An approach for imputing missing independent variable (covariate)
Date: Wed, 20 Sep 2000 10:00:19 -0700

"Gibiansky, Leonid" wrote:
>
> Vladimir,
>
> Essentially, you allow your covariate to "float" so that the imputed missing
> value would not "disturb" your model. My impression is that it is the same
> as using NEWCOV instead of COV, where
>
> NEWCOV = COV (if COV is not missing)
> NEWCOV = THETA(10)+ETA(10)
>
> $THETA
> ...
> 0 ; or any reasonable initial value and range
>
> $OMEGA
> ....
> HUGE FIXED; to allow any value that is convenient for the model
>
> Is there any difference from your approach?

Yes there is; see my note of earlier today.

What Leonid has here is one of the methods I discussed in my first response on this issue: it effectively integrates the likelihood across the missing data, and is formally correct.

And again the caution: if the data are missing non-ignorably, bias can result (see: Little & Rubin, Statistical Analysis with Missing Data, NY: Wiley, 1987).

> The POSTHOC value for NEWCOV should
> be equal to the result of your iteration scheme. Alternatively, you may
> first model the covariate distribution independently (approximate it by the
> normal distribution, if possible, and find its mean and variance), and then fix
> theta(10) and omega(10) at those values. In this case, you place some
> restrictions on the missing covariate value by using the distribution of the
> non-missing covariate values.

Strictly speaking, the model for the missing data should be (as my other note of today indicates) based on ALL the observed data, not just the non-missing covariate values. That is, it should use the information in DV as well. To see this, consider the case of modeling y based on the single covariate x, and imagine, unknown to the analyst, that y = x exactly. If some x's are missing, and the model that missing x = (mean of observed x's) +/- (std dev of observed x's) is used to marginalize the likelihood, then the apparent correlation between x and y will not be perfect. Again, what's wrong in the limit is likely wrong away from the limit.

--
_/ _/ _/_/ _/_/_/ _/_/_/ Lewis B Sheiner, MD (lewis@c255.ucsf.edu)
_/ _/ _/ _/_ _/_/ Professor: Lab. Med., Bioph. Sci., Med.
_/ _/ _/ _/ _/ Box 0626, UCSF, SF, CA, 94143-0626
_/_/ _/_/ _/_/_/ _/ 415-476-1965 (v), 415-476-2796 (fax)


*****


From: "Piotrovskij, Vladimir [JanBe]" <VPIOTROV@janbe.jnj.com>
Subject: RE: An approach for imputing missing independent variable (covariate)
Date: Thu, 21 Sep 2000 09:55:51 +0200

Leonid,
If I understood your suggestion correctly, it will perhaps work using first-order conditional estimation, but most probably will not work with the FO method.

Best regards,
Vladimir


*****


From: "Piotrovskij, Vladimir [JanBe]" <VPIOTROV@janbe.jnj.com>
Subject: RE: An approach for imputing missing independent variable (covariate)
Date: Thu, 21 Sep 2000 14:34:03 +0200

>First, let me ask Vladimir why he says his method operates "without assuming any
>explicit model for a covariate". The inverted model for the DV is an explicit model
>for the covariate, is it not?

Sorry, my phrasing was indeed ambiguous. What I meant by "explicit model" was a model like THETA(.) + ETA(..), where we explicitly assume a normal distribution for the covariate.

>More importantly, however, Vladimir's approach has
>at least two problems: (i) it is non-convergent: each data imputation
>at step 4 generates a different data set, which will yield
>a different estimate at step 5. This will never stop. (ii) Even
>if it converges "well enough" to a "region",
>it will not yield correct standard errors.

I believe the algorithm will converge; however, I don't think I will have time to check this, or to assess the magnitude of the bias (unless I myself have to model data with missing covariate values; currently I do not have such a problem). The data set remains essentially unchanged, except that missing IDVs are substituted by estimates obtained at the previous iteration. I presume this will work nicely if the proportion of missing values is small (20% as in my example, or less). I believe "correct standard errors" are a kind of unachievable ideal even if there are no missing predictors at all.

>To see (ii), imagine the (absurd) situation that all but two data
>points from one individual were missing: the algorithm would wind up filling in
>all missing data points from the line defined by the two actual observations
>(without any error) and would eventually
>report perfect precision for the estimate of the slope
>and intercept defined by the two observations.
>This is not to say that anyone would try such an analysis; it merely
>points out that the method fails as it approaches a limit, which
>should make one suspect that it will have problems,
>perhaps of lesser severity, away from that limit. The reason for
>the problem is that uncertainty in the (posthoc) parameter
>estimates is ignored (more on this below).

In this absurd situation no imputation can be made at all. Multiple imputation will probably fail as well.

>The more difficult issue, though,
>is how to compute standard errors? The standard errors
>from the last step of Vladimir's last iteration
>can't be right, as these are conditional on the imputed data,
>treating them as known, when in fact they are unknown.

Missing values are unknown by definition, and I am not sure multiple imputation can change this.

>A simpler method, which doesn't require an invertible
>function such as Vladimir's, and which is
>theoretically sound (i.e., gives unbiased estimates and
>correct standard errors), is multiple imputation.
>This method requires the ability to draw samples of the missing
>data from their posterior distribution.

This is what I wanted to avoid: sampling covariates from an (unknown) distribution.

Best regards,
Vladimir


*****


From: LSheiner <lewis@c255.ucsf.edu>
Subject: Re: An approach for imputing missing independent variable (covariate)
Date: Thu, 21 Sep 2000 10:04:52 -0700

I may have misunderstood Vladimir's method, but if I did not, then the missing values are simulated with epsilon errors, and the method cannot formally converge. One could substitute the expected values (no noise), and then the algorithm would have the flavor of the EM algorithm, and should converge, but it is not EM, and I am not sure whether it is unbiased. In any event, it would still have the standard error problem.

Despite Vladimir's concerns, it is a fact that correctly done multiple imputation can deal with any amount of missing data, and gives unbiased estimates of standard errors, because it explicitly measures and adds in the variability caused by not knowing the missing data. Recall that the multiple imputation algorithm computes the estimation variance (the square of the standard error) as the sum of (i) the covariance of the estimates across the different imputations and (ii) the average covariance from the estimations using the imputed data. The uncertainty due to the lack of the missing data is captured by (i); if one uses the standard error from the last step of the last iteration of Vladimir's method (which I suggested as a possible choice, not he; he made no suggestion regarding SEs), then one would have only the too-small component (ii).

I again advise anyone facing a serious missing data problem to read the relevant and extensive statistical literature.

LBS.
--
_/ _/ _/_/ _/_/_/ _/_/_/ Lewis B Sheiner, MD (lewis@c255.ucsf.edu)
_/ _/ _/ _/_ _/_/ Professor: Lab. Med., Bioph. Sci., Med.
_/ _/ _/ _/ _/ Box 0626, UCSF, SF, CA, 94143-0626
_/_/ _/_/ _/_/_/ _/ 415-476-1965 (v), 415-476-2796 (fax)


*****


From: "Piotrovskij, Vladimir [JanBe]" <VPIOTROV@janbe.jnj.com>
Subject: RE: An approach for imputing missing independent variable (covariate)
Date: Fri, 22 Sep 2000 15:11:58 +0200

There is no doubt multiple imputation is superior. However, in real life, population pharmacokineticists doing data analysis do not use it. They face a dilemma: either exclude records with missing covariates (usually just a few) or perform a reasonable imputation. Most often, means or medians of the available values are used, which is clearly not the best solution, as it introduces bias into the final model parameter estimates. My approach was aimed at reducing that bias by using information contained in the DV. However, the risk of bias still remains, because the final estimates are affected by imputed values which are treated as true covariate values.

What I would suggest as a palliative:
1. Develop a model using the available covariates only (some information is lost, but there is no bias, and the SEs are OK).
2. Generate estimates for the missing covariates by inverting the regression equations, as I suggested, but DO NOT perform iterations.
3. Do the posthoc step only, to obtain conditional estimates for the individuals with missing covariates (see the sketch below).
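
A minimal NM-TRAN sketch of step 3, assuming the final estimates from step 1 have been entered as the initial values in the control file:

$ESTIMATION MAXEVAL=0 POSTHOC ; no iterations: posthoc (empirical Bayes) step only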

Vladimir