```
From:  "Gordi, Toufigh" Toufigh.Gordi@cvt.com
Subject: [NMusers] Simulation vs. actual data
Date: Tue, June 14, 2005 2:27 pm

Dear all,

I have a trivial question I hope I can get help with. I have data
from different doses of a compound and want to show a plot of
concentration mean (+/-SD) values vs. time and a fit of the model
curve that, hopefully, goes through the values. To get the model
fitting curve, I plan to simulate a typical subject per dose group
receiving doses used in the study. My question is what simulated
values to use for the curve. A population curve does not seem to
be a good choice: the observed data contains known and unknown
errors, whereas a population curve is the true value of the
population and thus the fit might not include all the observed
values. The simulated IPRED or DV for a single subject per dose
group is certainly not the curve either.

The only way I can think of is to simulate a large number of subjects
per dose group and generate a mean curve from the simulations. If
this is the correct way, I would use the simulated DV values to get
the mean values, since DV values include all the errors associated
with the real observations and should give the best curve of the
observed data.

Any comments?

Toufigh Gordi
_______________________________________________________

From: "Nick Holford" n.holford@auckland.ac.nz
Subject: Re: [NMusers] Simulation vs. actual data
Date:  Tue, June 14, 2005 2:48 pm

Toufigh,

I think you are describing a visual predictive check i.e. simulate from your final
model with say 1000 subjects, then calculate the average and 90% empirical
prediction interval from the resulting 1000 predictions at each time point.

You can plot the average and 90% PI over time and overlay the observed
concentrations. About 90% of your observations should lie within the 90% PI. The
average vs time can be considered the population behaviour while the 90% PI shows
the extent of variability including random effects in parameters and random effects
due to residual error.

If you think you have linear kinetics you can dose normalize the observed
concentrations to some standard dose and do the simulation with this standard dose.
This let's you use one plot to display the results and expected profile for all
doses you have studied.

Nick
_______________________________________________________

From: "Liping Zhang" ZHANG_LIPING@lilly.com
Subject: Re: [NMusers] Simulation vs. actual data
Date: Tue, June 14, 2005 2:57 pm

I totally support your thoughts. If you want, other than the mean, you can also
get the confidence band (or like you said +/-SD) around the simulated curve.

Depend on how precise and what the simulation will be used for, you can
elaborate the simulation":

you can simulate use the estimated typical structure parameter value, with
inter-subj variation and random error, so a large number of subjects within
one simulation will do. The simulation does not account the uncertainty in
thetas and the band reflects inter-subject variability and the size of random
error.

You can also simulate different populations. For each simulation, first sample
the parameters from their variance-covariance metrics, and then simulate using
the sampled subjects. You then do somekind of summary statistic on the simulated
results. It needs large number of simulations and large number of subjects
within each simulation. It is a labor-intensive process but gives you a more
realistic view of potential outcomes.

Best regards, Liping

Liping Zhang, PhD

Global PK/PD Modeling and Trial Simulation
ELi Lilly & Company
Work: 317-277-8687       Fax: 317-433-6661
DC 0734
Email: zhang_liping@lilly.com
_______________________________________________________

From: "Kowalski, Ken" Ken.Kowalski@pfizer.com
Subject: RE: [NMusers] Simulation vs. actual data
Date: Wed, June 15, 2005 4:49 pm

NMusers,

We seem to be using terms like confidence bands and prediction intervals
somewhat loosely and interchangeably.  To provide some clarity here are
some definitions I use for the different types of statistical intervals
that are often constructed for different purposes:

Tolerance Interval - If we simulate DVs from our model and use the
mean +/- some multiple of the SD (across subjects at a particular time
point) or use the percentile method to obtain a lower and upper bound
(of the individual responses at a particular time point), the resulting
interval is something akin to what is known in the statistical literature
as a "tolerance interval".  Here the interest is in characterizing the
interval that contains a certain percentage of the individual observations.
However, to be a truly valid tolerance interval, such an interval should
also take into account the uncertainty in the parameter estimates and a
confidence level is associated with the interval.  Tolerance intervals
are used to makes statements like "I'm 90% confident that the interval
(LTL, UTL) will contain 80% of the individual observations (i.e, if we
repeat the study an infinite number of times, 90% of the intervals should
contain 80% of the individual observations)."

Prediction Interval - If we simulate DVs from our model and calculate the
mean (across subjects) conditioning on some specific design (with a specified
number of subjects, n) and repeat this process for N simulated datasets where
we use a different set of population estimates based on the parameter
uncertainty for each of the N simulated datasets, then the grand mean (of
the N means) +/- some multiple of the SD (across the N means) or use the
corresponding percentile method to obtain a lower and upper bound, the resulting
interval is a prediction interval on the future mean response of n subjects.
A valid prediction interval takes into account the parameter uncertainty as
well as the sampling variation in Omega (sampling of subjects) and Sigma
(sampling of observations).    Prediction intervals are used to make statements
like "I'm 90% confident that the interval (LPL, UPL) contains the future mean
response of n subjects (i.e., if we repeat the study an infinite number of times,
90% of the intervals will contain the future mean response of n subjects)."

Confidence Interval - If we simulate DVs from our model in a similar fashion as
for the prediction interval but where we choose n to be infinitely large in
computing the mean across the n subjects then we are effectively averaging out
the sampling variation in Omega and Sigma and the resulting interval only reflects
the uncertainty in the parameter estimates of the model.  Confidence intervals are
used to make statements like "I'm 90% confident that the interval (LCL, UCL) contains
the true population mean response (i.e., if we repeat the study an infinite number
of times, 90% of the intervals will contain the true population mean response).

In general confidence intervals have the shortest width followed by prediction
intervals and then tolerance intervals.  However, prediction intervals can be
wider than tolerance intervals when n (number of future subjects) is small.
For example, when n=1 where we want to predict the value of a future single
observation the prediction interval is typically wider than a tolerance interval.
For more information on different types of statistical intervals, see

Hahn, "Statistical Intervals for a Normal Population, Part I.  Tables, Examples
and Applications", J. of Quality Technology, 1970; 2:115-125.

I agree with Liping that taking into account parameter uncertainty by assuming
that the parameters estimates come from a multivariate normal distribution using
the population estimates and corresponding covariance matrix of the estimates
can be labor-intensive.  However, this does not mean that it is computationally-intensive,
in fact quite the contrary.  One can generate an N=1000 sample of population parameters
(thetas, omegas, and sigmas) from the multivariate normal distribution in a matter of
minutes even for mean parameter vectors and covariance matrices of the dimensions typical
of a pop PK or pop PK/PD model.  The laborious aspect of the work comes from the fact
that we don't have automated utilities to do this work and so we have to do a lot of
custom coding (pre- and post-processing) to generate the sample parameter estimate
vectors (thetas, omegas, and sigmas), pass them in to NONMEM to perform the simulations,
and then post-process the simulated results to calculate the various intervals of interest.
However, the process can also be computationally-intensive if the distribution of the
parameter estimates does not follow a multivariate normal distribution.  In this setting
we may have to perform non-parametric bootstrapping (sample with replacement of subjects
from the observed dataset) to get the N=1000 sample of population parameters from the
empirical bootstrap distribution from fitting the model to each of 1000 bootstrap datasets.
Unless, I'm already performing nonparametric bootstrapping for other purposes, I typically
assume the multivariate normal distribution when taking into account parameter uncertainty
simply because it is computationally less intensive.  My philosophy is that it is better
to do something to take into account parameter uncertainty rather than to completely ignore it.

Ken
_______________________________________________________

From: "Nick Holford" n.holford@auckland.ac.nz
Subject: Re: [NMusers] Simulation vs. actual data
Date: Sat, June 25, 2005 2:53 am

Ken,

Thanks you for offering some definitions for various statistical intervals.

I offered the following to Toufigh to describe what I have been calling a prediction
interval.

"There are two ways to define distributions. The most common is to assume a theoretical
distribution e.g. normal. If you make this assumption you can then use the estimates
of the distribution parameters to predict intervals which include 90% of the
distribution e.g. 1.65*StDev. Note that you make a big assumption that the values are
normally distributed and in practice this is often not the case e.g. if your clearance
is log normally distributed then the distribution of predicted Css will not be
normally distributed. I prefer not to make the normal assumption.

An alternative is to use an empirical distribution based on data. By simulating 1000
observations at a particular time you have an empirical distribution (emp_dist) of the
predicted values. If you find the percentile(emp_dist,0.05) and
percentile(emp_dist,0.95) then you define the empirical 90% prediction interval. I use
the percentile() function in Excel to do this."

It seems that the method I describe comes closest to what you call a tolerance
interval i.e. generate an empirical distribution by simulating DVs from the model
using the design of the data used to estimate the model parameters. The method might
be called a degenerate tolerance interval (following the terminology of Yano, Beal &
Sheiner 2001) because it does not account for uncertainty in the parameters. Would
you agree?

I am not clear about what the distinction is between your tolerance and predictive
intervals. It seems that predictive intervals are obtained with "specific designs"
perhaps different from the original data. Is this the only difference between a
tolerance and a predictive interval ie. the design used to generate the simulations?

Confidence intervals seem to be the same as tolerance intervals but use a large
number of simulated DVs to get some kind of asymptotic behaviour. You don't say how
many replications are used to generate tolerance and predicitive intervals. Am I
correct in saying that confidence intervals and tolerance intervals differ only in
the number of replications used to generate the empirical distribution? Does the
method I describe to Toufigh correspond more to a confidence interval than a
tolerance interval because of the large number (1000) replications?

Yano Y, Beal SL, Sheiner LB. Evaluating pharmacokinetic/pharmacodynamic models using
the posterior predictive check. J Pharmacokinet Pharmacodyn 2001;28(2):171-92
--
Nick Holford, Dept Pharmacology & Clinical Pharmacology
University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand
email:n.holford@auckland.ac.nz tel:+64(9)373-7599x86730 fax:373-7556
http://www.health.auckland.ac.nz/pharmacology/staff/nholford/
_______________________________________________________

From:  "Kowalski, Ken" Ken.Kowalski@pfizer.com
Subject: RE: [NMusers] Simulation vs. actual data
Date: Tue, July 5, 2005 12:30 pm

Hi Nick,

Yes, the method you describe comes closest to what I describe as a tolerance
interval in that you are trying to characterize the variability
(distribution) of individual responses.  Whereas confidence intervals try to
characterize the variability (uncertainty) in the population mean response.
In the univariate normal case the distinction boils down to whether you
construct intervals of the form

Xbar +/- k*SD   or   Xbar +/- k*SE   (where SE = SD/sqrt(n))

where the former characterizes the variability in the individual
observations (SD) while the latter characterizes the variability or
uncertainty in the mean response (SE).  If Yano et al. define a "degenerate"
tolerance interval as one that does not attach any confidence statement to
the interval because parameter uncertainty is not taken into account I
suppose I can accept that definition.

Prediction intervals are more closely related to confidence intervals in
that we are often trying to characterize uncertainty in some kind of mean
response (not necessarily the population mean response) rather than the
distribution of individual responses.  The mean response in this case is
conditioned on some design or sample of a fixed number of future (predicted)
observations.  When this fixed number of predicted observations approaches
infinity then the mean we are characterizing is the population mean and the
prediction interval (for a mean of an infinite number of predicted
observations) collapses to a confidence interval on the population mean.
When the fixed number of observations is small, then the distribution of the
mean of n predicted responses will be considerably wider than the
distribution of the population mean prediction.  Moreover, when n=1 then the
prediction interval is on a single observation which can be considerably
wider than tolerance intervals as I noted in my previous message.

I hope this helps clarify the distinctions between these types of intervals.

Ken
_______________________________________________________

From: "Nick Holford" n.holford@auckland.ac.nz
Subject: Re: [NMusers] Simulation vs. actual data
Date:  Tue, July 12, 2005 1:50 pm

Ken,

Thanks for your further elaboration. Your boiled down example is perhaps somewhat
over simple for nmusers because I think we need to consider 3 kinds of random
effects when constructing intervals for model qualification and prediction.

1. Population Parameter Variability (PPV). The sum of between (BSV) and within (WSV)
subject variability in model parameters such as Emax and EC50.

2. Residual Unidentified Variability (RUV). The residual error describing the
difference between individual subject predictions and observations.

3. Parameter Uncertainty. This is often described by the parameter standard error
but might be more reliably described by an empirical distribution of parameter
estimates obtained by a non-parametric bootstrap.

Your example only includes one level of random effect (which could be either PPV or
RUV depending on the context) and derives intervals by the commonly used asymptotic
assumption.

The kind of parametric simulation procedure that nmusers might use has been
described by Yano et al. (2001) in their investigation of the posterior predictive
check (PPC). As in your own example this study was limited to only one level of
random effect (they used RUV) and examined what might be learned from including
parameter uncertainty.  They concluded "No clear adûantage for one or another method
of approximating the posterior distribution on model parameters is found." i.e.
including uncertainty in parameters did not improve the power of the PPC to aid in
evaluating model performance. Dropping the 'posterior' i.e. the uncertainty in the
parameter distribution and just sampling from what they call the 'degenerate
posterior' may be adequate. Yano et al discuss why this might be the typical case
for many NONMEM type analyses.

On the basis of this rather limited study one should be aware that in other
situations the inclusion of uncertainty could be quite important but the relevance
to typical NONMEM analyses is currently unclear. At least in some settings the
simple predictive check (SPC; Gobburu et al. 2000) which uses the final point
estimates of the parameters (the degenerate posterior distribution) for simulation
without including uncertainty can give useful diagnostic information.

In trying to use a consistent terminology for the various intervals used to describe
the time course of response I wonder if you would accept the following:

Confidence Interval: Describes the uncertainty in the mean response. It could be
constructed by a non-parametric bootstrap and using the resulting parameters for
each bootstrap run to predict the population response at each time point. The
distribution of these population responses obtained from say 1000 bootstrap runs can
be used to define the confidence interval. The confidence interval says nothing
about PPV or RUV but reflects only the estimation uncertainty in the population
parameters. I am not aware of any published use of this kind of interval applied to
NONMEM analyses but would be interested to hear of this application.

Prediction Interval: Decribes the variation in individual response which is
attributable to PPV and RUV. It may be obtained by a parametric bootstrap (e.g.
using \$SIM with NSUBPROBLEMS=1000 in NONMEM) based on the final parameter estimates
for the model that is being evaluated. A 90% interval constructed from the empirical
distribution of individual predicted responses (with residual error) at each time
should contain 90% of the observed responses at that time. This is the interval and
method that is used for the visual predictive check (VPC; Holford 2005). The
procedure is the same as the SPC and degenerate PPC. It has been frequently referred
to as a posterior predictive check (e.g. Duffull 2000).

Tolerance Interval: Describes the uncertainty in the prediction interval by
including uncertainty in the parameter estimates. This could be done using the same
procedure as the SPC but sampling from the covariance matrix of the estimate in
addition to the variance-covariance matrix for OMEGA and SIGMA. I am not aware of
anyone who has done this with NONMEM with both PPV and RUV but would be interested
if someone could report any such experience.

In their definition of PPC, Yano et al. did not include the generation of an interval:
"The PPC compares a statistic (T) computed on the observed data to the distribution
of that statistic under a candidate model fitted to the data to derive a p value,
which we denote by pPPC."
However, it is implicit in their methodology for calculating the probability of a
response (pPPC).

I would suggest that it might be better if the term PPC was restricted to the case
where parameter uncertainty is included in the simulation process because this
explicitly recognizes the role of the non-degenerate posterior distribution.

I think that an interval which describes the variability in individual responses
(prediction interval) is more commonly of interest than variability in the
population response (confidence interval). A tolerance interval has some theoretical
advantage over the prediction interval by being a bit more conservative (i.e. wider
intervals) but most of the merits of this kind of model qualification approach will
be seen in the computationally more convenient prediction interval. It is directly
applicable for evaluating the performance of a model to describe existing
observations and for illustrating to non-pharmacometricians what might be expected
in a typical patient population. The nomenclature could stand some improvement so
that we can use these terms more precisely.

Nick

Duffull SB, Chabaud S, Nony P, Laveille C, Girard P, Aarons L. A pharmacokinetic
simulation model for ivabradine in healthy volunteers. Eur J Pharm Sci
2000;10(4):285-94

Gobburu J, Holford N, Ko H, Peck C. Two-step model evaluation (tsme) for model
qualification. In: American Society for Clinical Pharmacology and Therapeutics.
Annual Meeting; 2000; Los Angeles, CA, USA; 2000. Abstract.

Holford NHG. The Visual Predictive Check – Superiority to Standard Diagnostic
(Rorschach) Plots PAGE 2005.
http://www.page-meeting.org/default.asp?id=26&keuze=abstract-view&goto=abstracts&orderby=author&abstract_id=738

Yano Y, Beal SL, Sheiner LB. Evaluating pharmacokinetic/pharmacodynamic models using
the posterior predictive check. J Pharmacokinet Pharmacodyn 2001;28(2):171-92
_______________________________________________________

From:  "Perez Ruixo, Juan Jose [PRDBE]" JPEREZRU@PRDBE.jnj.com
Subject: RE: [NMusers] Simulation vs. actual data
Date:  Tue, July 12, 2005 2:52 pm

Dears,

Interesting discussion.

For the tolerance interval, I think it would be more accurate to sample
(THETAs, OMEGAs and SIGMAs) from the non-parametric bootstrap replicates.
For each replicate, NSUBPROBLEMS=100 should be implemented. Then, the 90%
tolerance interval can be calculated from the empirical distribution of all
individual predicted responses (with residual error) at each time. That's
how I implemented it previously for a simulation exercise (see reference
below).

The tolerance interval constructed as described above should contain 90% of
the observed responses at each time. In my experience, the prediction
interval as Nick described could be too narrow to include the 90% of the
observed response at each time, specially if uncertainty is high (relative
to PPV and RUV).

Reference: Jolling K, Perez-Ruixo JJ, Hemeryck A, Vermeulen V, Greway T.
Mixed-effects Modelling of the Interspecies Pharmacokinetic Scaling of
Pegylated Human Erythropoietin. Eur J Pharm Sci 2005; 24: 465-475.

Regards,
Juanjo.
_______________________________________________________

From:  "Nick Holford" n.holford@auckland.ac.nz
Subject: Re: [NMusers] Simulation vs. actual data
Date:  Tue, July 12, 2005 3:59 pm

Juanjo,

Thanks for your comments and for the reference to your paper on pegylated EPO PK. In
this paper you describe the process for generating a tolerance interval as follows:

"In addition, the final model was used to simulate the pharmacokinetic profile of
PEG-EPO after i.v. and s.c. administration of a single dose of 1000 ug to 100 male
subjects...Uncertainty in fixed and random parameter estimates was considered during
the simulations by replicating the above-mentioned process 30 times using different
values of fixed and random parameters, randomly selected from the estimates of the
bootstrap
replicates."

If I understand you correctly this means simulating profiles in 3000 subjects with
each block of 100 subjects having a different set of fixed and random effect
parameters sampled from the non-parametric bootstrap. You then used the empirical
distribution of 3000 responses at each time point to construct the interval.

The experience you mention below agrees with the theoretical expectations I
mentioned for the tolerance interval compared with the prediction interval but I
cannot find anything in your paper to support your statement experimentally. Your
Figure 4 shows 90% tolerance intervals for extrapolations to humans from non-human
species. It has no observations for a visual predictive check. Did you try
generating tolerance and prediction intervals for the species you had data for? If
so, what fraction of observed values would lie within the tolerance interval you
generated and what fraction would lie within a prediction interval generated with a
single set of fixed and random effects parameters e.g. using the final model
parameter estimates? I am interested in getting a concrete example of just how much
different the tolerance and prediction intervals might be for a real NONMEM
analysis.

On another topic - I note your final model used an empirical allometric model using
a combination of total weight and brain weights. The brain weights were in fact just
total weight multiplied by an empirical adjustment factor and not truly individual
brain weights. This makes it impossible to assert that brain weight itself is an
independent covariate for the prediction of clearance. As there is no plausible
mechanistic reason to believe that the brain is an important clearance organ for EPO
I would think the apparent benefit of adding 'brain weight' to the model is an
illustration of selection bias (Ribbing J, Jonsson EN. Power, Selection Bias and
Predictive Performance of the Population Pharmacokinetic Covariate Model. Journal of
Pharmacokinetics and Pharmacodynamics 2004;31(2):109-134)

You did not report results for a simple allometric model based on body weight alone
with the theoretical exponents of 3/4 for CL and Q and 1 for V1 and V2. xYou report
a reference model with estimated allometric exponents. Did you try using a simple
allometric model and test if the CI for the exponents were different from the
theoretical values?

Nick
_______________________________________________________

From: "Perez Ruixo, Juan Jose [PRDBE]" JPEREZRU@PRDBE.jnj.com
Subject:  RE: [NMusers] Simulation vs. actual data
Date: Wed, July 13, 2005 7:40 am

Nick,

Your interpretation of the process described to generate a tolerance
interval is correct. In theory, I think it's better approach to sample from
a non-parametric bootstrap distribution than just "sampling from the
covariance matrix of the estimate in addition to the variance-covariance
matrix for OMEGA and SIGMA".

I agree it's easier to use the prediction interval than the tolerance
interval. Also, the theoretical advantage of the tolerance interval in some
cases could not be relevant (low uncertainty), but in some other it could be
the only choice (high uncertainty). My suggestion is to test (on case by
case basis) both intervals, and if both give similar results then move
forward with the prediction interval, otherwise use the tolerance interval.

At risk of getting me out of the tolerance interval of my wife :-), I did
some simulations at home to answer your questions. However, I used other
datasets where I had a model with between and within subject variability in
several parameters. The model was built with the dataset A, and the 80%
prediction and tolerance intervals were calculated for each concentration in
the dataset B. The prediction and the tolerance intervals contained 81.4%
and 79.4% of the observations, respectively. Probably both numbers are very
similar because RSE for fixed and random effects are lower than 15% and 40%,
respectively. Therefore, in this case (low uncertainty) I would continue the
simulation and/or evaluation work without considering the uncertainty. At
this stage, I don't have any example where I can show you the tolerance
interval is superior to prediction interval in terms of predictive
performance. Perhaps, anyone else may have it and would be very interesting
to see the results and the consequences.

One interesting thing I learned from this exercise is that tolerance
interval can be narrower than prediction interval. One potential reason is
that the estimates for two random effects fall in the upper part of the
non-parametric bootstrap distribution for the same parameter, but below the
90% confidence interval. So, when the uncertainty is considered, more
subproblems are simulated with lower variability and as a consequence, the
tolerance interval is narrower.

Finally, I also wonder if anyone in the nmuser list would like to share any
experience with prediction/tolerance intervals for categorical data. I guess
the way to calculate those intervals is a bit more complex

With respect to your comment on the allometric scaling, as you know
allometric model are empirical and not all equations relate directly to
physiology. In fact, body weight and brain weight has been commonly used to
predict the clearance of drugs in humans (Mahmood I, et al. Xenobiotica
1996). In particular, body weigh and brain weight has been recently used to
predict from animal to human the clearance of protein drugs, such rhuEPO and
EPO-beta (Mahmood I. J Pharm Sci. 2004). I agree that brain is not an
important clearance organ for EPO, however brain weight was tested on the
basis of Sacher equation, which relates body weight and brain weight to the
maximum lifetime potential (MLP). MLP is a measurement of the chronological
time for a particular species, necessary for a particular physiological
event to occur. The shorter the MLP, the faster the biological cycles occur.
One may for instance consider the drug elimination as the physiological
event to occur and, then MLP (or brain weight) could be used to explain the
difference in drug clearance across species with similar body weight. In
fact, the brain weight in rabbit (0.56% of body weight) is lower than the
brain weight in monkey (1.80% of body weight). So, given the same body
weight for both species (see figure 3 of the paper), then MLP in rabbits is
shorter relative to monkeys (0.76 x 105 h versus 1.62 x 105 h) and,
therefore, the PEG-EPO clearance in rabbits is faster as compared to
monkeys.

The reference model we reported is a simple allometric model based on body
weight alone. From the RSE, you can see that 95%CI were not different from
the theoretical value. Even in the final model you can derive the "real"
exponent of body weight for CL. In order to do that, it is needed to
consider the effect of brain weight because of its proportionality to body
weight, within a particular species. Therefore, 1.030 (apparent exponent of
weight) cannot be directly compared to 0.75, without taking into account the
exponent of brain weight. Doing so, 1.030 - 0.345 = 0.685 is obtained as the
"real" exponent of weight, which is very similar to the expected 0.75. I
understand the real exponent of body weight is 0.75.

Regards,
Juanjo.
_______________________________________________________

From:  "Kowalski, Ken" Ken.Kowalski@pfizer.com
Subject: RE: [NMusers] Simulation vs. actual data
Date: Thu, July 14, 2005 1:22 pm

Nick,

To make distinctions between the 3 different types of statistical intervals
we only need consider two major classes of variability: sampling variability
and parameter uncertainty.  Here sampling variability can have multiple sources
of variation due to sampling of subjects (BSV), different occasions within a
subject (WSV), and sampling of observations (RUV).  Moreover, parameter uncertainty
can be multivariate and can be thought of as the trial-to-trial variation in that
we would get a different set of population parameter estimates each time we repeat
the trial.

I still like my simple univariate normal mean model example which has only one
source of sampling variability and one parameter to illustrate the differences
in these intervals.  In the univariate normal example the sampling variability
is estimated as the sample standard deviation (SD), the population mean parameter
is estimated by the sample mean (Ybar), and the parameter uncertainty is the
standard error of the mean (SE=SD/sqrt(N)).   With these definitions I'll take
another stab at defining the general forms of the 3 types of statistical intervals
for the univariate normal mean model case and draw analogies to the more complex
population modeling setting.

Degenerative Tolerance Interval:  Ybar +/- k*SD.  This is the form of a degenerate
tolerance interval where we assume the population mean (mu) and standard deviation
(sigma) are known without uncertainty given by the estimates Ybar and SD, respectively.
It is degenerative in that we put all the probability mass on our point estimates (Ybar
and SD) where a degenerative distribution is one in which the distribution has only
one value with probability=1.  The inference for this interval is on the individual
Y's since the SD is the standard deviation of the individual Y's.  If we assume the
Y's are normal with Ybar and SD known (without uncertainty), we know from our first
course in probability and statistics that Ybar+/-1*SD, Ybar+/-2*SD, and Ybar+/-3*SD
contain approximately 68, 95, and 99.7% of the individual Y values, respectively.

Degenerative Tolerance Interval:  Population PK Example.  To draw an analogy to
population modeling and construction of degenerative tolerance intervals consider
the following example.  Suppose we conduct a single dose PK study with N=12 healthy
subjects at each of 5 doses with dense sampling of n time points (observations) per
subject.  If we fit a population PK model to estimate thetas, Omega, and Sigma, then
use the model (and estimates without uncertainty) to simulate individual Y values for
say 1000 hypothetical subjects at each dose for each time point we can then calculate
Ybar and SD at each dose/time point across the 1000 hypothetical subjects and construct
degenerative tolerance intervals of the form Ybar +/- k*SD.  However, typically what
we do is compute Ybar as our point estimate and construct an asymmetric tolerance
interval using the percentile method (i.e., instead of using +/- k*SD to determine
the endpoints of the interval the appropriate lower and upper quantiles from the
empirical distribution of the simulated Y values are used).

Prediction Interval for a Single Future Observation:  Ybar +/- k*SD*sqrt(1+1/N).  To
calculate the variability in a single future observation (Ynew) we add the sampling
variance (SD^2) for the new observation and parameter uncertainty variance (SE^2) to
obtain Var(Ynew) = SD^2 + SE^2 = (SD^2)*(1+1/N) and hence SD(Ynew) = SD*sqrt(1+1/N).
Note that when we ignore parameter uncertainty to obtain a degenerative prediction
interval or when N is very large such than SE is near 0 the resulting interval collapses
to the same as a degenerative tolerance interval since SD(Ynew) = SD in this setting.
This may in part explain some of the confusion we have with the terminology.  The
distinctions between these intervals collapse when we don't take into account parameter
uncertainty.

Prediction Interval for a Single Future Observation:  Population PK Example.  To draw an
analogy to population modeling using the population PK example I just described, consider
taking into account parameter uncertainty by assuming the posterior distribution of the
parameter estimates follow a multivariate normal distribution with mean vector of parameter
estimates and covariance matrix of the estimates from the COV step output.  Suppose now we
obtain 1000 sets of population parameters (thetas, Omegas, and Sigmas) from a parametric
simulation from this multivariate normal distribution.  Furthermore, if we simulate a
single hypothetical subject's profile for each set of 1000 population parameters we now
have a set of 1000 individual Y's at each dose/time point.  If we calculate the SD(Ynew)
across the 1000 individual values we have taken into account both sampling variation
(Omegas and Sigma) and parameter uncertainty (multivariate normal posterior distribution).
The resulting SD(Ynew) will be larger than the SD calculated for the degenerate tolerance
interval example.  However, suppose for the 1000 set of population parameters we set them all
to the same estimates obtained from the PK model fit (i.e., ignore parameter uncertainty).  If
we now simulate a single hypothetical subjects's profile for each of the 1000 sets of exactly
the same population parameters the degenerative prediction interval results in SD(Ynew) = SD
since it only takes into account the sampling variation in Omega and Sigma the same as for
the degenerative tolerance interval.

Prediction Interval for a Mean of M Future Observations:  Ybar +/- k*SD*sqrt(1/M + 1/N).  To
calculate the variability in the mean of M future observations
(Ybar-new = Sum(Ynew-i)/M, i=1,...,M) we note that the sampling variance for
Ybar-new is (SD^2)/M.  To obtain the variance of Ybar-new taking into account this
sampling variance as well as the uncertainty in Ybar we have that
Var(Ybar-new) = (SD^2)/M + (SE^2) = (SD^2)*(1/M + 1/N) and hence, SD(Ybar-new) = SD*sqrt(1/M + 1/N).
The inference  for this statistical interval is on the mean of a fixed number (M) of future
values rather than on individual observations.  However, when M=1, this interval reduces to
the prediction interval on individual observations.  When we perform a PPC where the statistic, T,
is a mean of the individual observations we are usually conditioning on some fixed known value of M.

Prediction Interval for a Mean of M Future Observations:  Population PK Example.  Going back to
the population PK example, suppose we want to perform a PPC where we use our single dose
population PK model to predict the mean PK profile where we have M=12 subjects per each of
5 dose groups (i.e., an internal PPC where we condition on the design of the study that we
used to develop the model).  To perform the PPC taking into account parameter uncertainty we
obtain 1000 sets of population parameters (thetas, Omegas, and Sigmas) sampling from the
multivariate normal posterior distribution.  Our statistic (T) of interest is the mean of the
Y's for each dose/time point across the M=12 subjects.  Thus, for each of the 1000 sets of
population parameter values we simultate M=12 individual Y's for each dose/time point.  We
then calculate the mean across the M=12 individual Y values where T = Ybar-new, and we then
have 1000 samples of the posterior distribution of Ybar-new.  If we calculate the sample
standard deviation of the means, i.e., SD(Ybar-new), we can construct a prediction interval
for Ybar-new and overlay the observed statistic, Ybar (N=12).  In practice we typically use
the percentile method to construct the prediction interval and specifically for PPC we might
show the histogram for the posterior predictive distribution of the statistic and overlay
the observed statistic as a reference line on the histogram.  If we don't take into account
parameter uncertainty then the resulting prediction interval and PPC can be thought of as
degenerative and will necessarily result in a narrower distribution of the statistic.

Confidence Interval on the Population Mean:  Ybar +/- k*SD*sqrt(1/N) = Ybar +/- k*SE.  The
confidence interval on the population mean can be thought of as a prediction interval where
M is infinitely large such that 1/M goes to zero.  In this setting the only uncertainty
that is reflected is the parameter uncertainty in Ybar (i.e., SE).

Confidence Interval on the Population Mean:  Population PK Example.  To illustrate the
construction of a confidence interval using the population PK example we can repeat the
process as above for the prediction interval for M observations where we take M to be
arbitrarily large, say M=2000 hypothetical subjects.  Again we simulate 1000 sets of
parameters values from the multivariate normal posterior distribution and for each set
we simulate M=2000 individual Y's for each dose/time point.  We then compute the sample
mean (Ybar-pop) of M=2000 subjects for each of the 1000 sets of population parameters.
We can then construct a confidence interval for the population mean by computing the
sample standard deviation of the set of 1000 Ybar-pop means (averaged over M=2000 hypothetical
subjects) or use the percentile method.  Note that in the simulations we take into account
the sampling variation (Omegas and Sigmas) of the individual Y's but when we average
over a sufficiently large M we effectively average out this sampling variation such
that the only uncertainty reflected in SD(Ybar-pop) is due to parameter uncertainty.
Also note that the population mean (Ybar-pop) and the typical individual prediction
(Yhat | ETAs=0) are not the same thus, if we don't take into account the sampling variation
and instead calculate the typical individual prediction for each of the 1000 sets of
population parameter values, the resulting confidence interval of the mean of the typical
individual predictions is not the same as the confidence interval of the population mean
prediction.  This is a particularly important distinction when working with noncontinuous
models.

With regards to whether we should take into account parameter uncertainty, I agree that
there are many instances where parameter uncertainty is inconsequential relative to
sampling variation in constructing tolerance and prediction intervals from our population
models.  Particularly, when all of the parameters and variance components are well-estimated
with good precision (small standard errors).  The problem comes in when we fit models to
"small-ish" datasets  or datasets that cover a limited range (e.g., limited range of doses
or exposures) such that one or more of the parameters are estimated with poor precision (large
standard errors).  Given the complexity of our population models and the multivariate nature
of the posterior distribution it is difficult to always know when parameter uncertainty is
large relative to sampling variation and should be accounted for when constructing tolerance
and prediction intervals.

I know of no other way to truly assess the importance of parameter uncertainty other than to
do it both ways (with and without parameter uncertainty).  Of course, if you gone through
the trouble of doing it both ways the purist in me says to report out the interval that
includes parameter uncertainty even if it doesn't make much difference.

If you are willing to assume that the posterior distribution of the parameters is multivariate
normal with mean vector corresponding to the parameter estimates and covariance matrix of the
estimates from the COV step output, then parametric simulations taking into account parameter
uncertainty are not computationally burdensome to perform.  However, taking into account
parameter uncertainty from a parametric sampling from the multivariate normal to construct
tolerance and prediction intervals may require additional custom coding to do so since we
don't have automated utilities that do this...at least not that I'm aware of.  On the other
hand, if you are unwilling to make the parametric assumption that the posterior distribution
is multivariate normal and thus are faced with performing nonparametric bootstrapping to
construct an empirical posterior distribution as Juanjo recommends, then this will be
considerably more computationally burdensome because of the need to re-estimate the parameters
for each bootstrap dataset.  Nevertheless, if one is planning to go ahead and perform
nonparametric bootstrapping anyway to construct bootstrap confidence intervals for the
parameter estimates, then you've already done the hard work of generating the empirical
posterior distribution of the parameters so why not use it to account for parameter
uncertainty when constructing tolerance and prediction intervals?

One of the reasons that I like to take into account parameter uncertainty when performing a
PPC is that when I do a degenerative posterior predictive check, the empirical distribution
may underestimate the variation in the statistic (T) such that the observed statistic may
fall outside the distribution.  If parameter uncertainty is substantial, taking it into
account will result in a wider empirical distribution of the statistic such that the
observed statistic may now lie reasonably within the distribution and thus I might be
more willing to accept the model as having adequate predictive performance for this statistic.

My apologies for such a long message and kudos to everyone who read it this far.

Ken
_______________________________________________________

From:  "Perez Ruixo, Juan Jose [PRDBE]" JPEREZRU@PRDBE.jnj.com
Subject: RE: [NMusers] Simulation vs. actual data
Date: Thu, July 14, 2005 7:31 pm

Ken,

Thanks for your well detailed explanation.

I understand to get N sets of population parameters (thetas, Omegas, and Sigmas) from
a non-parametric BS in order to simulate N clinical studies is computationally burdensome,
specially for large N (let say 2500).

I think the key question is how large N should be to account properly for the uncertainty.
The purist way says that we should run SUBPROBLEM=1 for each set of population parameters.
However, if a smaller N is able to account properly for the uncertainty then the
computational burdensome is less. Suppose, 50 non-parametric BS replicates are good
enough to account for parameter uncertainty, then would you consider to use 50 sets of
population parameters and run SUBPROBLEM=50 in order to get 2500 trial replicates?

Regards,
Juanjo.
_______________________________________________________

From: "Nick Holford" n.holford@auckland.ac.nz
Subject: Re: [NMusers] Simulation vs. actual data
Date: Thu, July 14, 2005 10:09 pm

Ken,

Thanks for your comprehensive and comprehensible response. I distinguished 2 levels
of random effects (PPV and RUV) because there are two kinds of interval that are of
interest depending on what random effects are used:
a) PPV only -- the 'true' variability in response independent of RUV
b) PPV and RUV -- the variability in observations of the response
Type a) is what would want to see for a prediction of the true response while type
b) is what you want to see when comparing the preidiction with an observation.

I can understand almost everything you say except:

1. What does M mean with respect to future observations? Your example suggests that
M is the number of subjects rather than the number of observations. But perhaps you
mean M is the number of observations conditioned on time and design within a single
trial?
2. Why do you use 'degenerative' instead of 'degenerate'?

I agree that the non degenerate tolerance interval is preferable over the degenerate
case but the dilemma comes when NONMEM spins the dice and won't reveal the results
of \$COV and the runtimes preclude a non-parametric bootstrap (a common occurrence
for me). In this case the degenerate tolerance interval (DTI) seems to be helpful.
If approx 90% of observations lie within the 90% DTI then one has a computationally
easy way to confirm model performance as a description of the data.

In my rather limited experience of visual predictive checks it is common to see more
than 90% of observations lying within the 90% DTI which points to a model failure to
capture the random effects properly. I have yet to see an example of more than 10%
of the data lying outside the 90% DTI which is what you would expect if uncertainty
is high.

Nick
_______________________________________________________

From: "Kowalski, Ken" Ken.Kowalski@pfizer.com
Subject: RE: [NMusers] Simulation vs. actual data
Date: Fri, July 15, 2005 8:28 am

Nick,

You wrote:

>1. What does M mean with respect to future observations? Your example
suggests that M is the number of subjects rather than >the number of
observations. But perhaps you mean M is the number of observations
conditioned on time and design within a
>single trial?

Yes, in my example it is the number of observations conditioned on time and
dose such that I'm averaging one observation per subject and hence M also
refers to the number of subjects (it is our primary sampling unit in
population modeling that we want to average over).  Note instead of means of
observations as my statistic, T, of interest, I could have used another
statistic such as a noncompartmental estimate of AUC which is based on n
observations (time points) per subject.   In which case I have M=12
subject's AUCs that I would average (e.g., geometric mean) as my PPC
statistic.

>2. Why do you use 'degenerative' instead of 'degenerate'?

My apologies, it should be 'degenerate'.

Ken
_______________________________________________________

From: "Kowalski, Ken" Ken.Kowalski@pfizer.com
Subject: RE: [NMusers] Simulation vs. actual data
Date: Sat, July 16, 2005 9:56 am

Juanjo,

If you perform 50 bootstraps but you are going to perform 2500 trial replicates
you shouldn't just replicate the 50 bootstrap sets 50 times to get 2500...that's
not random.  I suppose you could bootstrap with replacement the 50 bootstrap sets
to generate 2500 sets of population parameters, however, I wouldn't advise it.  I
you're going to generate 2500 trial replicates presumably it is because you are
interested in getting accurate tail probabilities.  50 bootstraps is not enough to
accurately nail down the tail probabilities and a "double bootstrap" procedure to
"artificially" generate 2500 bootstrap sets from an original 50 doesn't help (i.e.,
the 2500 are ultimately based on only 50 unique sets).  However, the 50 bootstrap
sets are probably good enough to estimate the sample covariance matrix from which
you can compare to the covariance matrix of the parameter estimates from the NONMEM
output (\$COV step).  In fact, this would be a good diagnostic to decide whether or
not to use the parametric multivariate normal assumption for the posterior distribution
when taking into account parameter uncertainty.

If the sample covariance matrix from 50 bootstraps is in fairly close agreement to
the covariance matrix reported by NONMEM then I would use the parametric multivariate
normal distribution using the covariance matrix estimated by NONMEM to account for
parameter uncertainty (i.e., generate 2500 unique sets of population parameters from
the multivariate normal posterior distribution).  On the other hand, if the sample
covariance matrix from the 50 bootstraps is substantially different than that estimated
by NONMEM, then you may be in a situation where it is not reasonable to assume that the
posterior distribution is multivariate normal and you may need to generate considerably
more than 50 bootstrap sets from a nonparametric procedure although I don't think you
necessarily need 2500 bootstraps...1000 bootstraps is probably sufficient if you want
to estimate the 5th and 95th percentiles.

Note that when the multivariate normal assumption doesn't hold it is often because the
distribution is asymmetric. This is why nonparametric bootstrap confidence intervals
on the parameter estimates can be quite asymmetric when they do differ from Wald-based
intervals using the NONMEM standard errors.  To really nail down this asymmetry in the
distribution to get accurate tail probabilities really requires a lot more than 50
bootstraps.  So, if you routinely advocate doing nonparametric bootstrapping to account
for parameter uncertainty out of concern for this potential asymmetry then I think
you've got to bite the bullet and do more than 50 bootstraps.

Your suggestion of doing 50 bootstraps prompts me to respond to the issue Nick raised
regarding what to do when one is willing to accept a model where the COV step fails
but long runtimes prohibit performing the nonparametric bootstrap procedure to
generate 500-1000 sets of parameters from the empirical posterior distribution.  If
the runtimes are not too long such that 50 bootstraps are feasible, then I would
advocate calculating the sample covariance matrix and assume the multivariate normal
distribution holds to generate 1000 sets of population parameters to take into account
parameter uncertainty.  Basically, I think it is better to do something to take into
account parameter uncertainty than to do nothing at all...we don't really know when
parameter uncertainty may or may not be important unless we account for it and see
how much impact it has.  Nevertheless, I'm sure that situations arise where runtimes
are so long that even performing 50 bootstraps is prohibitive.

Ken
_______________________________________________________

```