From: "Gordi, Toufigh" Toufigh.Gordi@cvt.com Subject: [NMusers] Simulation vs. actual data Date: Tue, June 14, 2005 2:27 pm Dear all, I have a trivial question I hope I can get help with. I have data from different doses of a compound and want to show a plot of concentration mean (+/-SD) values vs. time and a fit of the model curve that, hopefully, goes through the values. To get the model fitting curve, I plan to simulate a typical subject per dose group receiving doses used in the study. My question is what simulated values to use for the curve. A population curve does not seem to be a good choice: the observed data contains known and unknown errors, whereas a population curve is the true value of the population and thus the fit might not include all the observed values. The simulated IPRED or DV for a single subject per dose group is certainly not the curve either. The only way I can think of is to simulate a large number of subjects per dose group and generate a mean curve from the simulations. If this is the correct way, I would use the simulated DV values to get the mean values, since DV values include all the errors associated with the real observations and should give the best curve of the observed data. Any comments? Toufigh Gordi _______________________________________________________ From: "Nick Holford" n.holford@auckland.ac.nz Subject: Re: [NMusers] Simulation vs. actual data Date: Tue, June 14, 2005 2:48 pm Toufigh, I think you are describing a visual predictive check i.e. simulate from your final model with say 1000 subjects, then calculate the average and 90% empirical prediction interval from the resulting 1000 predictions at each time point. You can plot the average and 90% PI over time and overlay the observed concentrations. About 90% of your observations should lie within the 90% PI. The average vs time can be considered the population behaviour while the 90% PI shows the extent of variability including random effects in parameters and random effects due to residual error. If you think you have linear kinetics you can dose normalize the observed concentrations to some standard dose and do the simulation with this standard dose. This let's you use one plot to display the results and expected profile for all doses you have studied. Nick _______________________________________________________ From: "Liping Zhang" ZHANG_LIPING@lilly.com Subject: Re: [NMusers] Simulation vs. actual data Date: Tue, June 14, 2005 2:57 pm I totally support your thoughts. If you want, other than the mean, you can also get the confidence band (or like you said +/-SD) around the simulated curve. Depend on how precise and what the simulation will be used for, you can elaborate the simulation": you can simulate use the estimated typical structure parameter value, with inter-subj variation and random error, so a large number of subjects within one simulation will do. The simulation does not account the uncertainty in thetas and the band reflects inter-subject variability and the size of random error. You can also simulate different populations. For each simulation, first sample the parameters from their variance-covariance metrics, and then simulate using the sampled subjects. You then do somekind of summary statistic on the simulated results. It needs large number of simulations and large number of subjects within each simulation. It is a labor-intensive process but gives you a more realistic view of potential outcomes. Best regards, Liping Liping Zhang, PhD Global PK/PD Modeling and Trial Simulation ELi Lilly & Company Work: 317-277-8687 Fax: 317-433-6661 DC 0734 Email: zhang_liping@lilly.com _______________________________________________________ From: "Kowalski, Ken" Ken.Kowalski@pfizer.com Subject: RE: [NMusers] Simulation vs. actual data Date: Wed, June 15, 2005 4:49 pm NMusers, We seem to be using terms like confidence bands and prediction intervals somewhat loosely and interchangeably. To provide some clarity here are some definitions I use for the different types of statistical intervals that are often constructed for different purposes: Tolerance Interval - If we simulate DVs from our model and use the mean +/- some multiple of the SD (across subjects at a particular time point) or use the percentile method to obtain a lower and upper bound (of the individual responses at a particular time point), the resulting interval is something akin to what is known in the statistical literature as a "tolerance interval". Here the interest is in characterizing the interval that contains a certain percentage of the individual observations. However, to be a truly valid tolerance interval, such an interval should also take into account the uncertainty in the parameter estimates and a confidence level is associated with the interval. Tolerance intervals are used to makes statements like "I'm 90% confident that the interval (LTL, UTL) will contain 80% of the individual observations (i.e, if we repeat the study an infinite number of times, 90% of the intervals should contain 80% of the individual observations)." Prediction Interval - If we simulate DVs from our model and calculate the mean (across subjects) conditioning on some specific design (with a specified number of subjects, n) and repeat this process for N simulated datasets where we use a different set of population estimates based on the parameter uncertainty for each of the N simulated datasets, then the grand mean (of the N means) +/- some multiple of the SD (across the N means) or use the corresponding percentile method to obtain a lower and upper bound, the resulting interval is a prediction interval on the future mean response of n subjects. A valid prediction interval takes into account the parameter uncertainty as well as the sampling variation in Omega (sampling of subjects) and Sigma (sampling of observations). Prediction intervals are used to make statements like "I'm 90% confident that the interval (LPL, UPL) contains the future mean response of n subjects (i.e., if we repeat the study an infinite number of times, 90% of the intervals will contain the future mean response of n subjects)." Confidence Interval - If we simulate DVs from our model in a similar fashion as for the prediction interval but where we choose n to be infinitely large in computing the mean across the n subjects then we are effectively averaging out the sampling variation in Omega and Sigma and the resulting interval only reflects the uncertainty in the parameter estimates of the model. Confidence intervals are used to make statements like "I'm 90% confident that the interval (LCL, UCL) contains the true population mean response (i.e., if we repeat the study an infinite number of times, 90% of the intervals will contain the true population mean response). In general confidence intervals have the shortest width followed by prediction intervals and then tolerance intervals. However, prediction intervals can be wider than tolerance intervals when n (number of future subjects) is small. For example, when n=1 where we want to predict the value of a future single observation the prediction interval is typically wider than a tolerance interval. For more information on different types of statistical intervals, see Hahn, "Statistical Intervals for a Normal Population, Part I. Tables, Examples and Applications", J. of Quality Technology, 1970; 2:115-125. I agree with Liping that taking into account parameter uncertainty by assuming that the parameters estimates come from a multivariate normal distribution using the population estimates and corresponding covariance matrix of the estimates can be labor-intensive. However, this does not mean that it is computationally-intensive, in fact quite the contrary. One can generate an N=1000 sample of population parameters (thetas, omegas, and sigmas) from the multivariate normal distribution in a matter of minutes even for mean parameter vectors and covariance matrices of the dimensions typical of a pop PK or pop PK/PD model. The laborious aspect of the work comes from the fact that we don't have automated utilities to do this work and so we have to do a lot of custom coding (pre- and post-processing) to generate the sample parameter estimate vectors (thetas, omegas, and sigmas), pass them in to NONMEM to perform the simulations, and then post-process the simulated results to calculate the various intervals of interest. However, the process can also be computationally-intensive if the distribution of the parameter estimates does not follow a multivariate normal distribution. In this setting we may have to perform non-parametric bootstrapping (sample with replacement of subjects from the observed dataset) to get the N=1000 sample of population parameters from the empirical bootstrap distribution from fitting the model to each of 1000 bootstrap datasets. Unless, I'm already performing nonparametric bootstrapping for other purposes, I typically assume the multivariate normal distribution when taking into account parameter uncertainty simply because it is computationally less intensive. My philosophy is that it is better to do something to take into account parameter uncertainty rather than to completely ignore it. Ken _______________________________________________________ From: "Nick Holford" n.holford@auckland.ac.nz Subject: Re: [NMusers] Simulation vs. actual data Date: Sat, June 25, 2005 2:53 am Ken, Thanks you for offering some definitions for various statistical intervals. I offered the following to Toufigh to describe what I have been calling a prediction interval. "There are two ways to define distributions. The most common is to assume a theoretical distribution e.g. normal. If you make this assumption you can then use the estimates of the distribution parameters to predict intervals which include 90% of the distribution e.g. 1.65*StDev. Note that you make a big assumption that the values are normally distributed and in practice this is often not the case e.g. if your clearance is log normally distributed then the distribution of predicted Css will not be normally distributed. I prefer not to make the normal assumption. An alternative is to use an empirical distribution based on data. By simulating 1000 observations at a particular time you have an empirical distribution (emp_dist) of the predicted values. If you find the percentile(emp_dist,0.05) and percentile(emp_dist,0.95) then you define the empirical 90% prediction interval. I use the percentile() function in Excel to do this." It seems that the method I describe comes closest to what you call a tolerance interval i.e. generate an empirical distribution by simulating DVs from the model using the design of the data used to estimate the model parameters. The method might be called a degenerate tolerance interval (following the terminology of Yano, Beal & Sheiner 2001) because it does not account for uncertainty in the parameters. Would you agree? I am not clear about what the distinction is between your tolerance and predictive intervals. It seems that predictive intervals are obtained with "specific designs" perhaps different from the original data. Is this the only difference between a tolerance and a predictive interval ie. the design used to generate the simulations? Confidence intervals seem to be the same as tolerance intervals but use a large number of simulated DVs to get some kind of asymptotic behaviour. You don't say how many replications are used to generate tolerance and predicitive intervals. Am I correct in saying that confidence intervals and tolerance intervals differ only in the number of replications used to generate the empirical distribution? Does the method I describe to Toufigh correspond more to a confidence interval than a tolerance interval because of the large number (1000) replications? Yano Y, Beal SL, Sheiner LB. Evaluating pharmacokinetic/pharmacodynamic models using the posterior predictive check. J Pharmacokinet Pharmacodyn 2001;28(2):171-92 -- Nick Holford, Dept Pharmacology & Clinical Pharmacology University of Auckland, 85 Park Rd, Private Bag 92019, Auckland, New Zealand email:n.holford@auckland.ac.nz tel:+64(9)373-7599x86730 fax:373-7556 http://www.health.auckland.ac.nz/pharmacology/staff/nholford/ _______________________________________________________ From: "Kowalski, Ken" Ken.Kowalski@pfizer.com Subject: RE: [NMusers] Simulation vs. actual data Date: Tue, July 5, 2005 12:30 pm Hi Nick, Yes, the method you describe comes closest to what I describe as a tolerance interval in that you are trying to characterize the variability (distribution) of individual responses. Whereas confidence intervals try to characterize the variability (uncertainty) in the population mean response. In the univariate normal case the distinction boils down to whether you construct intervals of the form Xbar +/- k*SD or Xbar +/- k*SE (where SE = SD/sqrt(n)) where the former characterizes the variability in the individual observations (SD) while the latter characterizes the variability or uncertainty in the mean response (SE). If Yano et al. define a "degenerate" tolerance interval as one that does not attach any confidence statement to the interval because parameter uncertainty is not taken into account I suppose I can accept that definition. Prediction intervals are more closely related to confidence intervals in that we are often trying to characterize uncertainty in some kind of mean response (not necessarily the population mean response) rather than the distribution of individual responses. The mean response in this case is conditioned on some design or sample of a fixed number of future (predicted) observations. When this fixed number of predicted observations approaches infinity then the mean we are characterizing is the population mean and the prediction interval (for a mean of an infinite number of predicted observations) collapses to a confidence interval on the population mean. When the fixed number of observations is small, then the distribution of the mean of n predicted responses will be considerably wider than the distribution of the population mean prediction. Moreover, when n=1 then the prediction interval is on a single observation which can be considerably wider than tolerance intervals as I noted in my previous message. I hope this helps clarify the distinctions between these types of intervals. Ken _______________________________________________________ From: "Nick Holford" n.holford@auckland.ac.nz Subject: Re: [NMusers] Simulation vs. actual data Date: Tue, July 12, 2005 1:50 pm Ken, Thanks for your further elaboration. Your boiled down example is perhaps somewhat over simple for nmusers because I think we need to consider 3 kinds of random effects when constructing intervals for model qualification and prediction. 1. Population Parameter Variability (PPV). The sum of between (BSV) and within (WSV) subject variability in model parameters such as Emax and EC50. 2. Residual Unidentified Variability (RUV). The residual error describing the difference between individual subject predictions and observations. 3. Parameter Uncertainty. This is often described by the parameter standard error but might be more reliably described by an empirical distribution of parameter estimates obtained by a non-parametric bootstrap. Your example only includes one level of random effect (which could be either PPV or RUV depending on the context) and derives intervals by the commonly used asymptotic assumption. The kind of parametric simulation procedure that nmusers might use has been described by Yano et al. (2001) in their investigation of the posterior predictive check (PPC). As in your own example this study was limited to only one level of random effect (they used RUV) and examined what might be learned from including parameter uncertainty. They concluded "No clear adûantage for one or another method of approximating the posterior distribution on model parameters is found." i.e. including uncertainty in parameters did not improve the power of the PPC to aid in evaluating model performance. Dropping the 'posterior' i.e. the uncertainty in the parameter distribution and just sampling from what they call the 'degenerate posterior' may be adequate. Yano et al discuss why this might be the typical case for many NONMEM type analyses. On the basis of this rather limited study one should be aware that in other situations the inclusion of uncertainty could be quite important but the relevance to typical NONMEM analyses is currently unclear. At least in some settings the simple predictive check (SPC; Gobburu et al. 2000) which uses the final point estimates of the parameters (the degenerate posterior distribution) for simulation without including uncertainty can give useful diagnostic information. In trying to use a consistent terminology for the various intervals used to describe the time course of response I wonder if you would accept the following: Confidence Interval: Describes the uncertainty in the mean response. It could be constructed by a non-parametric bootstrap and using the resulting parameters for each bootstrap run to predict the population response at each time point. The distribution of these population responses obtained from say 1000 bootstrap runs can be used to define the confidence interval. The confidence interval says nothing about PPV or RUV but reflects only the estimation uncertainty in the population parameters. I am not aware of any published use of this kind of interval applied to NONMEM analyses but would be interested to hear of this application. Prediction Interval: Decribes the variation in individual response which is attributable to PPV and RUV. It may be obtained by a parametric bootstrap (e.g. using $SIM with NSUBPROBLEMS=1000 in NONMEM) based on the final parameter estimates for the model that is being evaluated. A 90% interval constructed from the empirical distribution of individual predicted responses (with residual error) at each time should contain 90% of the observed responses at that time. This is the interval and method that is used for the visual predictive check (VPC; Holford 2005). The procedure is the same as the SPC and degenerate PPC. It has been frequently referred to as a posterior predictive check (e.g. Duffull 2000). Tolerance Interval: Describes the uncertainty in the prediction interval by including uncertainty in the parameter estimates. This could be done using the same procedure as the SPC but sampling from the covariance matrix of the estimate in addition to the variance-covariance matrix for OMEGA and SIGMA. I am not aware of anyone who has done this with NONMEM with both PPV and RUV but would be interested if someone could report any such experience. In their definition of PPC, Yano et al. did not include the generation of an interval: "The PPC compares a statistic (T) computed on the observed data to the distribution of that statistic under a candidate model fitted to the data to derive a p value, which we denote by pPPC." However, it is implicit in their methodology for calculating the probability of a response (pPPC). I would suggest that it might be better if the term PPC was restricted to the case where parameter uncertainty is included in the simulation process because this explicitly recognizes the role of the non-degenerate posterior distribution. I think that an interval which describes the variability in individual responses (prediction interval) is more commonly of interest than variability in the population response (confidence interval). A tolerance interval has some theoretical advantage over the prediction interval by being a bit more conservative (i.e. wider intervals) but most of the merits of this kind of model qualification approach will be seen in the computationally more convenient prediction interval. It is directly applicable for evaluating the performance of a model to describe existing observations and for illustrating to non-pharmacometricians what might be expected in a typical patient population. The nomenclature could stand some improvement so that we can use these terms more precisely. Nick Duffull SB, Chabaud S, Nony P, Laveille C, Girard P, Aarons L. A pharmacokinetic simulation model for ivabradine in healthy volunteers. Eur J Pharm Sci 2000;10(4):285-94 Gobburu J, Holford N, Ko H, Peck C. Two-step model evaluation (tsme) for model qualification. In: American Society for Clinical Pharmacology and Therapeutics. Annual Meeting; 2000; Los Angeles, CA, USA; 2000. Abstract. Holford NHG. The Visual Predictive Check – Superiority to Standard Diagnostic (Rorschach) Plots PAGE 2005. http://www.page-meeting.org/default.asp?id=26&keuze=abstract-view&goto=abstracts&orderby=author&abstract_id=738 Yano Y, Beal SL, Sheiner LB. Evaluating pharmacokinetic/pharmacodynamic models using the posterior predictive check. J Pharmacokinet Pharmacodyn 2001;28(2):171-92 _______________________________________________________ From: "Perez Ruixo, Juan Jose [PRDBE]" JPEREZRU@PRDBE.jnj.com Subject: RE: [NMusers] Simulation vs. actual data Date: Tue, July 12, 2005 2:52 pm Dears, Interesting discussion. For the tolerance interval, I think it would be more accurate to sample (THETAs, OMEGAs and SIGMAs) from the non-parametric bootstrap replicates. For each replicate, NSUBPROBLEMS=100 should be implemented. Then, the 90% tolerance interval can be calculated from the empirical distribution of all individual predicted responses (with residual error) at each time. That's how I implemented it previously for a simulation exercise (see reference below). The tolerance interval constructed as described above should contain 90% of the observed responses at each time. In my experience, the prediction interval as Nick described could be too narrow to include the 90% of the observed response at each time, specially if uncertainty is high (relative to PPV and RUV). Reference: Jolling K, Perez-Ruixo JJ, Hemeryck A, Vermeulen V, Greway T. Mixed-effects Modelling of the Interspecies Pharmacokinetic Scaling of Pegylated Human Erythropoietin. Eur J Pharm Sci 2005; 24: 465-475. Regards, Juanjo. _______________________________________________________ From: "Nick Holford" n.holford@auckland.ac.nz Subject: Re: [NMusers] Simulation vs. actual data Date: Tue, July 12, 2005 3:59 pm Juanjo, Thanks for your comments and for the reference to your paper on pegylated EPO PK. In this paper you describe the process for generating a tolerance interval as follows: "In addition, the final model was used to simulate the pharmacokinetic profile of PEG-EPO after i.v. and s.c. administration of a single dose of 1000 ug to 100 male subjects...Uncertainty in fixed and random parameter estimates was considered during the simulations by replicating the above-mentioned process 30 times using different values of fixed and random parameters, randomly selected from the estimates of the bootstrap replicates." If I understand you correctly this means simulating profiles in 3000 subjects with each block of 100 subjects having a different set of fixed and random effect parameters sampled from the non-parametric bootstrap. You then used the empirical distribution of 3000 responses at each time point to construct the interval. The experience you mention below agrees with the theoretical expectations I mentioned for the tolerance interval compared with the prediction interval but I cannot find anything in your paper to support your statement experimentally. Your Figure 4 shows 90% tolerance intervals for extrapolations to humans from non-human species. It has no observations for a visual predictive check. Did you try generating tolerance and prediction intervals for the species you had data for? If so, what fraction of observed values would lie within the tolerance interval you generated and what fraction would lie within a prediction interval generated with a single set of fixed and random effects parameters e.g. using the final model parameter estimates? I am interested in getting a concrete example of just how much different the tolerance and prediction intervals might be for a real NONMEM analysis. On another topic - I note your final model used an empirical allometric model using a combination of total weight and brain weights. The brain weights were in fact just total weight multiplied by an empirical adjustment factor and not truly individual brain weights. This makes it impossible to assert that brain weight itself is an independent covariate for the prediction of clearance. As there is no plausible mechanistic reason to believe that the brain is an important clearance organ for EPO I would think the apparent benefit of adding 'brain weight' to the model is an illustration of selection bias (Ribbing J, Jonsson EN. Power, Selection Bias and Predictive Performance of the Population Pharmacokinetic Covariate Model. Journal of Pharmacokinetics and Pharmacodynamics 2004;31(2):109-134) You did not report results for a simple allometric model based on body weight alone with the theoretical exponents of 3/4 for CL and Q and 1 for V1 and V2. xYou report a reference model with estimated allometric exponents. Did you try using a simple allometric model and test if the CI for the exponents were different from the theoretical values? Nick _______________________________________________________ From: "Perez Ruixo, Juan Jose [PRDBE]" JPEREZRU@PRDBE.jnj.com Subject: RE: [NMusers] Simulation vs. actual data Date: Wed, July 13, 2005 7:40 am Nick, Your interpretation of the process described to generate a tolerance interval is correct. In theory, I think it's better approach to sample from a non-parametric bootstrap distribution than just "sampling from the covariance matrix of the estimate in addition to the variance-covariance matrix for OMEGA and SIGMA". I agree it's easier to use the prediction interval than the tolerance interval. Also, the theoretical advantage of the tolerance interval in some cases could not be relevant (low uncertainty), but in some other it could be the only choice (high uncertainty). My suggestion is to test (on case by case basis) both intervals, and if both give similar results then move forward with the prediction interval, otherwise use the tolerance interval. At risk of getting me out of the tolerance interval of my wife :-), I did some simulations at home to answer your questions. However, I used other datasets where I had a model with between and within subject variability in several parameters. The model was built with the dataset A, and the 80% prediction and tolerance intervals were calculated for each concentration in the dataset B. The prediction and the tolerance intervals contained 81.4% and 79.4% of the observations, respectively. Probably both numbers are very similar because RSE for fixed and random effects are lower than 15% and 40%, respectively. Therefore, in this case (low uncertainty) I would continue the simulation and/or evaluation work without considering the uncertainty. At this stage, I don't have any example where I can show you the tolerance interval is superior to prediction interval in terms of predictive performance. Perhaps, anyone else may have it and would be very interesting to see the results and the consequences. One interesting thing I learned from this exercise is that tolerance interval can be narrower than prediction interval. One potential reason is that the estimates for two random effects fall in the upper part of the non-parametric bootstrap distribution for the same parameter, but below the 90% confidence interval. So, when the uncertainty is considered, more subproblems are simulated with lower variability and as a consequence, the tolerance interval is narrower. Finally, I also wonder if anyone in the nmuser list would like to share any experience with prediction/tolerance intervals for categorical data. I guess the way to calculate those intervals is a bit more complex With respect to your comment on the allometric scaling, as you know allometric model are empirical and not all equations relate directly to physiology. In fact, body weight and brain weight has been commonly used to predict the clearance of drugs in humans (Mahmood I, et al. Xenobiotica 1996). In particular, body weigh and brain weight has been recently used to predict from animal to human the clearance of protein drugs, such rhuEPO and EPO-beta (Mahmood I. J Pharm Sci. 2004). I agree that brain is not an important clearance organ for EPO, however brain weight was tested on the basis of Sacher equation, which relates body weight and brain weight to the maximum lifetime potential (MLP). MLP is a measurement of the chronological time for a particular species, necessary for a particular physiological event to occur. The shorter the MLP, the faster the biological cycles occur. One may for instance consider the drug elimination as the physiological event to occur and, then MLP (or brain weight) could be used to explain the difference in drug clearance across species with similar body weight. In fact, the brain weight in rabbit (0.56% of body weight) is lower than the brain weight in monkey (1.80% of body weight). So, given the same body weight for both species (see figure 3 of the paper), then MLP in rabbits is shorter relative to monkeys (0.76 x 105 h versus 1.62 x 105 h) and, therefore, the PEG-EPO clearance in rabbits is faster as compared to monkeys. The reference model we reported is a simple allometric model based on body weight alone. From the RSE, you can see that 95%CI were not different from the theoretical value. Even in the final model you can derive the "real" exponent of body weight for CL. In order to do that, it is needed to consider the effect of brain weight because of its proportionality to body weight, within a particular species. Therefore, 1.030 (apparent exponent of weight) cannot be directly compared to 0.75, without taking into account the exponent of brain weight. Doing so, 1.030 - 0.345 = 0.685 is obtained as the "real" exponent of weight, which is very similar to the expected 0.75. I understand the real exponent of body weight is 0.75. Regards, Juanjo. _______________________________________________________ From: "Kowalski, Ken" Ken.Kowalski@pfizer.com Subject: RE: [NMusers] Simulation vs. actual data Date: Thu, July 14, 2005 1:22 pm Nick, To make distinctions between the 3 different types of statistical intervals we only need consider two major classes of variability: sampling variability and parameter uncertainty. Here sampling variability can have multiple sources of variation due to sampling of subjects (BSV), different occasions within a subject (WSV), and sampling of observations (RUV). Moreover, parameter uncertainty can be multivariate and can be thought of as the trial-to-trial variation in that we would get a different set of population parameter estimates each time we repeat the trial. I still like my simple univariate normal mean model example which has only one source of sampling variability and one parameter to illustrate the differences in these intervals. In the univariate normal example the sampling variability is estimated as the sample standard deviation (SD), the population mean parameter is estimated by the sample mean (Ybar), and the parameter uncertainty is the standard error of the mean (SE=SD/sqrt(N)). With these definitions I'll take another stab at defining the general forms of the 3 types of statistical intervals for the univariate normal mean model case and draw analogies to the more complex population modeling setting. Degenerative Tolerance Interval: Ybar +/- k*SD. This is the form of a degenerate tolerance interval where we assume the population mean (mu) and standard deviation (sigma) are known without uncertainty given by the estimates Ybar and SD, respectively. It is degenerative in that we put all the probability mass on our point estimates (Ybar and SD) where a degenerative distribution is one in which the distribution has only one value with probability=1. The inference for this interval is on the individual Y's since the SD is the standard deviation of the individual Y's. If we assume the Y's are normal with Ybar and SD known (without uncertainty), we know from our first course in probability and statistics that Ybar+/-1*SD, Ybar+/-2*SD, and Ybar+/-3*SD contain approximately 68, 95, and 99.7% of the individual Y values, respectively. Degenerative Tolerance Interval: Population PK Example. To draw an analogy to population modeling and construction of degenerative tolerance intervals consider the following example. Suppose we conduct a single dose PK study with N=12 healthy subjects at each of 5 doses with dense sampling of n time points (observations) per subject. If we fit a population PK model to estimate thetas, Omega, and Sigma, then use the model (and estimates without uncertainty) to simulate individual Y values for say 1000 hypothetical subjects at each dose for each time point we can then calculate Ybar and SD at each dose/time point across the 1000 hypothetical subjects and construct degenerative tolerance intervals of the form Ybar +/- k*SD. However, typically what we do is compute Ybar as our point estimate and construct an asymmetric tolerance interval using the percentile method (i.e., instead of using +/- k*SD to determine the endpoints of the interval the appropriate lower and upper quantiles from the empirical distribution of the simulated Y values are used). Prediction Interval for a Single Future Observation: Ybar +/- k*SD*sqrt(1+1/N). To calculate the variability in a single future observation (Ynew) we add the sampling variance (SD^2) for the new observation and parameter uncertainty variance (SE^2) to obtain Var(Ynew) = SD^2 + SE^2 = (SD^2)*(1+1/N) and hence SD(Ynew) = SD*sqrt(1+1/N). Note that when we ignore parameter uncertainty to obtain a degenerative prediction interval or when N is very large such than SE is near 0 the resulting interval collapses to the same as a degenerative tolerance interval since SD(Ynew) = SD in this setting. This may in part explain some of the confusion we have with the terminology. The distinctions between these intervals collapse when we don't take into account parameter uncertainty. Prediction Interval for a Single Future Observation: Population PK Example. To draw an analogy to population modeling using the population PK example I just described, consider taking into account parameter uncertainty by assuming the posterior distribution of the parameter estimates follow a multivariate normal distribution with mean vector of parameter estimates and covariance matrix of the estimates from the COV step output. Suppose now we obtain 1000 sets of population parameters (thetas, Omegas, and Sigmas) from a parametric simulation from this multivariate normal distribution. Furthermore, if we simulate a single hypothetical subject's profile for each set of 1000 population parameters we now have a set of 1000 individual Y's at each dose/time point. If we calculate the SD(Ynew) across the 1000 individual values we have taken into account both sampling variation (Omegas and Sigma) and parameter uncertainty (multivariate normal posterior distribution). The resulting SD(Ynew) will be larger than the SD calculated for the degenerate tolerance interval example. However, suppose for the 1000 set of population parameters we set them all to the same estimates obtained from the PK model fit (i.e., ignore parameter uncertainty). If we now simulate a single hypothetical subjects's profile for each of the 1000 sets of exactly the same population parameters the degenerative prediction interval results in SD(Ynew) = SD since it only takes into account the sampling variation in Omega and Sigma the same as for the degenerative tolerance interval. Prediction Interval for a Mean of M Future Observations: Ybar +/- k*SD*sqrt(1/M + 1/N). To calculate the variability in the mean of M future observations (Ybar-new = Sum(Ynew-i)/M, i=1,...,M) we note that the sampling variance for Ybar-new is (SD^2)/M. To obtain the variance of Ybar-new taking into account this sampling variance as well as the uncertainty in Ybar we have that Var(Ybar-new) = (SD^2)/M + (SE^2) = (SD^2)*(1/M + 1/N) and hence, SD(Ybar-new) = SD*sqrt(1/M + 1/N). The inference for this statistical interval is on the mean of a fixed number (M) of future values rather than on individual observations. However, when M=1, this interval reduces to the prediction interval on individual observations. When we perform a PPC where the statistic, T, is a mean of the individual observations we are usually conditioning on some fixed known value of M. Prediction Interval for a Mean of M Future Observations: Population PK Example. Going back to the population PK example, suppose we want to perform a PPC where we use our single dose population PK model to predict the mean PK profile where we have M=12 subjects per each of 5 dose groups (i.e., an internal PPC where we condition on the design of the study that we used to develop the model). To perform the PPC taking into account parameter uncertainty we obtain 1000 sets of population parameters (thetas, Omegas, and Sigmas) sampling from the multivariate normal posterior distribution. Our statistic (T) of interest is the mean of the Y's for each dose/time point across the M=12 subjects. Thus, for each of the 1000 sets of population parameter values we simultate M=12 individual Y's for each dose/time point. We then calculate the mean across the M=12 individual Y values where T = Ybar-new, and we then have 1000 samples of the posterior distribution of Ybar-new. If we calculate the sample standard deviation of the means, i.e., SD(Ybar-new), we can construct a prediction interval for Ybar-new and overlay the observed statistic, Ybar (N=12). In practice we typically use the percentile method to construct the prediction interval and specifically for PPC we might show the histogram for the posterior predictive distribution of the statistic and overlay the observed statistic as a reference line on the histogram. If we don't take into account parameter uncertainty then the resulting prediction interval and PPC can be thought of as degenerative and will necessarily result in a narrower distribution of the statistic. Confidence Interval on the Population Mean: Ybar +/- k*SD*sqrt(1/N) = Ybar +/- k*SE. The confidence interval on the population mean can be thought of as a prediction interval where M is infinitely large such that 1/M goes to zero. In this setting the only uncertainty that is reflected is the parameter uncertainty in Ybar (i.e., SE). Confidence Interval on the Population Mean: Population PK Example. To illustrate the construction of a confidence interval using the population PK example we can repeat the process as above for the prediction interval for M observations where we take M to be arbitrarily large, say M=2000 hypothetical subjects. Again we simulate 1000 sets of parameters values from the multivariate normal posterior distribution and for each set we simulate M=2000 individual Y's for each dose/time point. We then compute the sample mean (Ybar-pop) of M=2000 subjects for each of the 1000 sets of population parameters. We can then construct a confidence interval for the population mean by computing the sample standard deviation of the set of 1000 Ybar-pop means (averaged over M=2000 hypothetical subjects) or use the percentile method. Note that in the simulations we take into account the sampling variation (Omegas and Sigmas) of the individual Y's but when we average over a sufficiently large M we effectively average out this sampling variation such that the only uncertainty reflected in SD(Ybar-pop) is due to parameter uncertainty. Also note that the population mean (Ybar-pop) and the typical individual prediction (Yhat | ETAs=0) are not the same thus, if we don't take into account the sampling variation and instead calculate the typical individual prediction for each of the 1000 sets of population parameter values, the resulting confidence interval of the mean of the typical individual predictions is not the same as the confidence interval of the population mean prediction. This is a particularly important distinction when working with noncontinuous models. With regards to whether we should take into account parameter uncertainty, I agree that there are many instances where parameter uncertainty is inconsequential relative to sampling variation in constructing tolerance and prediction intervals from our population models. Particularly, when all of the parameters and variance components are well-estimated with good precision (small standard errors). The problem comes in when we fit models to "small-ish" datasets or datasets that cover a limited range (e.g., limited range of doses or exposures) such that one or more of the parameters are estimated with poor precision (large standard errors). Given the complexity of our population models and the multivariate nature of the posterior distribution it is difficult to always know when parameter uncertainty is large relative to sampling variation and should be accounted for when constructing tolerance and prediction intervals. I know of no other way to truly assess the importance of parameter uncertainty other than to do it both ways (with and without parameter uncertainty). Of course, if you gone through the trouble of doing it both ways the purist in me says to report out the interval that includes parameter uncertainty even if it doesn't make much difference. If you are willing to assume that the posterior distribution of the parameters is multivariate normal with mean vector corresponding to the parameter estimates and covariance matrix of the estimates from the COV step output, then parametric simulations taking into account parameter uncertainty are not computationally burdensome to perform. However, taking into account parameter uncertainty from a parametric sampling from the multivariate normal to construct tolerance and prediction intervals may require additional custom coding to do so since we don't have automated utilities that do this...at least not that I'm aware of. On the other hand, if you are unwilling to make the parametric assumption that the posterior distribution is multivariate normal and thus are faced with performing nonparametric bootstrapping to construct an empirical posterior distribution as Juanjo recommends, then this will be considerably more computationally burdensome because of the need to re-estimate the parameters for each bootstrap dataset. Nevertheless, if one is planning to go ahead and perform nonparametric bootstrapping anyway to construct bootstrap confidence intervals for the parameter estimates, then you've already done the hard work of generating the empirical posterior distribution of the parameters so why not use it to account for parameter uncertainty when constructing tolerance and prediction intervals? One of the reasons that I like to take into account parameter uncertainty when performing a PPC is that when I do a degenerative posterior predictive check, the empirical distribution may underestimate the variation in the statistic (T) such that the observed statistic may fall outside the distribution. If parameter uncertainty is substantial, taking it into account will result in a wider empirical distribution of the statistic such that the observed statistic may now lie reasonably within the distribution and thus I might be more willing to accept the model as having adequate predictive performance for this statistic. My apologies for such a long message and kudos to everyone who read it this far. Ken _______________________________________________________ From: "Perez Ruixo, Juan Jose [PRDBE]" JPEREZRU@PRDBE.jnj.com Subject: RE: [NMusers] Simulation vs. actual data Date: Thu, July 14, 2005 7:31 pm Ken, Thanks for your well detailed explanation. I understand to get N sets of population parameters (thetas, Omegas, and Sigmas) from a non-parametric BS in order to simulate N clinical studies is computationally burdensome, specially for large N (let say 2500). I think the key question is how large N should be to account properly for the uncertainty. The purist way says that we should run SUBPROBLEM=1 for each set of population parameters. However, if a smaller N is able to account properly for the uncertainty then the computational burdensome is less. Suppose, 50 non-parametric BS replicates are good enough to account for parameter uncertainty, then would you consider to use 50 sets of population parameters and run SUBPROBLEM=50 in order to get 2500 trial replicates? Regards, Juanjo. _______________________________________________________ From: "Nick Holford" n.holford@auckland.ac.nz Subject: Re: [NMusers] Simulation vs. actual data Date: Thu, July 14, 2005 10:09 pm Ken, Thanks for your comprehensive and comprehensible response. I distinguished 2 levels of random effects (PPV and RUV) because there are two kinds of interval that are of interest depending on what random effects are used: a) PPV only -- the 'true' variability in response independent of RUV b) PPV and RUV -- the variability in observations of the response Type a) is what would want to see for a prediction of the true response while type b) is what you want to see when comparing the preidiction with an observation. I can understand almost everything you say except: 1. What does M mean with respect to future observations? Your example suggests that M is the number of subjects rather than the number of observations. But perhaps you mean M is the number of observations conditioned on time and design within a single trial? 2. Why do you use 'degenerative' instead of 'degenerate'? I agree that the non degenerate tolerance interval is preferable over the degenerate case but the dilemma comes when NONMEM spins the dice and won't reveal the results of $COV and the runtimes preclude a non-parametric bootstrap (a common occurrence for me). In this case the degenerate tolerance interval (DTI) seems to be helpful. If approx 90% of observations lie within the 90% DTI then one has a computationally easy way to confirm model performance as a description of the data. In my rather limited experience of visual predictive checks it is common to see more than 90% of observations lying within the 90% DTI which points to a model failure to capture the random effects properly. I have yet to see an example of more than 10% of the data lying outside the 90% DTI which is what you would expect if uncertainty is high. Nick _______________________________________________________ From: "Kowalski, Ken" Ken.Kowalski@pfizer.com Subject: RE: [NMusers] Simulation vs. actual data Date: Fri, July 15, 2005 8:28 am Nick, You wrote: >1. What does M mean with respect to future observations? Your example suggests that M is the number of subjects rather than >the number of observations. But perhaps you mean M is the number of observations conditioned on time and design within a >single trial? Yes, in my example it is the number of observations conditioned on time and dose such that I'm averaging one observation per subject and hence M also refers to the number of subjects (it is our primary sampling unit in population modeling that we want to average over). Note instead of means of observations as my statistic, T, of interest, I could have used another statistic such as a noncompartmental estimate of AUC which is based on n observations (time points) per subject. In which case I have M=12 subject's AUCs that I would average (e.g., geometric mean) as my PPC statistic. >2. Why do you use 'degenerative' instead of 'degenerate'? My apologies, it should be 'degenerate'. Ken _______________________________________________________ From: "Kowalski, Ken" Ken.Kowalski@pfizer.com Subject: RE: [NMusers] Simulation vs. actual data Date: Sat, July 16, 2005 9:56 am Juanjo, If you perform 50 bootstraps but you are going to perform 2500 trial replicates you shouldn't just replicate the 50 bootstrap sets 50 times to get 2500...that's not random. I suppose you could bootstrap with replacement the 50 bootstrap sets to generate 2500 sets of population parameters, however, I wouldn't advise it. I you're going to generate 2500 trial replicates presumably it is because you are interested in getting accurate tail probabilities. 50 bootstraps is not enough to accurately nail down the tail probabilities and a "double bootstrap" procedure to "artificially" generate 2500 bootstrap sets from an original 50 doesn't help (i.e., the 2500 are ultimately based on only 50 unique sets). However, the 50 bootstrap sets are probably good enough to estimate the sample covariance matrix from which you can compare to the covariance matrix of the parameter estimates from the NONMEM output ($COV step). In fact, this would be a good diagnostic to decide whether or not to use the parametric multivariate normal assumption for the posterior distribution when taking into account parameter uncertainty. If the sample covariance matrix from 50 bootstraps is in fairly close agreement to the covariance matrix reported by NONMEM then I would use the parametric multivariate normal distribution using the covariance matrix estimated by NONMEM to account for parameter uncertainty (i.e., generate 2500 unique sets of population parameters from the multivariate normal posterior distribution). On the other hand, if the sample covariance matrix from the 50 bootstraps is substantially different than that estimated by NONMEM, then you may be in a situation where it is not reasonable to assume that the posterior distribution is multivariate normal and you may need to generate considerably more than 50 bootstrap sets from a nonparametric procedure although I don't think you necessarily need 2500 bootstraps...1000 bootstraps is probably sufficient if you want to estimate the 5th and 95th percentiles. Note that when the multivariate normal assumption doesn't hold it is often because the distribution is asymmetric. This is why nonparametric bootstrap confidence intervals on the parameter estimates can be quite asymmetric when they do differ from Wald-based intervals using the NONMEM standard errors. To really nail down this asymmetry in the distribution to get accurate tail probabilities really requires a lot more than 50 bootstraps. So, if you routinely advocate doing nonparametric bootstrapping to account for parameter uncertainty out of concern for this potential asymmetry then I think you've got to bite the bullet and do more than 50 bootstraps. Your suggestion of doing 50 bootstraps prompts me to respond to the issue Nick raised regarding what to do when one is willing to accept a model where the COV step fails but long runtimes prohibit performing the nonparametric bootstrap procedure to generate 500-1000 sets of parameters from the empirical posterior distribution. If the runtimes are not too long such that 50 bootstraps are feasible, then I would advocate calculating the sample covariance matrix and assume the multivariate normal distribution holds to generate 1000 sets of population parameters to take into account parameter uncertainty. Basically, I think it is better to do something to take into account parameter uncertainty than to do nothing at all...we don't really know when parameter uncertainty may or may not be important unless we account for it and see how much impact it has. Nevertheless, I'm sure that situations arise where runtimes are so long that even performing 50 bootstraps is prohibitive. Ken _______________________________________________________