## Topic started by: (Stu Beal) - 1 Feb 1994

>From time to time there has been discussion about the need to "validate" a population analysis. It is fair to say that Sam Vozeh was an "early" exponent of this idea (see the latest NONMEM bibliography: Data Analyses item 36, and Methodological item 21). At NONMEM intermediate level workshops late in 93 the subject came up repeatedly for discussion. The matter was also discussed at the COST conference in November. Below, I present a few ideas for using NONMEM IV for model validation. They are simple and straightforward, and I hope they will help you. I want to thank Mats Karlsson for his review and help with this discussion. In addition, though, please pause for just a few minutes and consider this question. What else, if anything, do you think should be done in the way of model validation, and how might NONMEM be improved in this regard? It may be difficult to answer this question. You might intuit that a more sophisticated approach could be undertaken than what I shall describe, pos- sibly one involving more sophisticated statistics. However, perhaps not being a statistician, or perhaps not having sufficient time to work out the details, you do not know *exactly* what to suggest, and, of course, until one sees an approach fully worked out, one really doesn't know one has a good idea. Some of us at the NONMEM Project Group have already given this matter some thought and believe that model validation may indeed be a tricky business. Still, if you think you have a cogent idea regarding a more sophisticated approach, I'd like to hear from you. If you reply, I'll not be critical; indeed, I'll probably not have the time to offer a critique. I'll acknowledge receipt of your message, and I'll be thankful that you offered something for us to think about. I may also need to dis- cuss something further with you, in which case, I'll contact you directly. On the other hand, most of you are in a position to recognize that some of what I'm about to describe might be implemented more easily with certain changes to NONMEM. Comments of this sort are also welcome and would be appreciated. Do ask yourself first, though, whether what I'm about to describe is so difficult to implement currently that the user can indeed be much better served by changing NONMEM. If you would like to share your thoughts with others on the users net, please of course do so. However, I do not regularly read mail on the net; so to share your thoughts with me too (or with me alone if you prefer), please respond directly to me at stuart@c255.ucsf.edu. ____________________________________________________________________________ INTRODUCTION By model validation, I mean obtaining information about how well a popula- tion model describes a set of data (called the "validation" or "test" set), none of which was used to develop the model itself. The data set used to develop the model is called the "index" or "learning" set, and the model that is developed may be called "the fitted model", or simply "the model". I shall take the term "the model" to mean a model form *together with* a set of values for the model parameters. A validation data set can be entered into a NONMEM problem, as can any index population data set. The parameters values of the model (THETA, OMEGA and SIGMA) can be entered as initial estimates, or an input MSF can be used which contains these values from the earlier problem which fit the model to the index data. (The same MSF can be used with different data sets.) The Estimation Step will not be implemented, so these population parameter values will not change. Predictions and the like will be com- puted using these values. The basic idea, of course, is to see how close model predictions are to the validation data. "Closeness" may be judged in clinical, rather than sta- tistical, terms. That is, the investigator might ask "do the differences I see mean that there can arise some clinical problem if I try to use this model?" It's important to make the distiction between clinical satisfaction and statistical satisfaction. The suggestions made herein are offered in the spirit of helping with the question of clinical satisfaction. They are really just a couple of very simple ones. There are two types of predictions one might make of observations from an individual in a validation set. The first is a prediction made without the benefit of using any of other observations from the indivdual that might be available, i.e. without using "feedback observations". This is the type of prediction made for a patient from which no observations have yet been made, or if they have been made, they are being ignored. It is the "popu- lation prediction", i.e. the one given by the population model using the value eta=0. It is available in a NONMEM table under the PRED heading, and can be included in a scatterplot by using the label PRED. The second type of prediction is made using some observations from the individal ("feedback observations), but different ones from those being predicted. I shall dis- cuss this type of prediction also, and remind you how they may be obtained. I'll call them "feedback predictions". Another term for feedback predic- tions might be "intraindividual predictions". As the number of feedback observations increases, feedback predictions depend increasingly on the feedback observations, and less so on much of the population model, so that the nature of what is being validated changes. POPULATION PREDICTIONS Consider first the population predictions. A simple graphical approach may suffice to examine how close predictions are to the validation data. Of course, if necessary, the numbers themselves can be examined in detail in a table. DV vs PRED is one helpful plot. When generating this plot, remember to include the option UNIT in the $SCATTERPLOT record in order to obtain a line with slope=1 on the scatterplot. Some people may prefer to plot (or plot in addition) RES versus a covariate such as ID or AGE, or perhaps RES vs PRED. A plot like RES vs AGE allows a judgement to be made regarding the (clinical) adequacy of the model for different subpopula- tions. The numbers can be summarized with various statistics, but for clinical purposes, probably as much, if not more, information is presented by these simple pictures as is presented by the statistics. Often the best statistics are pictures. Descriptive statistics that help assess all interesting features of certain pictures may be difficult to formulate; interesting information can easily "get lost" in the summarization. How- ever, I do not mean to suggest that in general this is not a feasible or useful activity. Formal statistics are certainly going to be required when it is important to assign a statistical error level to some inference. Some further discussion about a more formal statistical approach are given below. Another plot to help examine the closeness of predictions to observations is one which must be developed outside of NONMEM, using a graphics package with a table output by NONMEM. This is a plot of the data vs covariate, superimposed on a plot of PRED vs covariate. (There are a number of good graphics packages, and it is not a goal of the NONMEM Project Group at this time to include their capabilities into NONMEM itself. This is not to say that certain NONMEM improvements using printer-plots, which are the type of plots output by NONMEM, could not or should not be undertaken.) With these plots, there will be considerable differences (errors) between predictions and observations, and it may be of some clinical concern to know whether these differences are within the realm of error described by the population model. After all, if the population model does not describe the observed errors well enough with the validation data, then how reliable is the model; might its application with yet a second validation data set yield errors which collectively appear different from those seen with the index set *and* from those seen with the first validation set? However, it is first important to observe how large errors can be. After all, if, for example, errors are smaller than described by the model, but small enough across all data sets, then the model may be unreliable in the way just described, but nonetheless quite adequate. Population standard deviations measure the magnitudes of the errors, where the errors are regarded as arising due to *both* random intraindividual and random interindividual effects. They, just as the predictions, are numbers computed solely from the model. That is, they are independent of the observations, although they do depend on the values of the *independent* variables found in the data set. A population s.d.'s might be regarded as "a prediction of an unsigned error". Weighted residuals (WRES) have been a part of NONMEM output since the program's inception. The WRES are the RES expressed in (i.e. as fractions of) population standard deviation units. Although, the way this is done is a little complicated; see below. A plot of WRES vs a covariate (e.g. ID) can be used to make some assessment about whether the errors follow the description given for them under the population model. Ideally, the WRES should be homogeneously scattered about the zero line on the WRES axis, provided the description is a good one. For the scatter to be homogeneous, the amount of scatter in the WRES in each little region along the covariate axis, should be about the same. Ideally, all the WRES are almost indepen- dent of each other, even within the data from a given individual. That is, when viewing the data from a number of individuals, the correlation one usually sees within the data from a given individual, due to interindivi- dual differences, should not be seen within the WRES from a given indivi- dual, or at least should not be seen as nearly as much, provided (once again) the model well-describes (the variability and covariability of) the differences. Were this not so, the scatter in the plot of WRES vs ID would not appear to be homogeneous. The WRES are also routinely computed with the index data. The nominal homogeneous scatter of the WRES allow them to be used with this data during the model building process to detect a covariate influence that has not been modeled well enough. If one sees a trend in a plot of the WRES vs the covariate, rather than a homogeneous scatter, this means that the model is not describing the variability well enough; the model might be improved by more appropriate use of the covariate. The same is true with the valida- tion data, the only difference being that there may be no further opportun- ity for model improvement using validation data; the validation data set may have been be set aside *only for validation* purposes. On the other hand, suppose, for example, that the different values of AGE in the index data lie in a small range, so that an AGE influence was not detected with the index data, but that the values in the validation data are more widely dispersed, and, using this data, an influence seems to exist. Then one might contemplate changing the model *during the validation process* to include AGE. Perhaps one might take the point of view that the model with AGE distributed as with the index data can still be "validated", whereas the model with AGE distributed as with the validation data must remain "provisional". Suppose that for every individual in the validation set, there is just one DV item. Of course, by choosing one measurement per individual, this can always be arranged. Then with a given individual, there is just one RES and one WRES. The WRES is a simple normalization of the RES, viz. the RES divided by the population standard deviation of the observation. In this case there is another plot might be considered, one allowing a view of the data themselves along with an indication of a range within which the data might occur, as predicted by the model. This plot, as with another men- tioned above, must be developed outside of NONMEM, using a graphics package with a table output by NONMEM. It is a plot of the data vs covariate, superimposed on a plot of PRED +-2 population s.d. vs covariate. By PRED +- 2 population s.d., I mean that both the numbers PRED +2 population s.d. and PRED -2 population s.d. are plotted. A population standard deviation may be obtained using the formula population s.d. = RES/WRES. However, it should be understood that the number 2 has no magic meaning. Were (i) the data normally distributed, (ii) the estimates being used for the population parameters error-free, and (iii) the model description of the errors a good one, then about 95% of the data should lie within the PRED +- 2 population s.d. range. Since, though, assumption (i) need not hold (it is *not a part* of the model), nor need (ii) hold, then one can only use this plot in a qualitative way. In general, it may be more difficult to use this plot than a plot of WRES vs covariate, because the errors are only implicitly shown and are not adjusted to be homogeneously scattered (about the PRED curve). A little care may need to be taken with the computation of PRED +- 2 s.d. Suppose the index data (and presumably the validation data) are distributed in a rather skewed way (and suppose the data are concentrations, so that they are all nonnegative). Then it would be reasonable to have developed a population model for transformed data (for example, log transformed data). The structural model for the transformed data might have been obtained by taking a structural model for the untransformed data and applying the identical transformation to it (the "transform both sides" method). With this technique, predictions of untransformed data are obtainable, but the NONMEM ELS-type objective function would have been used to fit the differ- ences between transformed values of these predictions and transformed (index) data, and thus applied to differences that are approximately sym- metrically distributed. Then the plot in question (using the validation data) could be one of transformed data or one of untransformed data. If, though, untransformed data are plotted, then the two data-delimiting curves on which these data are superimposed are the *inverse transformed* PRED +- 2 s.d. curves, where PRED +- 2 s.d. refers here to those curves that would be used with the transformed data. To understand this, it may be instruc- tive to consider the plot if a transformation had not been used to fit the index data. In this case, the validation plot would suggest that much smaller errors occur at low concentrations than are predicted by the model, which, of course, would be the case. Indeed, the plot would suggest that there can be negative concentrations. When there are more than one DV items per individual, each WRES cannot be associated with a particular DV data item. In this case, each WRES for a given individual is a linear combination of the RES for that individual and thus is a function of all the data from that individual. In this case, there is no way (of which I know) to examine "predicted unsigned errors" and the data in the same plot. Vozeh (NONMEM bibliography: Data Analyses item 36) suggests computing the average RES (ARES) for each individual, i.e. the arithmetic average of all the RES from the individual. This is just the difference between the aver- age observation and the average prediction. The population standard devia- tion of the ARES, given by the model, can be computed, and a plot of the average observation vs covariate, superimposed on a plot of the average PRED +-2 population standard deviations, can be made. This plot does give one the ability to view observations, although they are *average observa- tions*, along with predicted unsigned errors. However, there is a problem interpreting the average observation and the ARES, since these are averages over a number of nonrandom heterogeneous conditions. For example, time, and dosage too perhaps, vary over an individual's record. Moreover, the ARES is but a single summary statistic of all the errors from the indivi- dual; some information about these errors will not be reflected in this statistic. It is one particular linear combination of the errors. In con- trast, each WRES from the individual is also a linear combination of the errors, and there are as many of these as there are observations from the individual. From all the WRES from the individual (and the population variance-covariance matrix of the data from this individual), all the errors can, in principle, be recomputed. For these reasons, and for the others mentioned above, it is not clear to me that one would not be just as well off with a plot of the WRES. Vozeh suggests performing a t-test, based on taking the average of the ARES across the individuals, to test the hypothesis that the model describes the errors in the validation data set. (The test uses the population s.d.'s of the different ARES values, and so, strictly speaking, is not a t-test, but is a z-test.) A test of this sort, testing the same hypothesis, can just as well be based on the average WRES over the entire set of WRES. (The population s.d. of each WRES is 1). Actually, neither test is an efficient one to test the hypothesis. Both tests are primarily ones of the hypothesis that the population average ARES or population average WRES is 0, and only secondarily do they test that the assumed population s.d.'s truly describe the magnitudes of the errors. Even so, they incorrectly assume that the estimates for the population parameter values are given without error. Finally, the very idea that a proper test of the hypothesis could be useful is questionable. How badly must the errors be described by the model before a clinical problem of some sort arises? FEEDBACK PREDICTIONS Feedback predictions of data from a given individual are based on estimates of individual-specific (PK/PD) parameter values for the individual, which in turn are based on data from that individual (feedback observations) other than those which are being predicted. The parameter estimates are obtained by means of Bayesian estimation. If the amount of data from the individual is very small, the estimates of many of the individual-specific parameters are essentially the estimates for the typical individual that are given by the population model. This is because as the amount of data decreases, individual-specific parameter Bayesian estimates become increas- ingly constrained by the "Bayes penalty term" in the Bayes objective func- tion to be "close to" the typical individual parameter estimates. As the amount of data per individual increases, the Bayesian term becomes less influential, and the individual-specific Bayesian parameter estimates become ELS estimates. If the amount of data per individual is large, the Bayesian estimates depend little on the population model, except for its basic structural form (and except for the intraindividual error model; see below). Hopefully, one has a pretty good idea about the basic structural form before embarking on the analysis of the index data. If, though, there remains some ambi- guity about this, or simply to check that things are as anticipated and that no mistakes have been made, the basic form can be validated by compar- ing the observations of the validation set with their feedback predictions. One way to do this is to run NONMEM with multiple problems, as many as there are individuals in the validation data set. With each problem, the basic structural model is fit to the data from a different individual, and thus feedback predictions for the individual's data are obtained. Bayesian estimation can be used (see e.g. NONMEM Guide II, section C), but, as men- tioned above, with much data per individual, the Bayes penalty term is negligible. It is sufficient simply to use ELS (the usual and default method for fitting single-subject data). All the separate ELS fits can be done within one NONMEM run by stacking all the problem specifications in a single control stream. In this case, the $DATA record with each problem points to the entire validation data set. However, with each problem, either the RECORDS option on the $DATA record must give the number of data records for that individual, or a FINISH record must terminate these particular data records (see Guide IV, chap. III, sec. B.5). Each problem involves single-subject data, amd with such data, the ID data items must differ between observation data records. NM- TRAN automatically generates appropriate ID data items, different from those included in the population data set (see Guide IV, chapter II, sec- tion C.4). However, an even simpler way to obtain feedback predictions, one which can be used with any amount of data per individual, is to actually use Bayesian estimation, implemented as follows. Construct a single problem, using the entire validation data set. Include POSTHOC and MAXEVALS=0 on the $ESTIMA- TION record. By setting MAXEVALS=0, the Estimation Step is not imple- mented, i.e. the population parameter values are just those given as ini- tial estimate values (or are those in the MSF output from the run fitting the model to the index data). The use of POSTHOC, however, directs Baye- sian estimation to be carried out with each individual's record; Bayesian estimates of all the eta's, for all individuals, are obtained, conditional on the population parameter values. Predictions based on these estimates are displayable in a table or scatterplot by including a statement such as IPRED=F (in $ERROR), and using the label IPRED in the $TABLE or $SCATTER- PLOT record. The name IPRED (for intraindividual predictions) is one I made up, you could use some other name in your abbreviated code. Intrain- dividual residuals, obtainable by also including the statement IRES=DV-F (in $ERROR), can also be plotted. (The estimates of an individual's eta's are essentially the estimates of that individual's parameters. For exam- ple, if the parameter is modeled as theta1 + theta2*WT + eta, then theta1 and theta2 are fixed in the computation of the estimate, as is WT (assuming WT doesn't change within the individual), and so the difference between the estimate of the parameter and the estimate of the eta is just the constant theta1 + theta2*WT.) The accuracy of the intraindividual error model can also be assessed. Just as there are population standard deviations, there are intraindividual standard deviations. These measure the magnitudes of the errors (the intraindividual errors) between an individual's observations and their feedback predictions due to random intraindividual effects only. They are computed under the intraindividual error model for the individual's data, which, in this discussion, is understood to be a model form together with parameter values *given by the Bayesian estimates*. With a heteroscedastic intraindividual error model which depends on feedback predictions, not only does the s.d. vary for different observations within the individual, but even when two individuals have the same values for *all* independent vari- ables, the s.d. can differ for the observations between the two individu- als. This is because the feedback observations are generally different between the two individuals, and therefore, so are the feedback predic- tions. Intraindividual residuals can be transformed to intraindividual weighted residuals, i.e. can be expressed in intraindividual standard deviation units. Unlike the situation with population weighted residuals, each intraindividual weighted residual can always be associated with a single observation; it is the residual divided by the s.d.. To obtain intraindi- vidual weighted residuals, include the statement IWRES=IRES/H, when Y is given by F+H*ERR. Plots of the data along with a predicted range, can also always be made. In this regard, note that the s.d. itself is given by IRES/IWRES. To some extent, residuals and weighted residuals can be pooled over indivi- duals. That is, a plot can include (for example) the residuals from all individuals. However, consider that fact that feedback predictions vary between individuals, if for no other reason than that the Bayesian esti- mates vary between individuals. One might wonder whether this variability, or more precisely - this source of variability, might affect the accuracy of the plot. Population predictions, in contrast, do not vary between individuals due to variability in parameter estimates. It seems intuitive to me that as long as the "precision" of the Baysian estimates is fairly constant across individuals, interindividual variability in feedback pred- ictions should not affect the accuracy of the plots. A problem might arise, though, when the feedback observations are obtained under very dif- ferent designs for different individuals. Moreover, the designs could differ in systematic, rather than random ways. Then the precision of the Bayesian estimates might vary considerably with the individual and in a nonrandom way. We have no experience that counsels us about this situa- tion. If necessary (or even in addition to pooled plots), residual and weighted residual plots can be stratified by individual (using list 3 of the $SCATTERPLOT record).## End of Topic - 11 Apr 94