From: "Piotrovskij, Vladimir [JanBe]" <VPIOTROV@janbe.jnj.com>
Subject: Stepwise regression
Date: Tue, 21 Nov 2000 11:21:26 +0100

Dear NM-users,

There was recently a discussion on the S+ users list about stepwise regression. Frank Harrell provided a reference to the following web page, which discusses the pitfalls of stepwise regression:

http://www.pitt.edu/~wpilib/statfaq/regrfaq.html

The bottom line was: "never do it." But we are doing it every day!

Any opinions?

Best regards,
Vladimir


*****


From: "Niclas Jonsson" <Niclas.Jonsson@biof.uu.se>
Subject: Re: Stepwise regression
Date: Tue, 21 Nov 2000 14:38:50 +0200 (EET)

Dear Vladimir,

One way of summarizing Frank Harrell's view on stepwise model selection is:

All peeking at the data before fitting your model to it causes bias.

The reason for this is that you tend to follow the signals/patterns you find (be it graphically or by stepwise model selection), and there is no way of telling whether the pattern you see is a true one or the result of randomness. In addition, even if the pattern you see is "true", there is a rather high likelihood that you spotted it just because it was particularly strong in the data sample at hand. This applies to both the structural components of our models and the covariate components.
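The selection effect described above can be illustrated with a small simulation. This is only a sketch under assumed settings: the number of subjects, the number of candidate covariates and the function names are hypothetical, and every covariate is pure noise, unrelated to the response. It compares the average absolute correlation of one pre-specified covariate with that of the covariate picked because it looked strongest.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_subjects=50, n_covariates=10, n_trials=500, seed=1):
    """Return mean |r| for a pre-specified covariate and for the 'best' one.

    All covariates are independent noise, so any apparent relationship
    with the response is due to chance alone (assumed setup).
    """
    rng = random.Random(seed)
    prespecified = []
    selected = []
    for _ in range(n_trials):
        y = [rng.gauss(0, 1) for _ in range(n_subjects)]
        r = []
        for _ in range(n_covariates):
            x = [rng.gauss(0, 1) for _ in range(n_subjects)]
            r.append(abs(pearson(x, y)))
        prespecified.append(r[0])  # chosen before looking at the data
        selected.append(max(r))    # chosen because it looked strongest
    return sum(prespecified) / n_trials, sum(selected) / n_trials


mean_prespecified, mean_selected = simulate()
print(f"mean |r|, pre-specified covariate: {mean_prespecified:.3f}")
print(f"mean |r|, covariate selected after peeking: {mean_selected:.3f}")
```

The selected covariate looks stronger on average than the pre-specified one although neither has any real effect, which is the bias introduced by choosing after looking.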

This is of course a pity, since all the model development we do tends to be driven by graphical, or other exploratory, means. Indeed, most of us would say that *not* looking at the data to see what model to use is unwise.

The apparent conflict in views here stems, I think, from the kind of applications we have. Mechanistic PK/PD modeling is very much a learning exercise, i.e. to a large extent we don't know what model we will have to use before we get the data. And even if we have a fair idea towards the end of drug development, we usually have to handle information about new aspects that couldn't have been studied in earlier development phases (e.g. drug-drug interactions), or we might have to handle the situation of much sparser data. In the more traditional world of confirmatory statistics, the structure of the data tends to be more standardized and it is consequently easier to pre-specify the data analysis method.

In summary, I think that we should be aware of the dangers of stepwise model building but realize that this kind of approach is an essential component in our efforts to make sense out of the data we are to analyze.

(I also attended the Frank Harrell course Mats mentioned and it *was* very good!)

Best regards,

Niclas Jonsson
--
Department of Pharmacy
Uppsala University
Box 580
SE-751 23 Uppsala
Sweden
Phone: +46 18 471 43 85
Fax: +46 18 471 40 03
Mobile: +46 70 485 61 98
E-mail: niclas.jonsson@biof.uu.se


*****


From: Mats Karlsson <Mats.Karlsson@biof.uu.se>
Subject: Re: Stepwise regression
Date: Tue, 21 Nov 2000 13:41:10 +0100

Dear Vladimir,

I attended a course given by Frank Harrell recently (he was very good). It was very sobering and I think we will have to rethink what we're doing. I certainly have no new ideas, but for what it's worth, I think covariate-parameter relationships should be divided into two categories: those for which I want to determine the covariate-parameter relationship ("confirmatory") and those I want to screen for hypothesis generation ("exploratory"). Covariate model building could then be divided into 3 steps:

Step 1. Include all confirmatory covariate-parameter relationships in the model and estimate the coefficients. This will result in the full (and possibly also final) model. CIs of the covariate coefficients can be obtained by a suitable method, as recently discussed here on nmusers. These coefficients should be unbiased (up to the accuracy of the method used).

Step 2. Maybe not necessary, but relationships of no influence in Step 1 are eliminated to result in a smaller "final" confirmatory model.

Step 3. Start from the final confirmatory model and do a stepwise search according to one of the methods we use today to obtain an exploratory model, which I would put less trust in compared to the confirmatory model. The coefficients are likely to be biased.
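Step 3 above could be sketched roughly as follows. This is not an implementation of any particular published procedure: `forward_search`, `fit_ofv` and the toy objective function values are hypothetical stand-ins, and the 3.84 threshold (chi-square, 1 degree of freedom, alpha=0.05) is just one assumed choice. In practice `fit_ofv` would run the actual model fit for a given set of covariate-parameter relationships.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Relation = Tuple[str, str]  # (parameter, covariate), e.g. ("CL", "WT")


def forward_search(
    confirmatory: Sequence[Relation],
    candidates: Sequence[Relation],
    fit_ofv: Callable[[FrozenSet[Relation]], float],
    threshold: float = 3.84,  # assumed entry criterion on the OFV drop
) -> Tuple[List[Relation], float]:
    """Forward inclusion that starts from the confirmatory model.

    Confirmatory relationships are always kept; only the exploratory
    candidates are added one at a time, largest OFV drop first.
    """
    current = list(confirmatory)
    current_ofv = fit_ofv(frozenset(current))
    remaining = [c for c in candidates if c not in current]
    while remaining:
        trials = [(fit_ofv(frozenset(current + [c])), c) for c in remaining]
        best_ofv, best = min(trials)
        if current_ofv - best_ofv < threshold:
            break  # no remaining candidate meets the entry criterion
        current.append(best)
        remaining.remove(best)
        current_ofv = best_ofv
    return current, current_ofv


# Toy stand-in for a model fit: each relationship lowers the OFV by a
# fixed, made-up amount. These numbers are purely illustrative.
TOY_DROPS = {("CL", "WT"): 25.0, ("V", "WT"): 12.0,
             ("CL", "AGE"): 5.0, ("CL", "SEX"): 1.0}


def toy_fit_ofv(relations: FrozenSet[Relation]) -> float:
    return 1000.0 - sum(TOY_DROPS[r] for r in relations)


exploratory_model, exploratory_ofv = forward_search(
    confirmatory=[("CL", "WT")],
    candidates=[("V", "WT"), ("CL", "AGE"), ("CL", "SEX")],
    fit_ofv=toy_fit_ofv,
)
print(exploratory_model, exploratory_ofv)
```

Keeping the confirmatory relationships outside the search is the point of the design: only the exploratory additions are subject to selection, so only their coefficients carry the selection bias.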

I would then use the confirmatory and exploratory models for partly different purposes. In addition, I would try to be mechanistic in the selection of relationships (as advocated by e.g. Nick Holford) and avoid the use of highly correlated covariates (i.e. not use WT, HT, BMI, IBW, BSA, ... as alternative covariates to be selected between).

I haven't really implemented this yet and it is a bit cumbersome, so I hope someone can tell me that there is an easier way to handle this problem!

Best regards,
Mats
--
Mats Karlsson, PhD
Professor of Biopharmaceutics and Pharmacokinetics
Div. of Biopharmaceutics and Pharmacokinetics
Dept of Pharmacy
Faculty of Pharmacy
Uppsala University
Box 580
SE-751 23 Uppsala
Sweden
phone +46 18 471 4105
fax +46 18 471 4003
mats.karlsson@biof.uu.se


*****


From: James Wright <J.G.Wright@ncl.ac.uk>
Subject: Re: Stepwise regression
Date: Thu, 23 Nov 2000 15:14:47 +0000

Dear nmusers,

If anyone is in doubt about the validity of the advice below, "Subset Selection in Regression" by A. Miller is extremely convincing. It focuses on settings with far more data than we conventionally have, and stepwise selection still fails.

Samples rarely contain enough data to make numerous (conditioned) decisions for you - use other sources of knowledge whenever you can.

Neural networks are an interesting alternative approach to covariate models, with their own pitfalls.

Regards,

James Wright


*****


From: "J.G. Wright" <J.G.Wright@newcastle.ac.uk>
Subject: Re: Stepwise regression
Date: Fri, 24 Nov 2000 15:30:30 +0000 (GMT)

Dear Niclas,

The bias issue applies to any model-building procedure and can be avoided by splitting your data. Strictly speaking, one should select, estimate and evaluate on 3 different data sets. However, the problem with stepwise regression is more fundamental - if you try to make a series of conditioned hypothesis tests you are almost certain to go wrong at some point, and then future hypothesis tests are conditioned on a misleading model. Stepwise regression is a recipe for disaster.
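The three-way split mentioned above might look like this in outline. It is a minimal sketch: the function name, the 40/30/30 fractions and the choice to split by subject rather than by observation are all assumptions made for illustration.

```python
import random


def split_subjects(subject_ids, fractions=(0.4, 0.3, 0.3), seed=0):
    """Randomly split subject IDs into select / estimate / evaluate sets.

    Splitting is done by subject so that all observations from one
    individual end up in the same set (an assumed design choice).
    """
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    n_select = round(fractions[0] * len(ids))
    n_estimate = round(fractions[1] * len(ids))
    return {
        "select": ids[:n_select],                         # choose the model
        "estimate": ids[n_select:n_select + n_estimate],  # estimate parameters
        "evaluate": ids[n_select + n_estimate:],          # judge predictions
    }


parts = split_subjects(range(1, 101))
for name, ids in parts.items():
    print(name, len(ids))
```

The obvious cost is that each step sees only part of the data, which is rarely affordable with the sample sizes typical of population analyses.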

However, we do still need to build models...perhaps we should focus on ways to make decision making more robust. These basically fall into two categories - be more cautious or use more data. Being more cautious loses power and using more data is either expensive or subjective...

There are also various methodologies that supposedly help...

Regards, James


*****


From: Michael J Fossler <Michael.J.Fossler@dupontpharma.com>
Subject: Stepwise regression
Date: 27 Nov 2000 13:20:46 -0500

I have been following this thread with great interest. It should be noted that others (such as Hosmer and Lemeshow in their excellent book on logistic regression) have a kinder view of stepwise regression. It seems to me that many of the pitfalls of this method can be avoided by selecting reasonable alpha values for covariate entry and retention, along with a healthy dose of common sense.
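How much the choice of alpha matters can be gauged with a rough calculation. The sketch below assumes the tests are independent and that none of the candidate covariates has a true effect, which is a simplification (sequential, conditioned tests in a stepwise search are not independent); the function name is hypothetical.

```python
def chance_of_false_inclusion(alpha, n_tests):
    """Probability of at least one false positive among n_tests
    independent tests at level alpha, when no true effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


for alpha in (0.05, 0.01, 0.001):
    for n_tests in (1, 5, 10, 20):
        p = chance_of_false_inclusion(alpha, n_tests)
        print(f"alpha={alpha:<6} tests={n_tests:<3} P(any false inclusion)={p:.3f}")
```

Under these assumptions a stricter entry criterion reduces the chance of including a spurious covariate substantially, at the price of lower power to detect real ones.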

One additional comment: The advice that one should not look at plots of your data before modeling it strikes me as bizarre to say the very least, not to mention unscientific. Are we really so easily swayed that one peek at the data will ruin us for all eternity? I take quite the opposite view: ALWAYS look at plots of your data before you model, but set (a priori) criteria which will be used to include or exclude variables when you start modeling. Finally, as is so often noted in this forum, know the biology of your system so that the importance (or lack thereof) of a covariate can be judged scientifically.

Just another example to support my view that statistics is too important to be left to statisticians...

Mike
*****************************************************************************
Michael J. Fossler
Senior Research Scientist
Drug Metabolism and Pharmacokinetics, DuPont Pharmaceuticals
(302) 366-6445
michael.j.fossler@dupontpharma.com
*****************************************************************************


*****


From: "Piotrovskij, Vladimir [JanBe]" <VPIOTROV@janbe.jnj.com>
Subject: RE: Stepwise regression
Date: Mon, 11 Dec 2000 12:29:13 +0100

Thanks to all who responded to my posting about stepwise regression. Everybody agreed it is not the right way to build a covariate model; however, nobody suggested a universal solution. What is clear is that we should not rely only on statistical significance and should always explore the data before starting model building.

Actually, graphical analysis should guide the model development. Suppose we have selected a structural model and fitted it to the data. The next step can be plotting random effects (ETAs) versus the available covariates and visually selecting the covariate which has the most significant impact on one of the PK parameters. Looking at the plots may also suggest a functional form of the relationship. Of course, the eye should be well trained... After fitting a model with the covariate included, the new set of ETAs should be examined again. The log-likelihood difference and AIC/BIC should be used to confirm the significance of the covariate, but not to guide model building.

In the case of highly correlated covariates like body size variables, one has to choose the most practical one: body weight, unless there is strong evidence that a derived variable like LBM or BSA can predict, say, CL much better. It should always be kept in mind that the goal of population PK modeling is to support treatment optimization: there is no need to include clinically irrelevant effects even if they are statistically significant at, say, alpha=0.05.
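The confirmation step (log-likelihood difference, AIC, BIC) can be sketched as below. The assumptions are that the objective function value (OFV) is minus twice the log-likelihood up to a constant, that the two models are nested and differ by exactly one parameter, and that the number of observations is an acceptable sample size for BIC; the OFVs and counts are made-up numbers and the function names are hypothetical.

```python
import math


def lrt_p_value_1df(delta_ofv):
    """P-value of a likelihood ratio test with 1 degree of freedom.

    For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    """
    if delta_ofv <= 0:
        return 1.0
    return math.erfc(math.sqrt(delta_ofv / 2.0))


def aic(ofv, n_params):
    """Akaike criterion, with OFV taken as -2 log-likelihood."""
    return ofv + 2 * n_params


def bic(ofv, n_params, n_obs):
    """Schwarz criterion, with OFV taken as -2 log-likelihood."""
    return ofv + n_params * math.log(n_obs)


# Hypothetical fits: base model versus the same model plus one covariate effect.
base_ofv, base_params = 1250.0, 7
cov_ofv, cov_params = 1242.0, 8
n_obs = 400

delta = base_ofv - cov_ofv
p_value = lrt_p_value_1df(delta)
print(f"delta OFV = {delta:.1f}, p = {p_value:.4f}")
print(f"AIC: base {aic(base_ofv, base_params):.1f}, covariate {aic(cov_ofv, cov_params):.1f}")
print(f"BIC: base {bic(base_ofv, base_params, n_obs):.1f}, "
      f"covariate {bic(cov_ofv, cov_params, n_obs):.1f}")
```

None of these numbers says anything about clinical relevance, which still has to be judged from the size of the estimated effect.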

Best regards,
Vladimir


*****


From: Mats Karlsson <Mats.Karlsson@biof.uu.se>
Subject: Re: Stepwise regression
Date: Mon, 11 Dec 2000 22:32:18 +0100

Dear Vladimir,

I, and I imagine most modellers, agree that your recommendations are as good as any single procedure can be. However, we shouldn't ignore the problems even with this. For example, even if they are very attractive (and should be pursued), graphical procedures aren't without problems. Parameters (or ETAs) can be imprecise, biased and vary within subjects over time. Covariates may be correlated and time-varying, rendering "normal" graphics relatively uninformative. Trends and significances are not easy to assess graphically (analyst-dependent?), and how are we to choose which relationship to include first (which parameter? categorical versus continuous covariate?). If graphics based on ETAs are used, should we include covariances in the basic model (Diane Mould's question that didn't seem to get an answer)? If a single covariate is influencing several parameters, should we add it stepwise or simultaneously on these? In addition, purely graphical procedures for inclusion may be difficult to specify and time-consuming (this is an issue for most people even if, in the ideal case, we wouldn't need to take it into consideration). Also, using graphics doesn't really address the problem that started the discussion. Stepwise procedures using graphics may suffer from exactly the problems you first mentioned.

I think we need to continue to investigate different model building procedures to learn more about which ones are suitable for which type of model building (number of parameters, number of covariates, complexity and purpose of the model (treatment optimization, hypothesis generation, clinical trial simulation, ...), etc.). Also, scientific plausibility/mechanistic interpretability and, as you mentioned, clinical significance should probably be given a more prominent role than at present.

Best regards,
Mats