Regression is a widely-used statistical technique for studying how a response variable depends on one or more predictors. A regression analysis includes the construction of a model that attempts to describe this dependence in mathematical terms. Before the model is used to address questions about the relationship between response and predictors, the fit of the model to the data should be assessed -- a process known as regression diagnostics. I propose methodology to aid in this task, pulling together traditional model-checking methods, recently-developed graphical approaches to regression, and modern Bayesian computational tools.

Suppose a physician wishes to decide whether a breast mass is
malignant or benign without using (often painful) surgical
procedures. A less invasive procedure is based on characteristics of
cells extracted (painlessly) using a very narrow needle -- *Fine
Needle Aspiration* (FNA). One approach to assessing this procedure is
to construct a regression model to describe the dependence of the
response (malignant or benign) on the predictors (cell
characteristics). Diagnostic methods are then used to determine
whether there is information in the data that contradicts the
model. There could be severe consequences for the patient if the
physician's decision were based on an ineffective model. Traditional
diagnostic methods include plotting particular features of the model
and calculating numerical summaries designed to reflect how well the
model fits with respect to various optimality criteria.
Cook (1998) describes additional, recently-developed methods
for checking regression models; these methods emphasize a graphical
approach to regression.

If the model provides a good description of the response variable's
dependence on the predictors, then diagnostic plots and numerical
summaries should reflect this fact. For example, the physician could
use the model to estimate the probability of malignancy for each
sampled patient based on their cell characteristics. This information
could be summarized in a plot with one of the cell characteristics
(average cell radius, say) on the horizontal axis and a curve
representing the model-based estimate of the probability of malignancy
on the vertical axis. Another curve representing *model-free*
estimates of malignancy probability could be obtained by
*smoothing* the response variable against the radius. (A simple way to
calculate this smoother is to divide the horizontal axis into
intervals, calculate the proportion of patients with malignant masses
in each interval, and draw a curve through the corresponding points;
in practice, more sophisticated smoothing techniques are possible.)
Graphical checks on the model require comparing the model-based curve
to the model-free curve -- a good model should have curves that are
close to one another, no matter which cell characteristic is plotted
on the horizontal axis. Numerical checks could compare model-based and
model-free estimated malignancy probabilities for patients in
particular groups (pre- and post-menopausal, say).
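The binned-proportion smoother described above can be sketched in a few lines. The data here are simulated for illustration only (they are not the FNA data), with a hypothetical logistic relationship between average cell radius and malignancy; all names are assumptions of the sketch:

```python
import random
import math

random.seed(1)

# Simulate hypothetical data: larger average cell radius corresponds to a
# higher probability of malignancy (an assumed logistic relationship).
n = 500
radius = [random.uniform(5.0, 30.0) for _ in range(n)]
true_prob = [1.0 / (1.0 + math.exp(-(r - 17.5) / 3.0)) for r in radius]
malignant = [1 if random.random() < p else 0 for p in true_prob]

def binned_smoother(x, y, n_bins=10):
    """Model-free estimate: proportion of y == 1 in each interval of x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins
    points = []
    for b in range(n_bins):
        left = lo + b * width
        right = left + width
        in_bin = [yi for xi, yi in zip(x, y)
                  if left <= xi < right or (b == n_bins - 1 and xi == hi)]
        if in_bin:
            midpoint = left + width / 2.0
            points.append((midpoint, sum(in_bin) / len(in_bin)))
    return points

curve = binned_smoother(radius, malignant)
for mid, prop in curve:
    print(f"radius ~ {mid:5.1f}: estimated P(malignant) = {prop:.2f}")
```

With data simulated this way, the per-interval proportions increase with radius, roughly tracing the logistic curve that generated the data; in practice a more sophisticated smoother would replace the crude binning.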

However, judging whether diagnostic plots or numerical summaries have
particular characteristics can be problematic in practice,
particularly when low-dimensional plots (such as two-dimensional
scatter-plots) are used as summaries of high-dimensional models
with a large number of predictors. The analyst must
compare a single best model-based estimate to a model-free
estimate, with little guidance about how *close together* the two
estimates need to be for the model to be effective; the variability in
the model estimate is not taken into account. This variability
can be viewed from different philosophical standpoints.
*Frequentist* statisticians assess the variability indirectly through
imagined repetitions of the study that gave rise to the data. On the
other hand, *Bayesian* statisticians assess the variability
directly by modifying their belief about a model in the light of data,
and quantifying the variability in probabilistic terms. My approach to
model assessment follows the Bayesian paradigm: repeatedly generate
plausible values of a particular quantity from the estimated model
(such as a curve representing estimated probability of malignancy),
and then consider a model-free estimate of this same quantity within
the context of the generated values. Only with recent advances in
computing has it become possible to use such an approach routinely for
a wide variety of models.

This approach offers a way to overcome difficulties in judging
patterns in diagnostic plots and gauging values of numerical
summaries. Generated model quantities provide a backdrop against
which a model-free quantity can be more easily assessed. Any
systematic differences between the model-free quantity and generated
model quantities indicate potential failings of the model. For
instance, the physician could generate 100 model-based curves to
represent probabilities of malignancy with respect to a cell
characteristic. If the corresponding model-free curve appears
particularly unusual in relation to the 100 generated model-based
curves, then the model is called into question, and a different model
may need to be considered. The nature of the *unusualness* of the
model-free curve might suggest how an improved model may be found. For
example, the model might appear to be adequate for younger women but
not for older women, suggesting the need for a more complicated model
for older women.
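A minimal sketch of this generate-and-compare idea, using simulated data rather than the FNA data: a deliberately inadequate model (constant probability of malignancy, ignoring the predictor) is fitted, 100 model-based curves are generated from it, and the model-free curve is examined against the resulting band. The data-generating mechanism and all names are illustrative assumptions:

```python
import random
import math

random.seed(2)

# Simulated data in which the response really does depend on the predictor.
n = 400
x = [random.uniform(0.0, 1.0) for _ in range(n)]
y = [1 if random.random() < 1.0 / (1.0 + math.exp(-6.0 * (xi - 0.5))) else 0
     for xi in x]

def bin_proportions(x, y, n_bins=8):
    """Proportion of y == 1 in each equal-width interval of x over [0, 1]."""
    props = []
    for b in range(n_bins):
        left, right = b / n_bins, (b + 1) / n_bins
        in_bin = [yi for xi, yi in zip(x, y) if left <= xi < right]
        props.append(sum(in_bin) / len(in_bin) if in_bin else 0.0)
    return props

observed = bin_proportions(x, y)           # the model-free curve

# A deliberately inadequate model: constant probability, ignoring x.
p_hat = sum(y) / n

# Generate 100 model-based curves by simulating responses from the model.
replicated = []
for _ in range(100):
    y_rep = [1 if random.random() < p_hat else 0 for _ in x]
    replicated.append(bin_proportions(x, y_rep))

# In each interval, check whether the model-free estimate falls outside
# the band of generated model-based estimates.
outside = 0
for b, obs in enumerate(observed):
    rep_b = [curve[b] for curve in replicated]
    if obs < min(rep_b) or obs > max(rep_b):
        outside += 1

print(f"intervals where the model-free curve leaves the band: {outside}/8")
```

Because the fitted model ignores the predictor, the model-free curve escapes the band in the extreme intervals, and the *shape* of the escape (low on the left, high on the right) points toward the missing dependence -- exactly the kind of diagnosis described above.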

Generating model-based estimates to calibrate a model-free estimate
can also address the common problem of *over-fitting* -- perceiving
patterns in plots where there are none, and concluding that a different
model is needed. For example, a model based on surgical procedures
may appear to more accurately predict malignancy. However, when the
variability in model estimates is taken into account, this model may
offer little improvement over the FNA model, and would not
justify the extra time, cost, and additional trauma to the
patient. Thus, surgical procedures might be recommended only when
results of the FNA procedure are inconclusive.

The specific aims of this research are to:

- Apply modern computational methods of generating model samples to graphical regression diagnostic techniques for linear models, a frequently-used class of statistical models.
- Extend these methods to more complex models, such as generalized linear models in which the response variable is binary (can take only two values).
- Complement the graphical techniques with numerical methodology.
- Compare the performance of the new and existing methods using simulated and real data.

Gelman et al. (1996) motivate a sample-generating approach to model assessment. Samples can be generated with exact methods when closed-form calculations are possible (as in simple linear models), and with approximate methods (such as Markov chain Monte Carlo techniques) when more complex models are required. Traditional and modern regression diagnostic techniques are discussed in Cook (1998) and Cook and Weisberg (1999).
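To illustrate the exact-sampling case, the sketch below draws from the closed-form posterior of a simple straight-line model, assuming a flat prior and a known error standard deviation purely to keep the example self-contained (a real analysis would also treat the variance as unknown); the data are simulated:

```python
import random
import math

random.seed(3)

# Simulated data from a straight-line model with known error sd.
n, sigma = 50, 1.0
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, sigma) for xi in x]

# Least-squares fit with x centred, so that under a flat prior (and known
# sigma) the posterior factorises into two independent normals:
#   level at x_bar ~ N(y_bar, sigma^2 / n),  slope ~ N(slope_hat, sigma^2 / Sxx).
x_bar = sum(x) / n
y_bar = sum(y) / n
xc = [xi - x_bar for xi in x]
sxx = sum(xi * xi for xi in xc)
slope_hat = sum(xi * yi for xi, yi in zip(xc, y)) / sxx

# Exact posterior draws -- each pair defines a plausible fitted line;
# no Markov chain Monte Carlo is needed for this simple model.
draws = [(random.gauss(y_bar, sigma / math.sqrt(n)),
          random.gauss(slope_hat, sigma / math.sqrt(sxx)))
         for _ in range(1000)]

slopes = [b for _, b in draws]
mean_slope = sum(slopes) / len(slopes)
print(f"posterior mean of slope: {mean_slope:.3f} "
      f"(least-squares estimate: {slope_hat:.3f})")
```

Each draw corresponds to one plausible regression line; plotting many such lines gives exactly the kind of backdrop of generated model-based curves described earlier.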

Scientists try to make sense of the world by formulating models for how they believe it works. These models need to be assessed relative to their objectives before they can be put to good use. For example, Newton's laws are perfectly adequate for many purposes, but with respect to certain phenomena, Einstein's relativity becomes a necessary replacement model. Hill (1996) noted that model assessment is an important problem that "despite its long history, going back at least to Daniel Bernoulli's celebrated analysis of the planetary orbits, is still largely unsolved and controversial". I aim in this thesis to bring new ideas to this old problem.

Regression is a widely-used statistical technique for fitting models. Assessing the fit of a posited model is a crucial, yet often problematic, step in any regression analysis. I believe that my proposed methods can greatly ease the model-checking process. Widely available computer software allows samples to be generated from many models very easily, and, with modern computers, very quickly. I aim to develop graphical and numerical methods that are sufficiently simple that an analyst can easily implement them for routine use. This has the potential to enable statistical analyses to be carried out more easily and quickly than before, and also to give the analyst greater confidence that any model that passes the checks proposed can then be used as a serious component in a scientific investigation.

Cook, R. D. (1998). *Regression Graphics: Ideas for Studying
Regressions through Graphics*. New York: Wiley.

Cook, R. D. and S. Weisberg (1999). *Applied Regression Including Computing and Graphics*. New York: Wiley.

Gelman, A., X.-L. Meng, and H. Stern (1996). Posterior predictive
assessment of model fitness via realized discrepancies (with
discussion). *Statistica Sinica 6*, 733-807.

Hill, B. M. (1996). Comment on "Posterior predictive assessment of
model fitness via realized discrepancies" by Gelman, A., X.-L. Meng,
and H. Stern. *Statistica Sinica 6*, 767-773.

Pardoe, I. (to appear). A Bayesian sampling approach to regression
model checking. *Journal of Computational and Graphical
Statistics*.


Send me e-mail at ipardoe at lcbmail.uoregon.edu

*Last updated: March 9, 2000*

*The views and opinions expressed in this page are strictly those of
the page author. The contents of this page have not been reviewed or
approved by the University of Oregon.*

*© 2000, Iain Pardoe, Lundquist College of Business, University of
Oregon*