## A Bayesian Approach to Regression Diagnostics

### Abstract

Regression is a widely-used statistical technique for studying how a response variable depends on one or more predictors. A regression analysis includes the construction of a model that attempts to describe this dependence in mathematical terms. Before the model is used to address questions about the relationship between response and predictors, the fit of the model to the data should be assessed, a task known as regression diagnostics. I propose methodology to aid in this task, pulling together traditional model-checking methods, recently-developed graphical approaches to regression, and modern Bayesian computational tools.

### Background

Suppose a physician wishes to decide whether a breast mass is malignant or benign without using (often painful) surgical procedures. A less invasive procedure is based on characteristics of cells extracted (painlessly) using a very narrow needle -- Fine Needle Aspiration (FNA). One approach to assessing this procedure is to construct a regression model to describe the dependence of the response (malignant or benign) on the predictors (cell characteristics). Diagnostic methods are then used to assess whether there is information in the data that contradicts this model. There could be severe consequences for the patient if the physician's decision were based on an ineffective model. Traditional diagnostic methods include plotting particular features of the model and calculating numerical summaries designed to reflect how well the model fits with respect to various optimality criteria. Cook (1998) describes additional, recently-developed methods for checking regression models; these methods emphasize a graphical approach to regression.

If the model provides a good description of the response variable's dependence on the predictors, then diagnostic plots and numerical summaries should reflect this fact. For example, the physician could use the model to estimate the probability of malignancy for each sampled patient based on their cell characteristics. This information could be summarized in a plot with one of the cell characteristics (average cell radius, say) on the horizontal axis and a curve representing the model-based estimate of the probability of malignancy on the vertical axis. Another curve representing model-free estimates of malignancy probability could be obtained by smoothing the response variable against the radius. (A simple way to calculate this smoother is to divide the horizontal axis into intervals, calculate the proportion of patients with malignant masses in each interval, and draw a curve through the corresponding points; in practice, more sophisticated smoothing techniques are possible.) Graphical checks on the model require comparing the model-based curve to the model-free curve -- a good model should have curves that are close to one another, no matter which cell characteristic is plotted on the horizontal axis. Numerical checks could compare model-based and model-free estimated malignancy probabilities for patients in particular groups (pre- and post-menopausal, say).
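The simple interval-based smoother described in parentheses above can be sketched as follows. This is only a minimal illustration: the function name `binned_proportions`, the choice of equal-width intervals, and the toy data in the usage note are my assumptions, not part of the proposed methodology.

```python
def binned_proportions(x, y, n_bins=5):
    """Model-free estimate of P(y = 1 | x) using equal-width intervals.

    x: predictor values (e.g. average cell radius)
    y: binary responses coded 0 (benign) or 1 (malignant)
    Returns (interval midpoint, proportion of 1s) for each non-empty interval.
    """
    lo, hi = min(x), max(x)
    if hi == lo:
        raise ValueError("x must take at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    positives = [0] * n_bins
    for xi, yi in zip(x, y):
        # The largest x value falls on the upper boundary; keep it in the last interval.
        k = min(int((xi - lo) / width), n_bins - 1)
        counts[k] += 1
        positives[k] += yi
    return [
        (lo + (k + 0.5) * width, positives[k] / counts[k])
        for k in range(n_bins)
        if counts[k] > 0
    ]
```

With made-up data, `binned_proportions(radius, malignant, n_bins=2)` returns one point per interval; joining these points gives the model-free curve that is compared with the model-based curve.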

However, judging whether diagnostic plots or numerical summaries have particular characteristics can be problematic in practice, particularly when low-dimensional plots (such as two-dimensional scatter-plots) are used as summaries of high-dimensional models with a large number of predictors. The analyst needs to compare the single best model estimate to a model-free estimate, with little guidance about how close together the two estimates need to be for the model to be effective; the variability in the model estimates is not being taken into account. This variability can be viewed from different philosophical viewpoints. Frequentist statisticians assess the variability indirectly through imagined repetitions of the study that gave rise to the data. On the other hand, Bayesian statisticians assess the variability directly by modifying their belief about a model in the light of data, and quantifying the variability in probabilistic terms. My approach to model assessment follows the Bayesian paradigm: repeatedly generate plausible values of a particular quantity from the estimated model (such as a curve representing estimated probability of malignancy), and then consider a model-free estimate of this same quantity within the context of the generated values. It is only with recent advances in computing that it is possible to use such an approach routinely for a wide variety of models.
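For a single numerical quantity (the proportion of malignant masses in a particular group, say), placing the model-free estimate "within the context of the generated values" could be as simple as the sketch below. The two-sided tail proportion used here is one illustrative summary that I have assumed for the example; the function name `tail_proportion` is hypothetical.

```python
def tail_proportion(generated, model_free):
    """Locate a model-free estimate among values generated from the model.

    generated: list of model-based values of the quantity (one per draw)
    model_free: the model-free estimate of the same quantity
    Returns a two-sided tail proportion in (0, 1]; small values mean the
    model-free estimate lies in the extremes of the generated values.
    """
    n = len(generated)
    at_or_above = sum(1 for g in generated if g >= model_free) / n
    at_or_below = sum(1 for g in generated if g <= model_free) / n
    return min(1.0, 2.0 * min(at_or_above, at_or_below))
```

A value near 1 says the model-free estimate sits in the middle of the generated values; a value near 0 says that few generated values are as extreme, which would call the model into question.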

This approach offers a way to overcome difficulties in judging patterns in diagnostic plots and gauging values of numerical summaries. Generated model quantities provide a back-drop against which a model-free quantity can be more easily assessed. Any systematic differences between the model-free quantity and generated model quantities indicate potential failings of the model. For instance, the physician could generate 100 model-based curves to represent probabilities of malignancy with respect to a cell characteristic. If the corresponding model-free curve appears particularly unusual in relation to the 100 generated model-based curves, then the model is called into question, and a different model may need to be considered. The nature of the unusualness of the model-free curve might suggest how an improved model may be found. For example, the model might appear to be adequate for younger women but not for older women, suggesting the need for a more complicated model for older women.
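One way to make "particularly unusual in relation to the 100 generated model-based curves" concrete is a pointwise band, sketched below. It assumes all curves have been evaluated on a common grid of predictor values; the 2.5% and 97.5% limits and the function names are assumptions chosen for illustration, and a pointwise band is only one of several ways to summarize a set of curves.

```python
def quantile(sorted_vals, q):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    i = int(pos)
    j = min(i + 1, len(sorted_vals) - 1)
    return sorted_vals[i] + (pos - i) * (sorted_vals[j] - sorted_vals[i])


def pointwise_band(curves, lower=0.025, upper=0.975):
    """Lower and upper limits of the generated curves at each grid point.

    curves: list of curves, each a list of values on the same grid.
    """
    band = []
    for j in range(len(curves[0])):
        vals = sorted(curve[j] for curve in curves)
        band.append((quantile(vals, lower), quantile(vals, upper)))
    return band


def points_outside_band(curves, model_free, lower=0.025, upper=0.975):
    """Indices of grid points where the model-free curve leaves the band."""
    band = pointwise_band(curves, lower, upper)
    return [
        j
        for j, ((lo, hi), value) in enumerate(zip(band, model_free))
        if value < lo or value > hi
    ]
```

Runs of consecutive flagged grid points (at large radii only, say) hint at where the model fails, which is the kind of information that might suggest an improved model.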

Generating model-based estimates to calibrate a model-free estimate can also address the common problem of over-fitting -- perceiving patterns in plots where there are none, and concluding that a different model is needed. For example, a model based on surgical procedures may appear to more accurately predict malignancy. However, when the variability in model estimates is taken into account, this model may offer little improvement over the FNA model, and would not justify the extra time, cost, and additional trauma to the patient. Thus, surgical procedures might be recommended only when results of the FNA procedure are inconclusive.

### Objectives

1. Apply modern computational methods of generating model samples to graphical regression diagnostic techniques for linear models, a frequently-used class of statistical models.
2. Extend the methods to more complex models, such as generalized linear models where the response variable is binary (can take on two values only).
3. Complement the graphical techniques with numerical methodology.
4. Compare performance of the new and existing methods using simulated and real data.

### Design and methodology

Gelman et al. (1996) motivate a sample-generating approach to model assessment. Samples can be generated with exact methods when closed-form calculations are possible (such as in simple linear models), and with approximate methods (such as Markov chain Monte Carlo techniques) when more complex models are required. Traditional and modern regression diagnostic techniques are discussed in Cook (1998) and Cook and Weisberg (1999).
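As an illustration of the exact case, the sketch below generates regression lines for a simple linear model with normal errors. It assumes the standard noninformative prior p(intercept, slope, sigma^2) proportional to 1/sigma^2, which is my choice for the example and is not specified above. Under that prior, sigma^2 can be drawn as the residual sum of squares divided by a chi-squared variate with n - 2 degrees of freedom, and, with the predictor centered, the intercept and slope are then independent normals given sigma^2. The function name `sample_regression_lines` is hypothetical.

```python
import random


def sample_regression_lines(x, y, n_draws=100, seed=0):
    """Exact posterior draws of (intercept, slope, sigma^2) for y = a + b*x + error.

    Assumes normal errors and the prior p(a, b, sigma^2) proportional to 1/sigma^2.
    Requires more than two observations and at least two distinct x values.
    """
    rng = random.Random(seed)
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx  # least-squares slope
    rss = sum((yi - y_bar - b_hat * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))

    draws = []
    for _ in range(n_draws):
        # Chi-squared with n - 2 degrees of freedom is Gamma(shape=(n-2)/2, scale=2).
        chi2 = rng.gammavariate((n - 2) / 2.0, 2.0)
        sigma2 = rss / chi2
        # With x centered, the two coefficients are independent given sigma^2.
        centered_intercept = rng.gauss(y_bar, (sigma2 / n) ** 0.5)
        slope = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)
        # Convert back to the intercept on the original x scale.
        draws.append((centered_intercept - slope * x_bar, slope, sigma2))
    return draws
```

Each draw defines one plausible regression line; evaluating the lines on a grid of predictor values yields generated model-based curves against which a model-free smooth can be judged. For models without closed-form posteriors, such as logistic regression, the draws would instead come from an approximate method such as Markov chain Monte Carlo.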

### Potential significance of the research

Scientists try to make sense of the world by formulating models for how they believe it works. These models need to be assessed relative to their objectives before they can be put to good use. For example, Newton's law is perfectly adequate for many purposes, but with respect to certain phenomena, Einstein's relativity becomes a necessary replacement model. Hill (1996) noted that model assessment is an important problem that "despite its long history, going back at least to Daniel Bernoulli's celebrated analysis of the planetary orbits, is still largely unsolved and controversial". I aim in this thesis to bring new ideas to this old problem.

Regression is a widely-used statistical technique for fitting models. Assessing the fit of a posited model is a crucial, yet often problematic, step in any regression analysis. I believe that my proposed methods can greatly ease the model-checking process. Widely available computer software allows samples to be generated from many models very easily, and, with modern computers, very quickly. I aim to develop graphical and numerical methods that are sufficiently simple that an analyst can easily implement them for routine use. This has the potential to enable statistical analyses to be carried out more easily and quickly than before, and also to give the analyst greater confidence that any model that passes the checks proposed can then be used as a serious component in a scientific investigation.

### References

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions through Graphics. New York: Wiley.

Cook, R. D. and S. Weisberg (1999). Applied Regression Including Computing and Graphics. New York: Wiley.

Gelman, A., X.-L. Meng, and H. Stern (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica 6, 733-807.

Hill, B. M. (1996). Comment on "Posterior predictive assessment of model fitness via realized discrepancies" by Gelman, A., X.-L. Meng, and H. Stern. Statistica Sinica 6, 767-773.

Pardoe, I. (to appear). A Bayesian sampling approach to regression model checking. Journal of Computational and Graphical Statistics.


Send me e-mail at ipardoe at lcbmail.uoregon.edu

Last updated: March 9, 2000
