Stata – Applied Regression Modeling, 2nd edition

These instructions accompany Applied Regression Modeling by Iain Pardoe, 2nd edition, published by Wiley in 2012. The numbered items cross-reference with the "computer help" references in the book. These instructions are based on Stata 8 for Windows, but they should also work for other versions. Find instructions for other statistical software packages here.

Getting started and summarizing univariate data

Change Stata's default options by selecting ?.
To open a Stata data file, type use "file.dta", where file.dta is the name of the data file (with the correct path specified if necessary). You can also import text data files (e.g., comma- or tab-delimited files saved from Excel) by typing insheet using "file.txt" or by selecting File > Import.
To recall a previously entered command, single-click it in the "Review" window.
Output appears in the "Stata Results" window and can be copied and pasted from Stata to a word processor like OpenOffice Writer or Microsoft Word. Graphs appear in separate windows and can also easily be copied using Edit > Copy Graph and then pasted to other applications.
You can access help by selecting Help > Contents. To find out about a particular topic, click Help > Search; to find out about a particular Stata command, click Help > Stata Command.
To transform data or compute a new variable, type, for example, generate logX=ln(X) for the natural logarithm of X and generate Xsq=X^2 for X². If you get the error message "[?]," this means that there is a syntax error in your expression—a common mistake is to forget the multiplication symbol (*) between a number and a variable (e.g., 2*X represents 2X).
To create indicator (dummy) variables from a qualitative variable, type, for example, generate D1=(X=="level"), where X is the qualitative variable and level is the name of one of the categories in X. Repeat for other indicator variables (if necessary).
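As a rough illustration of the two items above, the following sketch transforms variables and builds an indicator variable. It assumes the auto example dataset that ships with Stata (loaded with sysuse); the dataset and the new variable names logprice, weightsq, and D1 are not from the book and are chosen only for illustration.

```stata
* Load the auto example dataset that ships with Stata (an assumption; not from the book)
sysuse auto, clear
* Natural logarithm and square of a quantitative variable
generate logprice = ln(price)
generate weightsq = weight^2
* Indicator equal to 1 for foreign cars and 0 otherwise
* (foreign is stored as a number, so the comparison uses 1 rather than a quoted string)
generate D1 = (foreign == 1)
list price logprice weight weightsq foreign D1 in 1/5
```

When the qualitative variable is stored as text, compare against a quoted string instead, as in generate D1=(X=="level").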
- To find a percentile (critical value) for a t-distribution, type display invttail(df, p), where p is the one-tail significance level (upper-tail area) and df is the degrees of freedom. For example, display invttail(29, 0.05) returns the 95th percentile of the t-distribution with 29 degrees of freedom (1.699), which is the critical value for an upper-tail test with a 5% significance level. By contrast, display invttail(29, 0.025) returns the 97.5th percentile of the t-distribution with 29 degrees of freedom (2.045), which is the critical value for a two-tail test with a 5% significance level.
- To find a percentile (critical value) for an F-distribution, type display invFtail(df1, df2, p), where p is the significance level (upper-tail area), df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. For example, display invFtail(2, 3, 0.05) returns the 95th percentile of the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (9.552).
- To find a percentile (critical value) for a chi-squared distribution, type display invchi2tail(df, p), where p is the significance level (upper-tail area) and df is the degrees of freedom. For example, display invchi2tail(2, 0.05) returns the 95th percentile of the chi-squared distribution with 2 degrees of freedom (5.991).
- To find an upper-tail area (one-tail p-value) for a t-distribution, type display ttail(df, t), where t is the absolute value of the t-statistic and df is the degrees of freedom. For example, display ttail(29, 2.40) returns the upper-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.012), which is the p-value for an upper-tail test. By contrast, display 2*ttail(29, 2.40) returns the two-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.023), which is the p-value for a two-tail test.
- To find an upper-tail area (p-value) for an F-distribution, type display Ftail(df1, df2, f), where f is the value of the F-statistic, df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. For example, display Ftail(2, 3, 51.4) returns the upper-tail area (p-value) for an F-statistic of 51.4 for the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (0.005).
- To find an upper-tail area (p-value) for a chi-squared distribution, type display chi2tail(df, chisq), where chisq is the value of the chi-squared statistic and df is the degrees of freedom. For example, display chi2tail(2, 0.38) returns the upper-tail area (p-value) for a chi-squared statistic of 0.38 for the chi-squared distribution with 2 degrees of freedom (0.827).
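The percentile and tail-area functions above can be tried together in one short run. This sketch simply repeats the worked examples from the items above; the rounded values in the comments are the ones quoted there.

```stata
* Percentiles (critical values); the second argument is the upper-tail area
display invttail(29, 0.05)
* about 1.699
display invttail(29, 0.025)
* about 2.045
display invFtail(2, 3, 0.05)
* about 9.552
display invchi2tail(2, 0.05)
* about 5.991

* Upper-tail areas (p-values)
display ttail(29, 2.40)
* about 0.012 (one-tail)
display 2*ttail(29, 2.40)
* about 0.023 (two-tail)
display Ftail(2, 3, 51.4)
* about 0.005
display chi2tail(2, 0.38)
* about 0.827
```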
Calculate descriptive statistics for quantitative variables by typing summarize Y, where Y is the quantitative variable. Type summarize Y, detail to include more statistics. To request specific statistics only, type, for example, tabstat Y, statistics(mean median sd min max).
Create contingency tables or cross-tabulations for qualitative variables by typing tabulate X1 X2, where X1 and X2 are the qualitative variables. Calculate row percentages by typing tabulate X1 X2, row. Calculate column percentages by typing tabulate X1 X2, column.
If you have a quantitative variable and a qualitative variable, you can calculate descriptive statistics for cases grouped in different categories by typing, for example,
sort X
by X: summarize Y
where Y is the quantitative variable and X is the qualitative variable.
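A minimal sketch of grouped descriptive statistics, assuming the auto example dataset that ships with Stata (not from the book), with price as the quantitative variable and foreign as the qualitative variable:

```stata
* Load the auto example dataset (an assumption made for illustration)
sysuse auto, clear
* Overall descriptive statistics, with extra detail
summarize price, detail
* Descriptive statistics within each category of foreign
sort foreign
by foreign: summarize price
```

The data must be sorted by the grouping variable before the by prefix is used, which is why the sort command comes first.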
To make a stem-and-leaf plot for a quantitative variable, type stem Y, round(d), where Y is the quantitative variable and d controls the rounding (e.g., "1" for integers, "0.1" for tenths, etc.).
To make a histogram for a quantitative variable, type histogram Y, bin(10), where Y is the quantitative variable and the number in parentheses after bin specifies the number of bins.
To make a scatterplot with two quantitative variables, type graph twoway scatter Y X, where Y is the vertical axis variable and X is the horizontal axis variable.
All possible scatterplots for more than two variables can be drawn simultaneously (called a scatterplot matrix) by typing graph matrix Y X1 X2, where Y, X1, and X2 are quantitative variables.
You can mark or label cases in a scatterplot with different colors/symbols according to categories in a qualitative variable by using the separate command. For example, suppose X2 contains values 1 and 2 to represent two categories, and Y and X1 are two quantitative variables. Then the following code produces a scatterplot with different symbols (representing the values of X2) marking the points:
separate Y, by(X2)
graph twoway (scatter Y1 X1, mlabel(X2)) (scatter Y2 X1, mlabel(X2))
You can identify individual cases in a scatterplot by using the mlabel option, for example, graph twoway scatter Y X, mlabel(id), where X is the horizontal axis variable, Y is the vertical axis variable, and id is a variable containing labels for the points.
To remove one or more observations from a dataset, type, for example, drop if ID==1, which would remove the observation with ID 1. Stata commands are case-sensitive, so drop must be typed in lowercase.
To make a bar chart for cases in different categories, use graph bar.
- For frequency bar charts of one qualitative variable, type graph bar (count) X1, over(X1), where X1 is a qualitative variable.
- For frequency bar charts of two qualitative variables, type graph bar (count) X1, over(X1) over(X2), where X1 and X2 are qualitative variables.
- The bars can also represent various summary functions for a quantitative variable. For example, to produce a bar chart of means, type graph bar (mean) Y, over(X1) over(X2), where X1 and X2 are the qualitative variables and Y is a quantitative variable.
To make boxplots for cases in different categories, use graph box.
- For just one qualitative variable, type graph box Y, over(X1), where Y is a quantitative variable and X1 is a qualitative variable.
- For two qualitative variables, type graph box Y, over(X1) over(X2), where Y is a quantitative variable, and X1 and X2 are qualitative variables.
To make a QQ-plot (also known as a normal probability plot) for a quantitative variable, type qnorm Y, where Y is a quantitative variable.
To compute a confidence interval for a univariate population mean, type ci Y, level(95), where Y is the variable for which you want to calculate the confidence interval, and the value in parentheses after level is the confidence level of the interval, expressed as a percentage.
To do a hypothesis test for a univariate population mean, type ttest Y==value, where Y is the variable for which you want to do the test and value is the (null) hypothesized value.
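The sketch below shows what lies behind the confidence interval and the t-test for a population mean. It assumes the auto example dataset that ships with Stata and a hypothesized mean of 6000; both are illustrative choices, not from the book. The interval is computed by hand from the results saved by summarize, so it does not depend on the exact ci syntax of a given release.

```stata
* Load the auto example dataset (an assumption made for illustration)
sysuse auto, clear
* Sample mean and its standard error from the results saved by summarize
quietly summarize price
scalar xbar = r(mean)
scalar se = r(sd)/sqrt(r(N))
* 97.5th percentile of the t-distribution with n-1 degrees of freedom
scalar tcrit = invttail(r(N) - 1, 0.025)
display "95% CI for the mean: " xbar - tcrit*se " to " xbar + tcrit*se
* Two-tail test of the null hypothesis that the population mean is 6000
ttest price == 6000
```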

Simple linear regression

To fit a simple linear regression model (i.e., find a least squares line), type regress Y X, where Y is the response variable and X is the predictor variable. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type regress Y X, noconstant.
To add a regression line or least squares line to a scatterplot, type graph twoway (scatter Y X) (lfit Y X), where Y is the response variable and X is the predictor variable.
Stata displays 95% confidence intervals for the regression parameters in a simple linear regression model by default. This applies more generally to multiple linear regression also.
- To find a fitted value or predicted value of Y (the response variable) at a particular value of X (the predictor variable), type predict yhat, xb after fitting the model (see help #25). This sets variable yhat equal to the fitted or predicted values of Y at each of the X-values in the dataset.
- You can also obtain a fitted or predicted value of Y at an X-value that is not in the dataset by typing adjust X=a, ci level(95), where a is the particular X-value that we are interested in. In multiple linear regression use, for example, adjust X1=a X2=b, ci level(95).
- This applies more generally to multiple linear regression also.
- To find confidence intervals for the mean of Y at particular values of X, first find the standard errors of estimation by typing predict see, stdp after fitting the model (see help #25). Find the lower limits of the confidence intervals for the mean of Y at each of the X-values in the dataset by typing generate lci = yhat-invttail(df,p)*see, where yhat is the fitted or predicted values of Y (see computer help #28), and invttail(df,p) is the appropriate t-percentile (see computer help #8). Find the upper limits similarly by typing generate uci = yhat+invttail(df,p)*see.
- You can also obtain a confidence interval for the mean of Y at an X-value that is not in the dataset by typing adjust X=a, ci level(95), where a is the particular X-value that we are interested in. In multiple linear regression use, for example, adjust X1=a X2=b, ci level(95).
- This applies more generally to multiple linear regression also.
- To find prediction intervals for an individual value of Y at particular values of X, first find the standard errors of prediction by typing predict sep, stdf after fitting the model (see help #25). Find the lower limits of the prediction intervals for an individual value of Y at each of the X-values in the dataset by typing generate lpi = yhat-invttail(df,p)*sep, where yhat is the fitted or predicted values of Y (see computer help #28), and invttail(df,p) is the appropriate t-percentile (see computer help #8). Find the upper limits similarly by typing generate upi = yhat+invttail(df,p)*sep.
- You can also obtain a prediction interval for an individual value of Y at an X-value that is not in the dataset by typing adjust X=a, stdf ci level(95), where a is the particular X-value that we are interested in. In multiple linear regression use, for example, adjust X1=a X2=b, stdf ci level(95).
- This applies more generally to multiple linear regression also.
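Putting the fitted-value, confidence-interval, and prediction-interval steps above together might look like the following sketch. It assumes the auto example dataset that ships with Stata, with price as the response and weight as the predictor (illustrative choices, not from the book), and takes the degrees of freedom from e(df_r), which regress saves after fitting.

```stata
* Load the auto example dataset (an assumption made for illustration)
sysuse auto, clear
regress price weight
* Fitted values, standard errors of estimation, standard errors of prediction
predict yhat, xb
predict see, stdp
predict sep, stdf
* t-percentile for 95% intervals, using the residual degrees of freedom
scalar tcrit = invttail(e(df_r), 0.025)
* 95% confidence intervals for the mean of Y at each X-value in the dataset
generate lci = yhat - tcrit*see
generate uci = yhat + tcrit*see
* 95% prediction intervals for an individual value of Y at each X-value
generate lpi = yhat - tcrit*sep
generate upi = yhat + tcrit*sep
list weight yhat lci uci lpi upi in 1/5
```

At every X-value the prediction interval is wider than the confidence interval, because the standard error of prediction also includes the variation of an individual observation around its mean.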

Multiple linear regression

To fit a multiple linear regression model, type regress Y X1 X2, where Y is the response variable and X1 and X2 are the predictor variables. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type regress Y X1 X2, noconstant.
To add a quadratic regression line to a scatterplot, type graph twoway (scatter Y X) (qfit Y X), where Y is the response variable and X is the predictor variable.
Categories of a qualitative variable can be thought of as defining subsets of the sample. If there is also a quantitative response and a quantitative predictor variable in the dataset, a regression model can be fit to the data to represent separate regression lines for each subset. For example, suppose X2 contains the values 1-4 to represent four categories, and Y and X1 are two quantitative variables. The following code produces a scatterplot with different symbols representing the values of X2 and four separate regression lines:
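separate Y, by(X2)
graph twoway (scatter Y1 X1) (scatter Y2 X1) (scatter Y3 X1) (scatter Y4 X1) ///
(lfit Y1 X1) (lfit Y2 X1) (lfit Y3 X1) (lfit Y4 X1)
The /// continues the graph command onto the next line in a do-file; when typing interactively, enter the whole graph command on a single line.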
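A smaller, two-category version of this idea can be sketched with the auto example dataset that ships with Stata (an assumption, not from the book). Here foreign takes the values 0 and 1, so separate creates the variables price0 and price1.

```stata
* Load the auto example dataset (an assumption made for illustration)
sysuse auto, clear
* Split price into price0 (foreign==0) and price1 (foreign==1)
separate price, by(foreign)
* One scatterplot and one least squares line per category
* (the three slashes continue the command; run this from a do-file)
graph twoway (scatter price0 weight) (scatter price1 weight) ///
    (lfit price0 weight) (lfit price1 weight)
```

Each new variable is missing outside its own category, so every scatter and lfit layer uses only the cases from one group.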
To find the F-statistic and associated p-value for a nested model F-test in multiple linear regression, first fit the complete model, for example, regress Y X1 X2 X3. Then type, for example, test X2 X3 to test whether the regression parameters for both X2 and X3 are zero. Stata displays the F-statistic and the associated p-value (labeled Prob > F).
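A minimal sketch of a nested model F-test, assuming the auto example dataset that ships with Stata, with price as the response and weight, length, and mpg as predictors (illustrative choices, not from the book):

```stata
* Load the auto example dataset (an assumption made for illustration)
sysuse auto, clear
* Complete model
regress price weight length mpg
* Nested model F-test: are the parameters for length and mpg both zero?
test length mpg
* The statistic and p-value are also saved for later use
display "F = " r(F) ", p-value = " r(p)
```

The numerator degrees of freedom equal the number of parameters being tested (two here), and the denominator degrees of freedom are the residual degrees of freedom of the complete model.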
To save residuals in a multiple linear regression model, type predict res, residuals, after fitting the model (see help #31). The variable res can now be used just like any other variable, for example, to construct residual plots. To save what Pardoe (2012) calls standardized residuals, use rstandard in place of residuals. To save what Pardoe (2012) calls studentized residuals, use rstudent.
To add a lowess fitted line to a scatterplot (useful for checking the zero mean regression assumption in a residual plot), type, for example, graph twoway (lowess stures yhat, bwidth(.75)) (scatter stures yhat), where stures are studentized residuals (see help #35), yhat are fitted or predicted values (see help #28), and bwidth controls how wiggly the line is (lower means more wiggly).
To save leverages in a multiple linear regression model, type predict lev, leverage, after fitting the model (see help #31). The variable lev can now be used just like any other variable, for example, to construct scatterplots.
To save Cook's distances in a multiple linear regression model, type predict cook, cooksd, after fitting the model (see help #31). The variable cook can now be used just like any other variable, for example, to construct scatterplots.
To create some residual plots automatically in a multiple linear regression model, type rvfplot after fitting the model (see help #31), which produces a plot of residuals versus fitted values or type lvr2plot, which produces a plot of leverage versus "normalized squared residuals." To create residual plots manually, first create studentized residuals (see help #35), and then construct scatterplots with these studentized residuals on the vertical axis.
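The residual, leverage, and Cook's distance items above can be combined after a single model fit, as in the sketch below. It assumes the auto example dataset that ships with Stata and an illustrative model of price on weight, length, and mpg; neither is from the book.

```stata
* Load the auto example dataset (an assumption made for illustration)
sysuse auto, clear
regress price weight length mpg
* Save fitted values and the diagnostic quantities
predict yhat, xb
predict res, residuals
predict stures, rstudent
predict lev, leverage
predict cook, cooksd
summarize res stures lev cook
* Automatic plot of residuals versus fitted values
rvfplot
* Manual residual plot with a lowess line through the studentized residuals
graph twoway (lowess stures yhat, bwidth(.75)) (scatter stures yhat)
```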
To create a correlation matrix of quantitative variables (useful for checking potential multicollinearity problems), type correlate Y X1 X2, where Y, X1, and X2 are quantitative variables.
To find variance inflation factors in a multiple linear regression model, type vif after fitting the model (see help #31).
To draw a predictor effect plot for graphically displaying the effects of transformed quantitative predictors and/or interactions between quantitative and qualitative predictors in multiple linear regression, first create a variable representing the effect, say, "X1effect" (see help #6).
- If the "X1effect" variable just involves X1 (e.g., 1 + 3X1 + 4X1²), type graph twoway connected X1effect X1, msymbol(none) sort.
- If the "X1effect" variable involves a qualitative variable (e.g., 1 − 2X1 + 3D2X1, where D2 is an indicator variable), type graph twoway connected X1effect X1, by(D2) msymbol(none) sort.
See Section 5.5 in Pardoe (2012) for an example.
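As a rough sketch of the first case, the following creates a small artificial dataset and plots the example effect 1 + 3X1 + 4X1². The 50 evenly spaced X1 values are made up for illustration; in practice X1 would come from your data and the coefficients from your fitted model.

```stata
* Artificial X1 values from 0.1 to 5 (hypothetical, for illustration only)
clear
set obs 50
generate X1 = _n/10
* Predictor effect using the example coefficients from the text
generate X1effect = 1 + 3*X1 + 4*X1^2
* Connected line plot with no point markers, sorted by X1
graph twoway connected X1effect X1, msymbol(none) sort
```

The sort option orders the points by X1 before they are connected, so the line does not double back on itself when the data are not already in order.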