These instructions were kindly prepared by Tom Kari to accompany Applied Regression Modeling by Iain Pardoe, 2nd edition published by Wiley in 2012. The numbered items cross-reference with the "computer help" references in the book. These instructions are based on the programming interface (command line) of SAS 9 for Windows, but they should also work for other versions of SAS. Find instructions for other statistical software packages here.
These instructions use data from the SASHELP library, which is automatically available for examples and for testing code. You can just drop the code from an instruction into a SAS code window, and it will execute.
Getting started and summarizing univariate data
 Change SAS's default options by selecting Tools > Options.
 There are different ways to import data. For example, use File > Import Data to open text data files or Excel spreadsheets (when you are prompted to "Choose the SAS destination," type a name for the dataset into the "Member" box). Once you have successfully imported some data, you conduct analyses by writing lines of code in an "Editor" window and then submitting the code by selecting Run > Submit (or simply clicking the "running person" Submit button). You can save the code you write into a text ".sas" file (recommended).
 To recall a previously entered command, use the icon.
 Output appears in the Result Viewer window and can be copied and pasted from SAS to a word processor like OpenOffice Writer or Microsoft Word. Graphs also appear in the Result Viewer window. To get a copy to use elsewhere, right-click on the graph, select Save picture as..., and select a location and format.
 You can access help by selecting Help > SAS Help and Documentation.
 To transform data or compute a new variable, type, for example,
data mydata;
set sashelp.class;
logWeight = log(Weight);
WeightSq = Weight**2;
run;
for the natural logarithm of Weight and Weight^{2}, respectively. If you get a "syntax error" message, there is a syntax error in your expression; a common mistake is to forget the multiplication symbol (*) between a number and a variable (e.g., 2*X represents 2X).
 To create indicator (dummy) variables from a qualitative variable, type, for example,
data mydata;
set sashelp.class;
if Sex='M' then D1=1;
else D1=0;
run;
where Sex is the qualitative variable and "M" is the name of one of the categories in Sex. Repeat for other indicator variables (if necessary).
 To find a percentile (critical value) for a t-distribution, type, for example,
data mydata;
cvt = quantile('t', p, df);
run;
where p is the lower-tail area (i.e., one minus the one-tail significance level) and df is the degrees of freedom. When you run the program, the result will be in variable cvt in the output dataset. For example, quantile('t', .95, 29) returns the 95th percentile of the t-distribution with 29 degrees of freedom (1.699), which is the critical value for an upper-tail test with a 5% significance level. By contrast, quantile('t', .975, 29) returns the 97.5th percentile of the t-distribution with 29 degrees of freedom (2.045), which is the critical value for a two-tail test with a 5% significance level.
 To find a percentile (critical value) for an F-distribution, type, for example,
data mydata;
cvt = quantile('f', p, df1, df2);
run;
where p is the lower-tail area (i.e., one minus the significance level), df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. When you run the program, the result will be in variable cvt in the output dataset. For example, quantile('f', .95, 2, 3) returns the 95th percentile of the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (9.552).
 To find a percentile (critical value) for a chi-squared distribution, type, for example,
data mydata;
cvt = quantile('chisq', p, df);
run;
where p is the lower-tail area (i.e., one minus the significance level) and df is the degrees of freedom. When you run the program, the result will be in variable cvt in the output dataset. For example, quantile('chisq', 0.95, 2) returns the 95th percentile of the chi-squared distribution with 2 degrees of freedom (5.991).
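The three percentile examples above can be run together in one step. This is a minimal sketch rather than part of the original instructions; the dataset name critvals and the variable names are illustrative, and the printed values should agree with the figures quoted above after rounding.

```sas
/* Critical values for the worked examples above, in a one-row dataset */
data critvals;
  t_upper  = quantile('t', 0.95, 29);     /* upper-tail t-test, 5% level */
  t_two    = quantile('t', 0.975, 29);    /* two-tail t-test, 5% level   */
  f_crit   = quantile('f', 0.95, 2, 3);   /* F-test, 5% level            */
  chi_crit = quantile('chisq', 0.95, 2);  /* chi-squared test, 5% level  */
run;

/* Display the results */
proc print data=critvals;
run;
```

Because the DATA step has no set or input statement, it runs once and writes a single observation, which proc print then displays.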

 To find an upper-tail area (one-tail p-value) for a t-distribution, type, for example,
data mydata;
pt = 1 - probt(t, df);
run;
where t is the value of the t-statistic and df is the degrees of freedom. When you run the program, the result will be in variable pt in the output dataset. For example, pt = 1 - probt(2.40, 29); returns the upper-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.012), which is the p-value for an upper-tail test. By contrast, pt = 2 * (1 - probt(2.40, 29)); returns the two-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.023), which is the p-value for a two-tail test.
 To find an upper-tail area (p-value) for an F-distribution, type, for example,
data mydata;
pf = 1 - probf(f, df1, df2);
run;
where f is the value of the F-statistic, df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. When you run the program, the result will be in variable pf in the output dataset. For example, pf = 1 - probf(51.4, 2, 3); returns the upper-tail area (p-value) for an F-statistic of 51.4 for the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (0.005).
 To find an upper-tail area (p-value) for a chi-squared distribution, type, for example,
data mydata;
pchisq = 1 - probchi(chisq, df);
run;
where chisq is the value of the chi-squared statistic and df is the degrees of freedom. When you run the program, the result will be in variable pchisq in the output dataset. For example, pchisq = 1 - probchi(0.38, 2); returns the upper-tail area (p-value) for a chi-squared statistic of 0.38 for the chi-squared distribution with 2 degrees of freedom (0.827).
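The tail-area examples above can likewise be collected into one runnable step. This is a sketch under the same assumptions as the examples in the text; the dataset name pvals and the variable names are illustrative.

```sas
/* p-values for the worked examples above, in a one-row dataset */
data pvals;
  p_t_one = 1 - probt(2.40, 29);        /* upper-tail p-value, t-statistic 2.40 */
  p_t_two = 2 * (1 - probt(2.40, 29));  /* two-tail p-value, same statistic     */
  p_f     = 1 - probf(51.4, 2, 3);      /* F-statistic 51.4 with 2 and 3 df     */
  p_chi   = 1 - probchi(0.38, 2);       /* chi-squared statistic 0.38, 2 df     */
run;

/* Display the results */
proc print data=pvals;
run;
```

Each function returns the lower-tail area, so subtracting from one gives the upper-tail area in every case.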
 Calculate descriptive statistics for quantitative variables by typing
proc univariate data=sashelp.class;
var Height;
run;
where Height is the quantitative variable. Specify an output statement to calculate other statistics beyond those calculated by default (see SAS Help for specific details on how to do this).
 Create contingency tables or cross-tabulations for qualitative variables by typing
proc freq data=sashelp.bweight;
tables Married*Smoke;
run;
where Married and Smoke are the qualitative variables.
 If you have a quantitative variable and a qualitative variable, you can calculate descriptive statistics for cases grouped in different categories by first sorting the data, and then using the univariate procedure:
proc sort data=sashelp.shoes out=mydata;
by Region;
run;
proc univariate data=mydata;
var Sales;
by Region;
run;
where Sales is the quantitative variable and Region is the qualitative variable. Specify an output statement to calculate other statistics beyond those calculated by default (see SAS Help for specific details on how to do this).
 To make a stem-and-leaf plot for a quantitative variable, type
ods graphics off;
proc univariate data=sashelp.class plot;
var Age;
run;
where Age is the quantitative variable (for large sample sizes, SAS will create a horizontal bar chart instead of a stem-and-leaf plot). If you don't use the ODS statement, you'll always get a bar chart.
 To make a histogram for a quantitative variable, type
proc sgplot data=sashelp.shoes;
histogram Stores / binstart = 1 binwidth=3 nbins=15;
run;
where Stores is the quantitative variable and the bin options specify how to construct the breakpoints.
 To make a scatterplot with two quantitative variables, type
proc sgplot data=sashelp.shoes;
scatter x=Sales y=Returns;
run;
where Returns is the vertical axis variable and Sales is the horizontal axis variable.
 All possible scatterplots for more than two variables can be drawn simultaneously (called a scatterplot matrix) by typing
proc sgscatter data=sashelp.shoes;
matrix Sales Inventory Returns;
run;
where Sales, Inventory, and Returns are quantitative variables.
 You can mark or label cases in a scatterplot with different colors/symbols according to categories in a qualitative variable. First set up the code for the scatterplot using computer help #15. Suppose Product contains values to represent a number of categories. Then the following code produces a scatterplot with different markers for the different values of Product:
proc sgplot data=sashelp.shoes;
scatter x=Sales y=Returns / group=Product;
run;
To change the colors/symbols used add an "ods graphics" statement and a "styleattrs" statement as below:
ods graphics / attrpriority=none;
proc sgplot data=sashelp.shoes;
styleattrs datacontrastcolors=(red red red blue blue blue green green green) datasymbols=(circle star square);
scatter x=Sales y=Returns / group=Product;
run;
 You can identify individual cases in a scatterplot by typing
ods graphics / imagemap=on;
proc sgplot data=sashelp.class;
scatter x=Height y=Weight / tip=(Name);
run;
where Weight is the vertical axis variable, Height is the horizontal axis variable, and Name is the variable that identifies the case. Then hover over an individual case to identify it.
 To remove one or more observations from a dataset, determine the value(s) with respect to a particular variable and add a where clause to the appropriate proc code. For example, to remove points from a scatterplot:
proc sgplot data=sashelp.shoes;
scatter x=Sales y=Returns;
where Sales <= 500000;
run;
 To make a bar chart for cases in different categories, use the sgplot procedure.
 For frequency bar charts of one qualitative variable, type
proc sgplot data=sashelp.shoes;
vbar Region;
run;
where Region is a qualitative variable.
 For frequency bar charts of two qualitative variables, type
proc sgplot data=sashelp.shoes;
vbar Region / group=Product groupdisplay=cluster;
run;
where Region and Product are qualitative variables.
 The bars can also represent various summary functions for a quantitative variable. For example, to produce a bar chart of means, type
proc sgplot data=sashelp.shoes;
vbar Region / response=Sales stat=mean;
run;
or:
proc sgplot data=sashelp.shoes;
vbar Region / group=Product groupdisplay=cluster response=Sales stat=mean;
run;
where Region and Product are the qualitative variables and Sales is a quantitative variable.
 There are two good options to create boxplots for cases in different categories:
Method 1: Using PROC SGPLOT
 For just one qualitative variable, type
proc sgplot data=sashelp.shoes;
vbox Sales / category=Region;
run;
where Sales is a quantitative variable and Region is the qualitative variable.
 For two qualitative variables, type
proc sgplot data=sashelp.shoes;
vbox Sales / category=Region group=Product;
run;
where Sales is a quantitative variable, Region is the major qualitative variable, and Product is the minor qualitative variable.
Method 2: Using PROC SGPANEL
 For just one qualitative variable, type
proc sgpanel data=sashelp.shoes;
panelby Region;
vbox Sales;
run;
where Sales is a quantitative variable and Region is the qualitative variable.
 For two qualitative variables, type
proc sgpanel data=sashelp.shoes;
panelby Region Product;
vbox Sales;
run;
where Sales is a quantitative variable, Region is the major qualitative variable, and Product is the minor qualitative variable.
Which approach you will use depends on which graphical presentation you prefer.
 To make a QQ-plot (also known as a normal probability plot) for a quantitative variable, type
proc univariate data=sashelp.class noprint;
qqplot Weight / normal(mu=est sigma=est);
run;
where Weight is a quantitative variable.
 To compute a confidence interval for a univariate population mean, type
proc univariate data=sashelp.class cibasic(alpha=0.05);
var Weight;
run;
where Weight is the variable for which you want to calculate the confidence interval, and alpha is one minus the confidence level of the interval (so alpha=0.05 gives a 95% confidence interval).
 To do a hypothesis test for a univariate population mean, type
proc univariate data=sashelp.class mu0=90;
var Weight;
run;
where Weight is the variable for which you want to do the test and mu0 is the (null) hypothesized value.
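To see where the test statistic and confidence interval come from, the calculation can be reproduced by hand from summary statistics. The following is a sketch, not part of the original instructions: it uses proc means to save the sample size, mean, and standard deviation, then applies the probt and quantile functions described earlier. The dataset names (stats, ttest) and variable names are illustrative, and the hypothesized value 90 and the 95% level are taken from the examples above.

```sas
/* Save the sample size, mean, and standard deviation of Weight */
proc means data=sashelp.class noprint;
  var Weight;
  output out=stats n=n mean=xbar std=s;
run;

/* One-sample t-test and confidence interval computed by hand */
data ttest;
  set stats;
  mu0   = 90;                              /* hypothesized mean, as in mu0=90 above  */
  df    = n - 1;                           /* degrees of freedom                     */
  se    = s / sqrt(n);                     /* standard error of the sample mean      */
  t     = (xbar - mu0) / se;               /* t-statistic                            */
  p_two = 2 * (1 - probt(abs(t), df));     /* two-tail p-value                       */
  tcrit = quantile('t', 0.975, df);        /* critical value for a 95% interval      */
  lower = xbar - tcrit * se;               /* lower confidence limit                 */
  upper = xbar + tcrit * se;               /* upper confidence limit                 */
run;

/* Display the results */
proc print data=ttest;
  var n xbar s t df p_two lower upper;
run;
```

The t-statistic, p-value, and interval limits should agree with the corresponding proc univariate output, up to rounding.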
Simple linear regression

 To fit a simple linear regression model (i.e., find a least squares line), type
proc reg data=sashelp.class;
model Weight=Height;
run;
where Weight is the response variable and Height is the predictor variable. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type
proc reg data=sashelp.class;
model Weight=Height / noint;
run;
 To add a regression line or least squares line to a scatterplot, type
proc sgplot data=sashelp.shoes;
scatter x=Sales y=Returns;
reg x=Sales y=Returns;
run;
where Returns is the vertical axis variable and Sales is the horizontal axis variable.
 To find confidence intervals for the regression parameters in a simple linear regression model, type
proc reg data=sashelp.class alpha=0.05;
model Weight=Height / clb;
run;
where Weight is the response variable, Height is the predictor variable, and alpha is one minus the confidence level of the intervals (so alpha=0.05 gives 95% confidence intervals).
This applies more generally to multiple linear regression also.

 To find a fitted value or predicted value of Y (the response variable) at a particular value of X (the predictor variable), type
proc reg data=sashelp.class alpha=0.05;
model Weight=Height;
output out=PredictedValues predicted=PredictedWeight;
run;
where Weight is the response variable and Height is the predictor variable (alpha sets the level of any confidence limits you request and does not change the predicted values). A SAS dataset named PredictedValues will be created, with the original variables and values from the input dataset, and a new variable named PredictedWeight that is the predicted value of Weight. Note that you can use any names you wish for PredictedValues and PredictedWeight.
 You can also obtain a fitted or predicted value of Y at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset using code such as:
data mydata;
input Height;
datalines;
75
;
run;
data mydata2;
set sashelp.class mydata;
run;
Then fit the regression model using:
proc reg data=mydata2 alpha=0.05;
model Weight=Height;
output out=PredictedValues predicted=PredictedWeight;
run;
SAS will ignore the X-value you added when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But SAS will calculate a fitted or predicted value of Y at this new X-value based on the results of the regression.
This applies more generally to multiple linear regression also.

 To find a confidence interval for the mean of Y at a particular value of X, type
proc reg data=sashelp.class alpha=0.05;
model Weight=Height / clm;
run;
where Weight is the response variable, Height is the predictor variable, and alpha is one minus the confidence level of the interval (so alpha=0.05 gives a 95% confidence interval). The confidence intervals for the mean of Y at each of the X-values in the dataset are displayed as two columns headed CL Mean.
 You can also obtain a confidence interval for the mean of Y at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset using code such as:
data mydata;
input Height;
datalines;
75
;
run;
data mydata2;
set sashelp.class mydata;
run;
Then fit the regression model using:
proc reg data=mydata2 alpha=0.05;
model Weight=Height / clm;
run;
SAS will ignore the X-value you added when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But SAS will calculate a confidence interval for the mean of Y at this new X-value based on the results of the regression.
This applies more generally to multiple linear regression also.

 To find a prediction interval for an individual value of Y at a particular value of X, type
proc reg data=sashelp.class alpha=0.05;
model Weight=Height / cli;
run;
where Weight is the response variable, Height is the predictor variable, and alpha is one minus the confidence level of the interval (so alpha=0.05 gives a 95% prediction interval). The prediction intervals for an individual value of Y at each of the X-values in the dataset are displayed as two columns headed CL Predict.
 You can also obtain a prediction interval for an individual value of Y at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset using code such as:
data mydata;
input Height;
datalines;
75
;
run;
data mydata2;
set sashelp.class mydata;
run;
Then fit the regression model using:
proc reg data=mydata2 alpha=0.05;
model Weight=Height / cli;
run;
SAS will ignore the X-value you added when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But SAS will calculate a prediction interval for an individual value of Y at this new X-value based on the results of the regression.
This applies more generally to multiple linear regression also.
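The fitted value, the confidence interval for the mean, and the prediction interval at a new X-value can also be saved to one dataset in a single run. The following is a minimal sketch, not part of the original instructions: it relies on the predicted=, lclm=, uclm=, lcl=, and ucl= keywords of the proc reg output statement, and the dataset and variable names (newx, combined, Intervals, Fit, and so on) are illustrative.

```sas
/* New X-value with no corresponding Y-value */
data newx;
  input Height;
  datalines;
75
;
run;

/* Append the new case to the original data */
data combined;
  set sashelp.class newx;
run;

/* Fit the model and save the fitted values with 95% interval limits */
proc reg data=combined alpha=0.05;
  model Weight=Height;
  output out=Intervals predicted=Fit
         lclm=MeanLower uclm=MeanUpper   /* confidence interval for the mean of Y   */
         lcl=PredLower  ucl=PredUpper;   /* prediction interval for an individual Y */
run;
quit;

/* Show only the added case, whose Weight is missing */
proc print data=Intervals;
  where Weight = .;
  var Height Fit MeanLower MeanUpper PredLower PredUpper;
run;
```

Filtering on the missing response is one simple way to pick out the added case; any other identifying variable would work equally well.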
Multiple linear regression
 To fit a multiple linear regression model, type
proc reg data=sashelp.class;
model Weight=Age Height;
run;
where Weight is the response variable and Age and Height are the predictor variables. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type
proc reg data=sashelp.class;
model Weight=Age Height / noint;
run;
 To add a quadratic regression line to a scatterplot, type
proc sgplot data=sashelp.shoes;
scatter x=Sales y=Returns;
reg x=Sales y=Returns / degree=2;
run;
where Returns is the vertical axis variable and Sales is the horizontal axis variable.
 Categories of a qualitative variable can be thought of as defining subsets of the sample. If there is also a quantitative response and a quantitative predictor variable in the dataset, a regression model can be fit to the data to represent separate regression lines for each subset. For example, suppose that Type is a qualitative variable with six categories, and Horsepower and EngineSize are two quantitative variables. Then the following code produces a scatterplot with different colors (representing the value of Type) marking the points, and six separate regression lines:
proc sgplot data=sashelp.cars;
scatter x=EngineSize y=Horsepower / group=Type;
reg x=EngineSize y=Horsepower / group=Type;
run;
 To find the F-statistic and associated p-value for a nested model F-test in multiple linear regression, submit code such as the following:
proc reg data=sashelp.cars;
model MPG_Highway={EngineSize Horsepower} {MSRP Invoice} / selection=forward
groupnames='EngineSize Horsepower' 'MSRP Invoice'
slentry=0.99;
run;
Here, EngineSize and Horsepower are in the reduced model, while EngineSize, Horsepower, MSRP, and Invoice are in the complete model. The F-statistic is in the second row of the "Summary of Forward Selection" table in the column headed F Value, while the associated p-value is in the column headed Pr > F.
 To save residuals in a multiple linear regression model, type
proc reg data=sashelp.class;
model Weight=Age Height;
output out=work.res1 residual=resid;
run;
This code saves the residuals to the SAS dataset res1 (in the work library) as variable resid, where they can now be used just like any other variable, for example, to construct residual plots (note that all the variables in the original dataset are included in the new dataset). To save what Pardoe (2012) calls standardized residuals, use keyword student in place of residual. To save what Pardoe (2012) calls studentized residuals, use keyword rstudent.
 To add a loess fitted line to a scatterplot, type
proc sgplot data=sashelp.shoes;
scatter x=Sales y=Returns;
loess x=Sales y=Returns;
run;
where Returns is the vertical axis variable and Sales is the horizontal axis variable. If you wish to adjust the smoothness of the line, see SAS Help for information about the smooth option of the loess statement.
 To save leverages in a multiple linear regression model, type
proc reg data=sashelp.class;
model Weight=Age Height;
output out=work.res1 h=lev;
run;
This code saves the leverages to the SAS dataset res1 (in the work library) as variable lev, where they can now be used just like any other variable, for example, to construct scatterplots (note that all the variables in the original dataset are included in the new dataset).
 To save Cook's distances in a multiple linear regression model, type
proc reg data=sashelp.class;
model Weight=Age Height;
output out=work.res1 cookd=cooksd;
run;
This code saves the Cook's distances to the SAS dataset res1 (in the work library) as variable cooksd, where they can now be used just like any other variable, for example, to construct scatterplots (note that all the variables in the original dataset are included in the new dataset).
 To create some residual plots automatically in a multiple linear regression model, type
proc reg data=sashelp.class;
model Weight=Age Height;
plot rstudent.*predicted. rstudent.*nqq. rstudent.*cookd.;
run;
This produces a plot of studentized residuals versus fitted values, a QQ-plot of the studentized residuals, and a plot of studentized residuals versus Cook's distances. To create residual plots manually, first create studentized residuals (see help #35), and then construct scatterplots with these studentized residuals on the vertical axis.
 To create a correlation matrix of quantitative variables (useful for checking potential multicollinearity problems), type
proc corr data=sashelp.cars;
var EngineSize Horsepower MPG_City;
run;
where EngineSize, Horsepower, and MPG_City are quantitative variables.
 To find variance inflation factors in multiple linear regression, type
proc reg data=sashelp.class;
model Weight=Age Height / vif;
run;
where Weight is the response variable and Age and Height are the predictor variables.
 To draw a predictor effect plot for graphically displaying the effects of transformed quantitative predictors and/or interactions between quantitative and qualitative predictors in multiple linear regression, first create an "X1 effect" variable representing the effect (see help #6). As an example, we'll create WeightEffect1 and WeightEffect2 using the following, as well as an indicator variable for a qualitative variable (see help #7):
data mydata;
set sashelp.class;
if Sex='M' then D1=1;
else D1=0;
WeightEffect1 = 1 + (3 * Weight) + (4 * (Weight **2));
WeightEffect2 = 1 - (2 * Weight) + (3 * D1 * Weight);
run;
Next, sort the dataset by the values of the X variable (if necessary):
proc sort data=mydata out=mydata2;
by Weight;
run;
 If the "X1 effect" variable just involves the X variable (e.g., 1 + 3X + 4X^{2}), type
proc sgplot data=mydata2;
series x=Weight y=WeightEffect1;
run;
 If the "X1 effect" variable involves a qualitative variable (e.g., 1 − 2X + 3D1X, where D1 is an indicator variable), type
proc sgplot data=mydata2;
series x=Weight y=WeightEffect2 / group=Sex;
run;
where Sex is the qualitative variable.
See Section 5.5 in Pardoe (2012) for an example.
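The manual route to a residual plot mentioned earlier in this section (save the studentized residuals, then draw a scatterplot) can be sketched as follows. This is an illustration rather than part of the original instructions: it saves fitted values and studentized residuals with the predicted and rstudent keywords of the output statement, and the variable names fitted and rstud are illustrative.

```sas
/* Fit the model and save fitted values and studentized residuals */
proc reg data=sashelp.class noprint;
  model Weight=Age Height;
  output out=work.res1 predicted=fitted rstudent=rstud;
run;
quit;

/* Studentized residuals on the vertical axis, fitted values on the horizontal */
proc sgplot data=work.res1;
  scatter x=fitted y=rstud;
  refline 0 / axis=y;   /* horizontal reference line at zero */
run;
```

Replacing x=fitted with a predictor such as x=Height gives a residual plot against that predictor instead.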