
SAS Code

These instructions were kindly prepared by Tom Kari to accompany Applied Regression Modeling by Iain Pardoe, 2nd edition published by Wiley in 2012. The numbered items cross-reference with the "computer help" references in the book. These instructions are based on the programming interface (command line) of SAS 9 for Windows, but they should also work for other versions of SAS. Find instructions for other statistical software packages here.

These instructions use data from the SASHELP library, which is automatically available for examples and for testing code. You can just drop the code from an instruction into a SAS code window, and it will execute.

Getting started and summarizing univariate data

  1. Change SAS's default options by selecting Tools > Options.
  2. There are different ways to import data. For example, use File > Import Data to open text data files or Excel spreadsheets (when you are prompted to "Choose the SAS destination," type a name for the dataset into the "Member" box). Once you have successfully imported some data, you conduct analyses by writing lines of code in an "Editor" window and then submitting the code by selecting Run > Submit (or simply clicking the "running person" Submit button). You can save the code you write into a text ".sas" file (recommended).
  3. To recall a previously entered command, use the SAS recall icon.
  4. Output appears in the Result Viewer window and can be copied and
    pasted from SAS to a word processor like OpenOffice Writer or Microsoft Word. Graphs also appear in the Result Viewer window. To get a copy to use elsewhere, right-click on the graph, select Save picture as..., and select a location and format.
  5. You can access help by selecting Help > SAS Help and Documentation.
  6. To transform data or compute a new variable, type, for example,
    data mydata;
    set sashelp.class;
    logWeight = log(Weight);
    WeightSq = Weight**2;
    run;

    for the natural logarithm of Weight and the square of Weight, respectively. If you get a "syntax error" message, this means there is a syntax error in your expression; a common mistake is to forget the multiplication symbol (*) between a number and a variable (e.g., 2*X represents 2X).
  7. To create indicator (dummy) variables from a qualitative variable, type,
    for example,
    data mydata;
    set sashelp.class;
    if Sex='M' then D1=1;
    else D1=0;
    run;

    where Sex is the qualitative variable and "M" is the name of one of the
    categories in Sex. Repeat for other indicator variables (if necessary).
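When the qualitative variable has more than two categories, the same idea extends to one indicator per non-reference category. The following is a minimal sketch, assuming the Origin variable in sashelp.cars (with values Asia, Europe, and USA) and treating USA as the reference category. The names mycars, D_Asia, and D_Europe are illustrative only.

```sas
data mycars;
  set sashelp.cars;
  /* a comparison evaluates to 1 when true and 0 when false */
  D_Asia   = (Origin = 'Asia');
  D_Europe = (Origin = 'Europe');
run;

/* check the coding against the original variable */
proc freq data=mycars;
  tables Origin*D_Asia*D_Europe / list;
run;
```

The proc freq step is only a check that each category maps to the intended pattern of zeros and ones.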
    • To find a percentile (critical value) for a t-distribution, type, for example,
      data mydata;
      cvt = quantile('t', p, df);
      run;

      where p is the lower-tail area (i.e., one minus the one-tail significance level) and df is the degrees of freedom. When you run the program, the result will be in variable cvt in the output dataset. For example, quantile('t', .95, 29) returns the 95th percentile of the t-distribution with 29 degrees of freedom (1.699), which is the critical value for an upper-tail test with a 5% significance level. By contrast, quantile('t', .975, 29) returns the 97.5th percentile of the t-distribution with 29 degrees of freedom (2.045), which is the critical value for a two-tail test with a 5% significance level.
    • To find a percentile (critical value) for an F-distribution, type, for example,
      data mydata;
      cvt = quantile('f', p, df1, df2);
      run;

      where p is the lower-tail area (i.e., one minus the significance level), df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. When you run the program, the result will be in variable cvt in the output dataset. For example, quantile('f', .95, 2, 3) returns the 95th percentile of the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (9.552).
    • To find a percentile (critical value) for a chi-squared distribution, type, for example,
      data mydata;
      cvt = quantile('chisq', p, df);
      run;

      where p is the lower-tail area (i.e., one minus the significance level) and df is the degrees of freedom. When you run the program, the result will be in variable cvt in the output dataset. For example, quantile('chisq', 0.95, 2) returns the 95th percentile of the chi-squared distribution with 2 degrees of freedom (5.991).
    • To find an upper-tail area (one-tail p-value) for a t-distribution, type, for example,
      data mydata;
      pt = 1 - probt(t, df);
      run;

      where t is the value of the t-statistic and df is the degrees of freedom.
      When you run the program, the result will be in variable pt in the output dataset. For example, pt = 1 - probt(2.40, 29); returns the upper-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.012), which is the p-value for an upper-tail test. By contrast, pt = 2 * (1 - probt(2.40, 29)); returns the two-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.023), which is the p-value for a two-tail test.
    • To find an upper-tail area (p-value) for an F-distribution, type, for example,
      data mydata;
      pf = 1 - probf(f, df1, df2);
      run;

      where f is the value of the F-statistic, df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. When you run the program, the result will be in variable pf in the output dataset. For example, pf = 1 - probf(51.4, 2, 3); returns the upper-tail area (p-value) for an F-statistic of 51.4 for the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (0.005).
    • To find an upper-tail area (p-value) for a chi-squared distribution, type, for example,
      data mydata;
      pchisq = 1 - probchi(chisq, df);
      run;

      where chisq is the value of the chi-squared statistic and df is the degrees of freedom. When you run the program, the result will be in variable pchisq in the output dataset. For example, pchisq = 1 - probchi(0.38, 2); returns the upper-tail area (p-value) for a chi-squared statistic of 0.38 for the chi-squared distribution with 2 degrees of freedom (0.827).
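The templates above use placeholders (p, df, t, and so on) that must be replaced with numbers before the code returns a value. As a runnable sketch, the numerical examples quoted in the text can be collected into one data step that writes the results to the SAS log instead of a dataset. The variable names are arbitrary, and the 8.3 format is just one choice for rounding to three decimal places.

```sas
data _null_;
  /* percentiles (critical values) */
  t_upper  = quantile('t', 0.95, 29);      /* upper-tail test, 5% level */
  t_two    = quantile('t', 0.975, 29);     /* two-tail test, 5% level   */
  f_crit   = quantile('f', 0.95, 2, 3);
  chi_crit = quantile('chisq', 0.95, 2);

  /* upper-tail areas (p-values) */
  p_t_upper = 1 - probt(2.40, 29);
  p_t_two   = 2 * (1 - probt(2.40, 29));
  p_f       = 1 - probf(51.4, 2, 3);
  p_chi     = 1 - probchi(0.38, 2);

  /* write name=value pairs to the log */
  put t_upper= 8.3 t_two= 8.3 f_crit= 8.3 chi_crit= 8.3;
  put p_t_upper= 8.3 p_t_two= 8.3 p_f= 8.3 p_chi= 8.3;
run;
```

The values in the log can be compared with the ones quoted in the text for each example.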
  8. Calculate descriptive statistics for quantitative variables by typing
    proc univariate data=sashelp.class;
    var Height;
    run;

    where Height is the quantitative variable. Specify an output statement to calculate other statistics beyond those calculated by default (see SAS Help for specific details on how to do this).
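As a sketch of what such an output statement might look like, the code below saves a few selected statistics for Height to a dataset and prints them. The dataset name HeightStats, the variable names, and the choice of the 5th and 95th percentiles are assumptions made for illustration.

```sas
proc univariate data=sashelp.class noprint;
  var Height;
  /* keyword=name pairs choose the statistics and name the output variables */
  output out=HeightStats mean=MeanHeight std=StdHeight median=MedianHeight
         pctlpts=5 95 pctlpre=P;   /* percentiles are saved as P5 and P95 */
run;

proc print data=HeightStats;
run;
```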
  9. Create contingency tables or cross-tabulations for qualitative
    variables by typing
    proc freq data=sashelp.bweight;
    tables Married*Smoke;
    run;

    where Married and Smoke are the qualitative variables.
  10. If you have a quantitative variable and a qualitative variable, you can calculate
    descriptive statistics for cases grouped in different categories by first sorting the data, and then using the univariate procedure:
    proc sort data=sashelp.shoes out=mydata;
    by Region;
    run;
    proc univariate data=mydata;
    var Sales;
    by Region;
    run;

    where Sales is the quantitative variable and Region is the qualitative variable. Specify an output statement to calculate other statistics beyond those calculated by default (see SAS Help for specific details on how to do this).
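As an alternative to sorting first, a grouped summary can be sketched with the means procedure and a class statement, which does not require sorted data. This swaps in PROC MEANS for PROC UNIVARIATE and is an illustration rather than part of the original instructions; the statistics requested are an arbitrary choice.

```sas
proc means data=sashelp.shoes n mean std min max;
  class Region;   /* one row of statistics per Region; no sorting needed */
  var Sales;
run;
```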
  11. To make a stem-and-leaf plot for a quantitative variable, type
    ods graphics off;
    proc univariate data=sashelp.class plot;
    var Age;
    run;

    where Age is the quantitative variable (for large sample sizes, SAS will create a horizontal bar chart instead of a stem-and-leaf plot). If you don't use the ODS statement, you'll always get a bar chart.
  12. To make a histogram for a quantitative variable, type
    proc sgplot data=sashelp.shoes;
    histogram Stores / binstart = 1 binwidth=3 nbins=15;
    run;

    where Stores is the quantitative variable and the bin options specify how to construct the breakpoints.
  13. To make a scatterplot with two quantitative variables, type
    proc sgplot data=sashelp.shoes;
    scatter x=Sales y=Returns;
    run;

    where Returns is the vertical axis variable and Sales is the horizontal axis variable.
  14. All possible scatterplots for more than two variables can be drawn simultaneously (called a scatterplot matrix) by typing
    proc sgscatter data=sashelp.shoes;
    matrix Sales Inventory Returns;
    run;

    where Sales, Inventory, and Returns are quantitative variables.
  15. You can mark or label cases in a scatterplot with different colors/symbols according to categories in a qualitative variable. First set up the code for the scatterplot (computer help #15, the instructions above for a scatterplot with two quantitative variables). Suppose Product contains values to represent a number of categories. Then the following code produces a scatterplot with different markers for the different values of Product:
    proc sgplot data=sashelp.shoes;
    scatter x=Sales y=Returns / group=Product;
    run;

    To change the colors/symbols used add an "ods graphics" statement and a "styleattrs" statement as below:
    ods graphics / attrpriority=none;
    proc sgplot data=sashelp.shoes;
    styleattrs datacontrastcolors=(red red red blue blue blue green green green) datasymbols=(circle star square);
    scatter x=Sales y=Returns / group=Product;
    run;
  16. You can identify individual cases in a scatterplot by typing
    ods graphics / imagemap=on;
    proc sgplot data=sashelp.class;
    scatter x=Height y=Weight / tip=(Name);
    run;

    where Weight is the vertical axis variable, Height is the horizontal axis variable, and Name is the variable that identifies the case. Then hover over an individual case to identify it.
  17. To remove one or more observations from a dataset, determine the value(s) with respect to a particular variable and add a where clause to the appropriate proc code. For example, to remove points from a scatterplot:
    proc sgplot data=sashelp.shoes;
    scatter x=Sales y=Returns;
    where Sales <= 500000;
    run;
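If the same observations need to be excluded from several analyses, one option is to create a reduced copy of the dataset once and then use that copy everywhere. This is a sketch under that assumption; the dataset name shoes_subset is hypothetical.

```sas
/* keep only the observations that satisfy the condition */
data shoes_subset;
  set sashelp.shoes;
  where Sales <= 500000;
run;

/* any later procedure can use the reduced dataset directly */
proc sgplot data=shoes_subset;
  scatter x=Sales y=Returns;
run;
```

The original sashelp.shoes dataset is left unchanged, so the excluded observations can still be recovered.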
  18. To make a bar chart for cases in different categories, use the sgplot procedure.
    • For frequency bar charts of one qualitative variable, type
      proc sgplot data=sashelp.shoes;
      vbar Region;
      run;

      where Region is a qualitative variable.
    • For frequency bar charts of two qualitative variables, type
      proc sgplot data=sashelp.shoes;
      vbar Region / group=Product groupdisplay=cluster;
      run;

      where Region and Product are qualitative variables.
    • The bars can also represent various summary functions for a quantitative variable. For example, to produce a bar chart of means, type
      proc sgplot data=sashelp.shoes;
      vbar Region / response=Sales stat=mean;
      run;

      or:
      proc sgplot data=sashelp.shoes;
      vbar Region / group=Product groupdisplay=cluster response=Sales stat=mean;
      run;

      where Region and Product are the qualitative variables and Sales is a quantitative variable.
  19. There are two good options to create boxplots for cases in different categories:
    Method 1: Using PROC SGPLOT

    • For just one qualitative variable, type
      proc sgplot data=sashelp.shoes;
      vbox Sales / category=Region;
      run;

      where Sales is a quantitative variable and Region is the qualitative variable.
    • For two qualitative variables, type
      proc sgplot data=sashelp.shoes;
      vbox Sales / category=Region group=Product;
      run;

      where Sales is a quantitative variable, Region is the major qualitative variable, and Product is the minor qualitative variable.

    Method 2: Using PROC SGPANEL

    • For just one qualitative variable, type
      proc sgpanel data=sashelp.shoes;
      panelby Region;
      vbox Sales;
      run;

      where Sales is a quantitative variable and Region is the qualitative variable.
    • For two qualitative variables, type
      proc sgpanel data=sashelp.shoes;
      panelby Region Product;
      vbox Sales;
      run;

      where Sales is a quantitative variable, Region is the major qualitative variable, and Product is the minor qualitative variable.

    Which approach you will use depends on which graphical presentation you prefer.

  20. To make a QQ-plot (also known as a normal probability plot) for a
    quantitative variable, type
    proc univariate data=sashelp.class noprint;
    qqplot Weight / normal(mu=est sigma=est);
    run;

    where Weight is a quantitative variable.
  21. To compute a confidence interval for a univariate population mean,
    type
    proc univariate data=sashelp.class cibasic(alpha=0.05);
    var Weight;
    run;

    where Weight is the variable for which you want to calculate the confidence interval, and alpha is the significance level, so alpha=0.05 produces a 95% confidence interval.
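To see how the alpha value enters the calculation, the interval can also be sketched by hand as the sample mean plus or minus a t-percentile times the standard error. This is an illustrative check rather than a replacement for the cibasic option; the dataset and variable names (WeightStats, WeightCI, nObs, xbar, se, tcrit) are made up for the example.

```sas
/* save the sample size, mean, and standard error of the mean */
proc means data=sashelp.class noprint;
  var Weight;
  output out=WeightStats n=nObs mean=xbar stderr=se;
run;

data WeightCI;
  set WeightStats;
  alpha = 0.05;                               /* 95% confidence interval */
  tcrit = quantile('t', 1 - alpha/2, nObs - 1);
  lower = xbar - tcrit*se;
  upper = xbar + tcrit*se;
run;

proc print data=WeightCI;
  var nObs xbar se tcrit lower upper;
run;
```

The lower and upper values should agree with the confidence limits for the mean reported by proc univariate.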
  22. To do a hypothesis test for a univariate population mean, type
    proc univariate data=sashelp.class mu0=90;
    var Weight;
    run;

    where Weight is the variable for which you want to do the test and mu0 is the (null) hypothesized value.
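The same test can also be sketched with the ttest procedure, which reports the t-statistic, p-value, and confidence interval together. This is an alternative to the univariate code above, not part of the original instructions; the sides and alpha settings shown are simply the two-tail, 5% case.

```sas
proc ttest data=sashelp.class h0=90 sides=2 alpha=0.05;
  var Weight;   /* H0: the population mean of Weight equals 90 */
run;
```

Changing sides=2 to sides=U or sides=L requests an upper-tail or lower-tail test instead.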

Simple linear regression

    1. To fit a simple linear regression model (i.e., find a least squares line), type
      proc reg data=sashelp.class;
      model Weight=Height;
      run;

      where Weight is the response variable and Height is the predictor variable. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type
      proc reg data=sashelp.class;
      model Weight=Height / noint;
      run;
    2. To add a regression line or least squares line to a scatterplot, type
      proc sgplot data=sashelp.shoes;
      scatter x=Sales y=Returns;
      reg x=Sales y=Returns;
      run;

      where Returns is the vertical axis variable and Sales is the horizontal axis variable.
    3. To find confidence intervals for the regression parameters in a simple
      linear regression model, type
      proc reg data=sashelp.class alpha=0.05;
      model Weight=Height / clb;
      run;

      where Weight is the response variable, Height is the predictor variable, and alpha is the significance level, so alpha=0.05 produces 95% confidence intervals.

This applies more generally to multiple linear regression also.

    • To find a fitted value or predicted value of Y (the response
      variable) at a particular value of X (the predictor variable), type
      proc reg data=sashelp.class alpha=0.05;
      model Weight=Height;
      output out=PredictedValues predicted=PredictedWeight;
      run;

      where Weight is the response variable and Height is the predictor variable (alpha is the significance level used for any intervals and does not affect the fitted values). A SAS dataset named PredictedValues will be created, with the original variables and values from the input dataset, and a new variable named PredictedWeight that is the predicted value of Weight. Note that you can use any names you wish for PredictedValues and PredictedWeight.
    • You can also obtain a fitted or predicted value of Y at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset using code such as:
      data mydata;
      input Height;
      datalines;
      75
      ;
      run;
      data mydata2;
      set sashelp.class mydata;
      run;

      Then fit the regression model using:
      proc reg data=mydata2 alpha=0.05;
      model Weight=Height;
      output out=PredictedValues predicted=PredictedWeight;
      run;

      SAS will ignore the X-value you added when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But SAS will calculate a fitted or predicted value of Y at this new X-value based on the results of the regression.
      This applies more generally to multiple linear regression also.
    • To find a confidence interval for the mean of Y at a particular value of
      X, type
      proc reg data=sashelp.class alpha=0.05;
      model Weight=Height / clm;
      run;

      where Weight is the response variable, Height is the predictor variable, and alpha is the significance level, so alpha=0.05 produces 95% intervals. The confidence intervals for the mean of Y at each of the X-values in the dataset are displayed as two columns headed CL Mean.
    • You can also obtain a confidence interval for the mean of Y at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset using code such as:
      data mydata;
      input Height;
      datalines;
      75
      ;
      run;
      data mydata2;
      set sashelp.class mydata;
      run;

      Then fit the regression model using:
      proc reg data=mydata2 alpha=0.05;
      model Weight=Height / clm;
      run;

      SAS will ignore the X-value you added when fitting the model (since there is no corresponding Y-value), so all the regression output (such
      as the estimated regression parameters) will be the same. But SAS will calculate a confidence interval for the mean of Y at this new X-value based on the results of the regression.
    • This applies more generally to multiple linear regression also.
    • To find a prediction interval for an individual value of Y at a particular
      value of X, type
      proc reg data=sashelp.class alpha=0.05;
      model Weight=Height / cli;
      run;

      where Weight is the response variable, Height is the predictor variable, and alpha is the significance level, so alpha=0.05 produces 95% intervals. The prediction intervals for an individual value of Y at each of the X-values in the dataset are displayed as two columns headed CL Predict.
    • You can also obtain a prediction interval for an individual value of Y at an
      X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset using code such as:
      data mydata;
      input Height;
      datalines;
      75
      ;
      run;
      data mydata2;
      set sashelp.class mydata;
      run;

      Then fit the regression model using:
      proc reg data=mydata2 alpha=0.05;
      model Weight=Height / cli;
      run;

      SAS will ignore the X-value you added when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But SAS will calculate a prediction interval for an individual value of Y at this new X-value based on the results of the regression.
    • This applies more generally to multiple linear regression also.
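The fitted value, the confidence interval for the mean, and the prediction interval at a new X-value can also be collected in a single run by saving them with the output statement. The following is a sketch under the same setup as above (a new Height of 75 appended to sashelp.class); the dataset and variable names (newx, class_plus, Intervals, MeanLower, and so on) are hypothetical.

```sas
/* the new X-value, with no corresponding Y-value */
data newx;
  input Height;
  datalines;
75
;
run;

data class_plus;
  set sashelp.class newx;
run;

proc reg data=class_plus alpha=0.05;
  model Weight=Height;
  /* lclm/uclm: limits for the mean of Y; lcl/ucl: limits for an individual Y */
  output out=Intervals predicted=PredictedWeight
         lclm=MeanLower uclm=MeanUpper
         lcl=PredLower ucl=PredUpper;
run;
quit;

/* show only the appended row, where the response is missing */
proc print data=Intervals;
  where Weight is missing;
  var Height PredictedWeight MeanLower MeanUpper PredLower PredUpper;
run;
```

Saving the limits to a dataset makes it easy to pick out the row for the new X-value rather than scanning the full listing.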

Multiple linear regression

  1. To fit a multiple linear regression model, type
    proc reg data=sashelp.class;
    model Weight=Age Height;
    run;

    where Weight is the response variable and Age and Height are the predictor
    variables. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type
    proc reg data=sashelp.class;
    model Weight=Age Height / noint;
    run;
  2. To add a quadratic regression line to a scatterplot, type
    proc sgplot data=sashelp.shoes;
    scatter x=Sales y=Returns;
    reg x=Sales y=Returns / degree=2;
    run;

    where Returns is the vertical axis variable and Sales is the horizontal axis variable.
  3. Categories of a qualitative variable can be thought of as defining subsets
    of the sample. If there is also a quantitative response and a quantitative predictor variable in the dataset, a regression model can be fit to the data to represent separate regression lines for each subset. For example, suppose that Type is a qualitative variable with six categories, and Horsepower and EngineSize are two quantitative variables. Then the following code produces a scatterplot with different colors (representing the value of Type) marking the
    points, and six separate regression lines:
    proc sgplot data=sashelp.cars;
    scatter x=EngineSize y=Horsepower / group=Type;
    reg x=EngineSize y=Horsepower / group=Type;
    run;
  4. To find the F-statistic and associated p-value for a nested model F-test in
    multiple linear regression, submit code such as the following:
    proc reg data=sashelp.cars;
    model MPG_Highway={EngineSize Horsepower} {MSRP Invoice} / selection=forward
    groupnames='EngineSize Horsepower' 'MSRP Invoice'
    slentry=0.99;
    run;

    Here, EngineSize and Horsepower are in the reduced model, while EngineSize, Horsepower, MSRP, and Invoice are in the complete model. The F-statistic is in the second row of the "Summary of Forward Selection" table in the column headed F Value, while the associated
    p-value is in the column headed Pr > F.
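Another way to obtain a nested model F-test, offered here as an alternative sketch rather than the method described above, is to fit the complete model and use the test statement in proc reg to test that the extra coefficients are all zero. The label Nested is an arbitrary name for the test.

```sas
proc reg data=sashelp.cars;
  /* complete model */
  model MPG_Highway=EngineSize Horsepower MSRP Invoice;
  /* H0: the coefficients of MSRP and Invoice are both zero (reduced model) */
  Nested: test MSRP=0, Invoice=0;
run;
quit;
```

The output for the test includes an F Value and a Pr > F column.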
  5. To save residuals in a multiple linear regression model, type
    proc reg data=sashelp.class;
    model Weight=Age Height;
    output out=work.res1 residual=resid;
    run;

    This code saves the residuals to the SAS dataset res1 (in the work library) as
    variable resid, where they can now be used just like any other variable, for example, to construct residual plots (note that all the variables in the original dataset are included in the new dataset). To save what Pardoe (2012) calls standardized residuals, use keyword student in place of residual. To save what Pardoe (2012) calls studentized residuals, use keyword rstudent.
  6. To add a loess fitted line to a scatterplot, type
    proc sgplot data=sashelp.shoes;
    scatter x=Sales y=Returns;
    loess x=Sales y=Returns;
    run;

    where Returns is the vertical axis variable and Sales is the horizontal axis variable. If you wish to adjust the smoothness of the line, see SAS Help for information about the smooth option of the loess statement.
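As a sketch of the smooth option, the code below overlays two loess lines with different smoothing parameters so that their effect can be compared. The values 0.3 and 0.8 are arbitrary choices for illustration.

```sas
proc sgplot data=sashelp.shoes;
  scatter x=Sales y=Returns;
  /* nomarkers avoids redrawing the points for each fitted line */
  loess x=Sales y=Returns / smooth=0.3 nomarkers legendlabel='smooth=0.3';
  loess x=Sales y=Returns / smooth=0.8 nomarkers legendlabel='smooth=0.8';
run;
```

Larger values use more of the data for each local fit and generally produce a smoother line.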
  7. To save leverages in a multiple linear regression model, type
    proc reg data=sashelp.class;
    model Weight=Age Height;
    output out=work.res1 h=lev;
    run;

    This code saves the leverages to the SAS dataset res1 (in the work library) as
    variable lev, where they can now be used just like any other variable, for example, to construct scatterplots (note that all the variables in the original dataset are included in the new dataset).
  8. To save Cook's distances in a multiple linear regression model, type
    proc reg data=sashelp.class;
    model Weight=Age Height;
    output out=work.res1 cookd=cooksd;
    run;

    This code saves the Cook's distances to the SAS dataset res1 (in the work library) as variable cooksd, where they can now be used just like any other variable, for example, to construct scatterplots (note that all the variables in the original dataset are included in the new dataset).
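Residuals, leverages, and Cook's distances can be saved together in one output statement, which is convenient when looking for potentially influential cases. The following is a sketch; the dataset names diag and diag_sorted, the variable names, and the choice to list the five largest Cook's distances are illustrative assumptions.

```sas
proc reg data=sashelp.class;
  model Weight=Age Height;
  output out=work.diag residual=resid student=stdres rstudent=studres
         h=lev cookd=cooksd;
run;
quit;

/* list the cases with the largest Cook's distances first */
proc sort data=work.diag out=work.diag_sorted;
  by descending cooksd;
run;

proc print data=work.diag_sorted(obs=5);
  var Name Weight Age Height stdres studres lev cooksd;
run;
```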
  9. To create some residual plots automatically in a multiple linear regression
    model, type
    proc reg data=sashelp.class;
    model Weight=Age Height;
    plot rstudent.*predicted. rstudent.*nqq. rstudent.*cookd.;
    run;

    This produces a plot of studentized residuals versus fitted values, a QQ-plot of the studentized residuals, and a plot of studentized residuals versus Cook's distances. To create residual plots manually, first save studentized residuals (help #35, the instructions above for saving residuals), and then construct scatterplots with these studentized residuals on the vertical axis.
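A manual version of the first two plots might look like the following sketch, which saves the fitted values and studentized residuals and then plots them with sgplot and univariate. The dataset name res2 and the variable names fitted and studres are hypothetical.

```sas
proc reg data=sashelp.class;
  model Weight=Age Height;
  output out=work.res2 predicted=fitted rstudent=studres;
run;
quit;

/* studentized residuals versus fitted values, with a reference line at zero */
proc sgplot data=work.res2;
  scatter x=fitted y=studres;
  refline 0 / axis=y;
run;

/* QQ-plot of the studentized residuals */
proc univariate data=work.res2 noprint;
  qqplot studres / normal(mu=est sigma=est);
run;
```

Replacing fitted with a predictor such as Age or Height in the scatter statement gives a plot of residuals versus that predictor.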
  10. To create a correlation matrix of quantitative variables (useful for
    checking potential multicollinearity problems), type
    proc corr data=sashelp.cars;
    var EngineSize Horsepower MPG_City;
    run;

    where EngineSize, Horsepower, and MPG_City are quantitative variables.
  11. To find variance inflation factors in multiple linear regression,
    type
    proc reg data=sashelp.class;
    model Weight=Age Height / vif;
    run;

    where Weight is the response variable and Age and Height are the predictor
    variables.
  12. To draw a predictor effect plot for graphically displaying the effects of
    transformed quantitative predictors and/or interactions between quantitative and qualitative predictors in multiple linear regression, first create an "X1 effect" variable representing the effect (see help #6). As an example, we'll create WeightEffect1 and WeightEffect2 using the following, as well as an indicator variable for a qualitative variable (see help #7):
    data mydata;
    set sashelp.class;
    if Sex='M' then D1=1;
    else D1=0;
    WeightEffect1 = 1 + (3 * Weight) + (4 * (Weight **2));
    WeightEffect2 = 1 - (2 * Weight) + (3 * D1 * Weight);
    run;

    Next, sort the dataset by the values of the X variable (if necessary):
    proc sort data=mydata out=mydata2;
    by Weight;
    run;

    • If the "X1 effect" variable just involves the X variable (e.g., 1 + 3X + 4X^2), type
      proc sgplot data=mydata2;
      series x=Weight y=WeightEffect1;
      run;
    • If the "X1 effect" variable involves a qualitative variable (e.g., 1 − 2X + 3D1X, where D1 is an indicator variable), type
      proc sgplot data=mydata2;
      series x=Weight y=WeightEffect2 / group=Sex;
      run;

      where Sex is the qualitative variable.

    See Section 5.5 in Pardoe (2012) for an example.