SPSS

These instructions accompany Applied Regression Modeling by Iain Pardoe, 2nd edition published by Wiley in 2012. The numbered items cross-reference with the "computer help" references in the book. These instructions are based on SPSS 20 for Windows, but they (or something similar) should also work for other versions. Find instructions for other statistical software packages here.

Getting started and summarizing univariate data

  1. If desired, change SPSS's default options by selecting Edit > Options. For example, to display variable names (in alphabetical order) rather than labels in dialog boxes, click the General tab; in the Variable Lists group select Display names and select Alphabetical. To show variable names rather than labels in output tables, click the Output Labels tab; under Pivot Table Labeling change Variables in labels shown as to Names. To display small numbers in tables without using scientific notation (which can make reading the numbers more difficult), click the General tab; under Output check No scientific notation for small numbers in tables.
  2. To open an SPSS data file, select File > Open > Data.
  3. To recall a previously used dialog box, click the Dialog Recall tool (fourth button from the left in the Data Editor Window, sixth button from the left in the Viewer Window).
  4. Output can be edited in the Viewer Window. Individual pieces of output (including tables and graphs) can be selected, edited, moved, deleted, and so on using both the Outline Pane (on the left) and the Display Pane (on the right). Text and headings can be entered using the Insert menu. Alternatively, copy and paste pieces of output from SPSS to a word processor like OpenOffice Writer or Microsoft Word.
  5. You can access help by selecting Help > Topics. For example, to find out about "boxplots" click the Index tab, type boxplots in the first box, and select the index entry you want in the second box.
  6. To transform data or compute a new variable, select Transform > Compute Variable. Type a name (with no spaces) for the new variable in the Target Variable box, and type a mathematical expression for the variable in the Numeric Expression box. Current variables in the dataset can be moved into the Numeric Expression box, while the keypad and list of functions can be used to create the expression. Examples are LN(X) for the natural logarithm of X and X**2 for X². Click OK to create the new variable, which will be added to the dataset (check it looks correct in the Data Editor Window); it can now be used just like any other variable. If you get the error message "expression ends unexpectedly," this means there is a syntax error in your Numeric Expression—a common mistake is to forget the multiplication symbol (*) between a number and a variable (e.g., 2*X represents 2X).
  7. To create indicator (dummy) variables from a qualitative variable, select Transform > Recode into Different Variables. Move the qualitative variable into the Input Variable -> Output Variable box, type a name for the first indicator variable in the Output Variable Name box, and press Change (the name should replace the question mark in the Input Variable -> Output Variable box). Next, press Old and New Values, type the appropriate category name/number into the Old Value box, type 1 into the New Value box, and press Add. Then select All other values, type 0 into the New Value box, and press Add. Click Continue to return to the previous dialog box, and click OK (check that the correct indicator variable has been added to your spreadsheet in the Data Editor Window). Repeat for other indicator variables (if necessary).
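    The 0/1 coding that this recode produces can also be sketched outside SPSS. A minimal Python example with hypothetical data (the variable names here are made up for illustration):

    ```python
    # Hypothetical example of 0/1 indicator coding, equivalent to SPSS's
    # Recode into Different Variables: 1 for the chosen category, 0 otherwise.
    regions = ["North", "South", "North", "West"]

    # Indicator variable for the "North" category
    d_north = [1 if r == "North" else 0 for r in regions]
    print(d_north)  # [1, 0, 1, 0]
    ```

    For a qualitative variable with c categories, you would repeat this to create c − 1 indicators, leaving one category as the reference level.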
    • To find a percentile (critical value) for a t-distribution, select Transform > Compute Variable. Type a name (with no spaces) in the Target Variable box (e.g., "cvt"). Then type IDF.T(p, df) into the Numeric Expression box. Here p is the lower-tail area (i.e., one minus the one-tail significance level) and df is the degrees of freedom. Click OK to see the result in the Data Editor Window, where it will appear in a new column. You may need to click Variable View (at the bottom of the window) to change the number of decimal places displayed. For example, IDF.T(0.95, 29) returns the 95th percentile of the t-distribution with 29 degrees of freedom (1.699), which is the critical value for an upper-tail test with a 5% significance level. By contrast, IDF.T(0.975, 29) returns the 97.5th percentile of the t-distribution with 29 degrees of freedom (2.045), which is the critical value for a two-tail test with a 5% significance level.
    • To find a percentile (critical value) for an F-distribution, select Transform > Compute Variable. Type a name (with no spaces) in the Target Variable box (e.g., "cvf"). Then type IDF.F(p, df1, df2) into the Numeric Expression box. Here p is the lower-tail area (i.e., one minus the significance level), df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. For example, IDF.F(0.95, 2, 3) returns the 95th percentile of the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (9.552).
    • To find a percentile (critical value) for a chi-squared distribution, select Transform > Compute Variable. Type a name (with no spaces) in the Target Variable box (e.g., "cvchisq"). Then type IDF.CHISQ(p, df) into the Numeric Expression box. Here p is the lower-tail area (i.e., one minus the significance level) and df is the degrees of freedom. For example, IDF.CHISQ(0.95, 2) returns the 95th percentile of the chi-squared distribution with 2 degrees of freedom (5.991).
    • To find an upper-tail area (one-tail p-value) for a t-distribution, select Transform > Compute Variable. Type a name (with no spaces) in the Target Variable box (e.g., "pt"). Then type 1 - CDF.T(t, df) into the Numeric Expression box. Here t is the value of the t-statistic and df is the degrees of freedom. For example, 1 - CDF.T(2.40, 29) returns the upper-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.012), which is the p-value for an upper-tail test. By contrast, 2*(1 - CDF.T(2.40, 29)) returns the two-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.023), which is the p-value for a two-tail test.
    • To find an upper-tail area (p-value) for an F-distribution, select Transform > Compute Variable. Type a name (with no spaces) in the Target Variable box (e.g., "pf"). Then type SIG.F(f, df1, df2) into the Numeric Expression box. Here f is the value of the F-statistic, df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. For example, SIG.F(51.4, 2, 3) returns the upper-tail area (p-value) for an F-statistic of 51.4 for the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (0.005).
    • To find an upper-tail area (p-value) for a chi-squared distribution, select Transform > Compute Variable. Type a name (with no spaces) in the Target Variable box (e.g., "pchisq"). Then type SIG.CHISQ(chisq, df) into the Numeric Expression box. Here chisq is the value of the chi-squared statistic and df is the degrees of freedom. For example, SIG.CHISQ(0.38, 2) returns the upper-tail area (p-value) for a chi-squared statistic of 0.38 for the chi-squared distribution with 2 degrees of freedom (0.827).
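    If you want to cross-check any of these SPSS values outside the program, the same quantiles and tail areas are available in Python's scipy.stats (assuming scipy is installed; this is not part of the SPSS workflow, just a sketch for verification):

    ```python
    # Cross-check SPSS's IDF.* (percentiles) and SIG.* (upper-tail areas)
    from scipy import stats

    cvt    = stats.t.ppf(0.95, 29)       # IDF.T(0.95, 29)      -> about 1.699
    cvt2   = stats.t.ppf(0.975, 29)      # IDF.T(0.975, 29)     -> about 2.045
    cvf    = stats.f.ppf(0.95, 2, 3)     # IDF.F(0.95, 2, 3)    -> about 9.552
    cvchi  = stats.chi2.ppf(0.95, 2)     # IDF.CHISQ(0.95, 2)   -> about 5.991
    pt     = 1 - stats.t.cdf(2.40, 29)   # 1 - CDF.T(2.40, 29)  -> about 0.012
    pf     = stats.f.sf(51.4, 2, 3)      # SIG.F(51.4, 2, 3)    -> about 0.005
    pchisq = stats.chi2.sf(0.38, 2)      # SIG.CHISQ(0.38, 2)   -> about 0.827
    ```

    Note that scipy's ppf (percent point function) corresponds to SPSS's IDF functions, while sf (survival function) corresponds to the SIG functions.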
  8. Calculate descriptive statistics for quantitative variables by selecting Analyze > Descriptive Statistics > Frequencies. Move the variable(s) into the Variable(s) list. Click Statistics to select the summaries, such as the Mean, that you would like. To avoid superfluous output uncheck Display frequency tables.
  9. Create contingency tables or cross-tabulations for qualitative variables by selecting Analyze > Descriptive Statistics > Crosstabs. Move one qualitative variable into the Row(s) list and another into the Column(s) list. Cell percentages (within rows, columns, or the whole table) can be calculated by clicking Cells.
  10. If you have a quantitative variable and a qualitative variable, you can calculate descriptive statistics for cases grouped in different categories by selecting Analyze > Reports > Case Summaries. Move the quantitative variable(s) into the Variables list and the qualitative variable(s) into the Grouping Variable(s) list. Click Statistics to select the summaries that you would like; the default is Number of Cases, but other statistics such as the Mean and Standard Deviation can also be selected. To avoid superfluous output uncheck Display cases.
  11. To make a stem-and-leaf plot for a quantitative variable, select Analyze > Descriptive Statistics > Explore. Move the variable into the Dependent List box. You can alter the statistics that are calculated and the plots that are constructed by clicking Statistics and Plots.
  12. To make a histogram for a quantitative variable, select Graphs > Legacy Dialogs > Histogram. Move the variable into the Variable box.
  13. To make a scatterplot with two quantitative variables, select Graphs > Legacy Dialogs > Scatter/Dot. Choose Simple Scatter and move the vertical axis variable into the Y Axis box and the horizontal axis variable into the X Axis box.
  14. All possible scatterplots for more than two variables can be drawn simultaneously (called a scatterplot matrix) by selecting Graphs > Legacy Dialogs > Scatter/Dot, then choosing Matrix Scatter and moving the variables into the Matrix Variables list.
  15. You can mark or label cases in a scatterplot with different colors/symbols according to categories in a qualitative variable by moving the variable into the Set Markers by box in the Scatterplot dialog. To change the colors/symbols used, edit the plot (double-click it in the Viewer Window) to bring up a Chart Editor Window, select the symbol you want to change by clicking on it in the legend at the right of the plot (the data points corresponding to this symbol should become highlighted when you do this), and select Edit > Properties. Select the color/symbol you want and click Apply to see the effect. Click Close to return to the plot; close the plot to return to the Viewer Window.
  16. You can identify individual cases in a scatterplot using labels by moving a qualitative text variable into the Label Cases by box in the Scatterplot dialog. This has no apparent effect on the plot when it is first drawn, but if you subsequently edit the plot (double-click it in the Viewer Window) to bring up a Chart Editor Window, you can then use the Point Identification tool (under Elements > Data Label Mode) to click on a point and the label for that point will be displayed.
  17. To remove one or more observations from a dataset, select Data > Select Cases and choose an appropriate selection criterion.
  18. To make a bar chart for cases in different categories, select Graphs > Legacy Dialogs > Bar.
    • For frequency bar charts of one qualitative variable, choose Simple and move the variable into the Category Axis box.
    • For frequency bar charts of two qualitative variables, choose Clustered and move one variable into the Category Axis box and the other into the Define Clusters by box.
    • The bars can also represent various summary functions for a quantitative variable. For example, to produce a bar chart of means, select Other statistic (e.g., mean) and move the quantitative variable into the Variable box.
  19. To make boxplots for cases in different categories, select Graphs > Legacy Dialogs > Boxplot.
    • For just one qualitative variable, choose Simple and move the qualitative variable into the Category Axis box. Move the quantitative variable into the Variable box.
    • For two qualitative variables, choose Clustered and move one qualitative variable into the Category Axis box and the other into the Define Clusters by box. Move the quantitative variable into the Variable box.
  20. To make a QQ-plot (also known as a normal probability plot) for a quantitative variable, select Analyze > Descriptive Statistics > Q-Q Plots. Move the variable into the Variables box and leave the Test Distribution as Normal to assess normality of the variable. This procedure produces a regular QQ-plot (described in Section 1.2 of Pardoe, 2012) as well as a "detrended" one.
  21. To compute a confidence interval for a univariate population mean, select Analyze > Descriptive Statistics > Explore. Move the variable for which you want to calculate the confidence interval into the Dependent List box and select Statistics for Display. Then click the Statistics button to bring up another dialog box in which you can specify the confidence level for the interval (among other things). Clicking Continue will take you back to the previous dialog box, where you can now click OK.
  22. To do a hypothesis test for a univariate population mean, select Analyze > Compare Means > One-Sample T Test. Move the variable for which you want to do the test into the Test Variable(s) box and type the (null) hypothesized value into the Test Value box. The p-value calculated (displayed as "Sig.") is a two-tailed p-value; to obtain a one-tailed p-value you will either need to divide this value by two, or subtract it from one and then divide by two (draw a picture of the t-distribution to figure out which, depending on the sign of the t-statistic and the direction of your alternative hypothesis).
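The two-tail versus one-tail relationship can be cross-checked outside SPSS. A sketch in Python with hypothetical data (assuming scipy is available; the data values here are made up):

```python
# One-sample t-test; SPSS's "Sig." column is the two-tailed p-value.
# Hypothetical sample tested against a hypothesized mean of 10.
from scipy import stats

x = [10.2, 11.1, 9.8, 10.7, 11.3, 10.9, 9.5, 10.4]
t_stat, p_two = stats.ttest_1samp(x, popmean=10.0)

# For an upper-tail alternative: when the t-statistic is positive,
# the one-tailed p-value is half the two-tailed one; when it is
# negative, subtract the two-tailed p-value from one first.
p_upper = p_two / 2 if t_stat > 0 else 1 - p_two / 2
```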

Simple linear regression

  1. To fit a simple linear regression model (i.e., find a least squares line), select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variable into the Independent(s) box. Just click OK for now—the other items in the dialog box are addressed below. In the output, ignore the column headed "Standardized Coefficients." In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), click Options before clicking OK and uncheck Include constant in equation.
  2. To add a regression line or least squares line to a scatterplot, edit the plot (double-click it in the Viewer Window) to bring up a Chart Editor Window and select Elements > Fit Line at Total. This brings up another dialog in which you need to make sure Linear is selected under Fit Method. Click Close to add the least squares line and return to the plot; close the plot to return to the Viewer Window.
  3. To find 95% confidence intervals for the regression parameters in a simple linear regression model, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variable into the Independent(s) box. Before clicking OK, click the Statistics button and check Confidence intervals (under Regression Coefficient) in the subsequent Linear Regression: Statistics dialog box. Click Continue to return to the main Linear Regression dialog box, and then click OK. The confidence intervals are displayed as the final two columns of the "Coefficients" output. This applies more generally to multiple linear regression also.
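     These confidence intervals can be verified by hand. A sketch in Python with hypothetical data (assuming numpy and scipy are available), using the slope estimate plus or minus a t critical value times its standard error:

     ```python
     # By-hand 95% confidence interval for the slope, matching the last
     # two columns of SPSS's "Coefficients" output (hypothetical data).
     import numpy as np
     from scipy import stats

     x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
     y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

     res = stats.linregress(x, y)
     t_crit = stats.t.ppf(0.975, len(x) - 2)   # two-sided 95%, df = n - 2
     ci_slope = (res.slope - t_crit * res.stderr,
                 res.slope + t_crit * res.stderr)
     ```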
    • To find a fitted value or predicted value of Y (the response variable) at a particular value of X (the predictor variable), select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variable into the Independent(s) box. Before clicking OK, click the Save button and check Unstandardized under Predicted Values in the subsequent Linear Regression: Save dialog box. Click Continue to return to the main Linear Regression dialog box, and then click OK. The fitted or predicted values of Y at each of the X-values in the dataset are displayed in the column headed PRE_1 in the Data Editor Window (not in the Viewer Window). Each time you ask SPSS to calculate fitted or predicted values like this it will add a new column to the dataset and increment the end digit by one; for example, the second time you calculate fitted or predicted values they will be called PRE_2.
    • You can also obtain a fitted or predicted value of Y at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset in the Data Editor Window (go down to the bottom of the spreadsheet, and type the X-value in the appropriate cell of the next blank row). Then fit the regression model and follow the steps above. SPSS will ignore the X-value you typed when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But SPSS will calculate a fitted or predicted value of Y at this new X-value based on the results of the regression. Again, look for it in the dataset; it will be displayed in the most recently added PRE column (e.g., PRE_1) in the Data Editor Window (not in the Viewer Window).
    • This applies more generally to multiple linear regression also.
    • To find a confidence interval for the mean of Y at a particular value of X, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variable into the Independent(s) box. Before clicking OK, click the Save button and check Mean (under Prediction Intervals) in the subsequent Linear Regression: Save dialog box. Type the value of the confidence level that you want in the Confidence Interval box (the default is 95%), click Continue to return to the main Linear Regression dialog box, and then click OK. The confidence intervals for the mean of Y at each of the X-values in the dataset are displayed as two columns headed LMCI_1 and UMCI_1 in the Data Editor Window (not in the Viewer Window). The "LMCI" stands for "lower mean confidence interval," while the "UMCI" stands for "upper mean confidence interval." Each time you ask SPSS to calculate confidence intervals like this it will add new columns to the dataset and increment the end digit by one; for example, the second time you calculate confidence intervals for the mean of Y the end points will be called LMCI_2 and UMCI_2.
    • You can also obtain a confidence interval for the mean of Y at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset in the Data Editor Window (go down to the bottom of the spreadsheet, and type the X-value in the appropriate cell of the next blank row). Then fit the regression model and follow the steps above. SPSS will ignore the X-value you typed when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But SPSS will calculate a confidence interval for the mean of Y at this new X-value based on the results of the regression. Again, look for it in the dataset; it will be displayed in the most recently added LMCI and UMCI columns (e.g., LMCI_1 and UMCI_1) in the Data Editor Window (not in the Viewer Window).
    • This applies more generally to multiple linear regression also.
    • To find a prediction interval for an individual value of Y at a particular value of X, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variable into the Independent(s) box. Before clicking OK, click the Save button and check Individual (under Prediction Intervals) in the subsequent Linear Regression: Save dialog box. Type the value of the confidence level that you want in the Confidence Interval box (the default is 95%), click Continue to return to the main Linear Regression dialog box, and then click OK. The prediction intervals for an individual Y-value at each of the X-values in the dataset are displayed as two columns headed LICI_1 and UICI_1 in the Data Editor Window (not in the Viewer Window). The "LICI" stands for "lower individual confidence interval," while the "UICI" stands for "upper individual confidence interval." Each time you ask SPSS to calculate prediction intervals like this it will add new columns to the dataset and increment the end digit by one; for example, the second time you calculate prediction intervals for an individual value of Y the end points will be called LICI_2 and UICI_2.
    • You can also obtain a prediction interval for an individual Y-value at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset in the Data Editor Window (go down to the bottom of the spreadsheet, and type the X-value in the appropriate cell of the next blank row). Then fit the regression model and follow the steps above. SPSS will ignore the X-value you typed when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But SPSS will calculate a prediction interval for an individual Y-value at this new X-value based on the results of the regression. Again, look for it in the dataset; it will be displayed in the two columns headed LICI and UICI in the Data Editor Window (not in the Viewer Window).
    • This applies more generally to multiple linear regression also.
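     The quantities that SPSS saves as LMCI/UMCI and LICI/UICI can be reproduced by hand from the usual simple linear regression formulas. A sketch in Python with hypothetical data (assuming numpy and scipy are available); note that the prediction interval's standard error has an extra "1 +" term, which is why it is always wider than the confidence interval for the mean:

     ```python
     # By-hand CI for the mean of Y (LMCI/UMCI) and prediction interval
     # for an individual Y (LICI/UICI) at a new value x0 (hypothetical data).
     import numpy as np
     from scipy import stats

     x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
     y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
     n = len(x)

     b1, b0 = np.polyfit(x, y, 1)                  # slope, intercept
     resid = y - (b0 + b1 * x)
     s = np.sqrt(np.sum(resid**2) / (n - 2))       # residual standard error
     sxx = np.sum((x - x.mean())**2)

     x0 = 3.5
     y0 = b0 + b1 * x0                             # fitted value at x0
     t_crit = stats.t.ppf(0.975, n - 2)
     se_mean = s * np.sqrt(1/n + (x0 - x.mean())**2 / sxx)       # mean of Y
     se_pred = s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / sxx)   # individual Y
     ci = (y0 - t_crit * se_mean, y0 + t_crit * se_mean)
     pi = (y0 - t_crit * se_pred, y0 + t_crit * se_pred)
     ```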

Multiple linear regression

  1. To fit a multiple linear regression model, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variables into the Independent(s) box. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), click Options before clicking OK and uncheck Include constant in equation.
  2. To add a quadratic regression line to a scatterplot, edit the plot (double-click it in the Viewer Window) to bring up a Chart Editor Window and select Elements > Fit Line at Total. This brings up another dialog in which you need to check the Quadratic option under Fit Method. Click Apply and Close to add the quadratic regression line and return to the plot; close the plot to return to the Viewer Window.
  3. Categories of a qualitative variable can be thought of as defining subsets of the sample. If there is also a quantitative response and a quantitative predictor variable in the dataset, a regression model can be fit to the data to represent separate regression lines for each subset. First use help #15 and #17 to make a scatterplot with the response variable on the vertical axis, the quantitative predictor variable on the horizontal axis, and the cases marked with different colors/symbols according to the categories in the qualitative predictor variable. To add a regression line for each subset to this scatterplot, edit the plot (double-click it in the Viewer Window) to bring up a Chart Editor Window and select Elements > Fit Line at Subgroups. This brings up another dialog in which you need to make sure Linear is selected under Fit Method. Click Close to add the least squares lines for each subset of selected points and return to the plot. Close the plot to return to the Viewer Window.
  4. To find the F-statistic and associated p-value for a nested model F-test in multiple linear regression, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variables in the reduced model into the Independent(s) box. Click the Next button to the right of where it says Block 1 of 1; it should now say Block 2 of 2 and the Independent(s) box should have been cleared. Move the additional predictors in the complete model (i.e., the predictors whose usefulness you are assessing) into this Block 2 Independent(s) box. You should now have the predictors that are in both the reduced and complete models in Block 1, and the predictors that are only in the complete model in Block 2. Then click Statistics and check R squared change. Finally click Continue to return to the Regression dialog and OK to obtain the results. The F-statistic is in the second row of the "Model Summary" in the column headed F Change, while the associated p-value is in the column headed Sig. (Ignore the numbers in the first rows of these columns.)
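     The "F Change" statistic that SPSS reports is computed from the residual sums of squares of the two nested models. A sketch in Python with simulated (hypothetical) data, assuming numpy and scipy are available:

     ```python
     # Nested model F-test by hand: F = ((RSS_reduced - RSS_full)/k) /
     # (RSS_full / (n - p)), where k extra predictors are being assessed.
     import numpy as np
     from scipy import stats

     rng = np.random.default_rng(0)
     n = 30
     x1, x2 = rng.normal(size=n), rng.normal(size=n)
     y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=n)

     def rss(X, y):
         beta, *_ = np.linalg.lstsq(X, y, rcond=None)
         r = y - X @ beta
         return r @ r

     X_red  = np.column_stack([np.ones(n), x1])        # reduced model (Block 1)
     X_full = np.column_stack([np.ones(n), x1, x2])    # complete model (Blocks 1+2)
     rss_r, rss_f = rss(X_red, y), rss(X_full, y)

     k = 1                       # number of extra predictors in the complete model
     df2 = n - X_full.shape[1]   # error df of the complete model
     F = ((rss_r - rss_f) / k) / (rss_f / df2)
     p = stats.f.sf(F, k, df2)   # matches the "Sig." of F Change
     ```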
  5. To save residuals in a multiple linear regression model, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variables into the Independent(s) box. Before clicking OK, click the Save button and check Unstandardized under Residuals in the subsequent Linear Regression: Save dialog box. Click Continue to return to the main Linear Regression dialog box, and then click OK. The residuals are saved as a variable called RES_1 in the Data Editor Window; they can now be used just like any other variable, for example, to construct residual plots. Each time you ask SPSS to save residuals like this it will add a new variable to the dataset and increment the end digit by one; for example, the second time you save residuals they will be called RES_2. To save what Pardoe (2012) calls standardized residuals, check Studentized under Residuals in the Linear Regression: Save dialog box—they will be saved as a variable called SRE_1 in the Data Editor Window. To save what Pardoe (2012) calls studentized residuals, check Studentized deleted under Residuals in the Linear Regression: Save dialog box—they will be saved as a variable called SDR_1 in the Data Editor Window.
  6. To add a loess fitted line to a scatterplot (useful for checking the zero mean regression assumption in a residual plot), edit the plot (double-click it in the Viewer Window) to bring up a Chart Editor Window and select Elements > Fit Line at Total. This brings up another dialog in which you need to check the Loess option under Fit Method. The default value of 50 for % of points to fit tends to be a little on the low side: I would change it to 75. Click Apply and Close to add the loess fitted line and to return to the plot; close the plot to return to the Viewer Window.
  7. To save leverages in a multiple linear regression model, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variables into the Independent(s) box. Before clicking OK, click the Save button and check Leverage values under Distances in the subsequent Linear Regression: Save dialog box. Click Continue to return to the main Linear Regression dialog box, and then click OK. This results in "centered" leverages being saved as a variable called LEV_1 in the Data Editor Window; they can now be used just like any other variable, for example, to construct scatterplots. Each time you save leverages like this, SPSS will add a new variable to the dataset and increment the end digit by one; for example, the second set of leverages will be called LEV_2. Centered leverage = ordinary leverage − 1/n, where ordinary leverage is defined in Section 5.1.2 of Pardoe (2012) and n is the sample size.
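     The relationship between ordinary and centered leverage can be checked by hand: ordinary leverages are the diagonal entries of the hat matrix H = X(X'X)⁻¹X'. A sketch in Python with simulated (hypothetical) data, assuming numpy is available:

     ```python
     # Ordinary leverage = diagonal of the hat matrix H = X (X'X)^{-1} X';
     # SPSS saves centered leverage = ordinary leverage - 1/n.
     import numpy as np

     rng = np.random.default_rng(1)
     n = 20
     X = np.column_stack([np.ones(n),            # intercept column
                          rng.normal(size=n),
                          rng.normal(size=n)])

     H = X @ np.linalg.inv(X.T @ X) @ X.T
     leverage = np.diag(H)          # ordinary leverage (Section 5.1.2)
     centered = leverage - 1 / n    # what SPSS saves as LEV_1
     ```

     As a check, the ordinary leverages sum to the number of regression parameters, and (with an intercept in the model) each is at least 1/n, so the centered leverages are nonnegative.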
  8. To save Cook's distances in a multiple linear regression model, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variables into the Independent(s) box. Before clicking OK, click the Save button and check Cook's under Distances in the subsequent Linear Regression: Save dialog box. Click Continue to return to the main Linear Regression dialog box, and then click OK. Cook's distances are saved as a variable called COO_1 in the Data Editor Window; they can now be used just like any other variable, for example, to construct scatterplots. Each time you save Cook's distances like this, SPSS will add a new variable to the dataset and increment the end digit by one; for example, the second set of Cook's distances will be called COO_2.
  9. To create some residual plots automatically in a multiple linear regression model, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variables into the Independent(s) box. Before clicking OK, click the Plots button and move *SRESID into the Y box and *ZPRED into the X box to create a scatterplot of the standardized residuals on the vertical axis versus the standardized predicted values on the horizontal axis. Click Continue to return to the main Linear Regression dialog box, and then click OK. To create residual plots manually, first create studentized residuals (see help #35), and then construct scatterplots with these studentized residuals on the vertical axis.
  10. To create a correlation matrix of quantitative variables (useful for checking potential multicollinearity problems), select Analyze > Correlate > Bivariate. Move the variables into the Variables box and click OK.
  11. To find variance inflation factors in multiple linear regression, select Analyze > Regression > Linear. Move the response variable into the Dependent box and the predictor variables into the Independent(s) box. Before clicking OK, click Statistics and check Collinearity diagnostics. Click Continue to return to the Regression dialog and OK to obtain the results. The variance inflation factors are in the last column of the "Coefficients" output under "VIF."
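     The VIF values in that column can be reproduced by hand: regress each predictor on all the others and compute VIF_j = 1 / (1 − R_j²). A sketch in Python with simulated (hypothetical) data, assuming numpy is available:

     ```python
     # Variance inflation factors by hand: VIF_j = 1 / (1 - R_j^2), where
     # R_j^2 comes from regressing predictor j on the other predictors.
     import numpy as np

     rng = np.random.default_rng(2)
     n = 50
     x1 = rng.normal(size=n)
     x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # deliberately correlated with x1
     X = np.column_stack([x1, x2])

     def vif(X, j):
         others = np.delete(X, j, axis=1)
         A = np.column_stack([np.ones(len(X)), others])   # intercept + others
         beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
         resid = X[:, j] - A @ beta
         r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean())**2)
         return 1 / (1 - r2)

     vifs = [vif(X, j) for j in range(X.shape[1])]
     ```

     Uncorrelated predictors give VIFs near 1, while strongly correlated predictors (as in this simulated pair) inflate them well above 1.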
  12. To draw a predictor effect plot for graphically displaying the effects of transformed quantitative predictors and/or interactions between quantitative and qualitative predictors in multiple linear regression, first create a variable representing the effect, say, "X1effect" (see computer help #6). Then select Graphs > Legacy Dialogs > Scatter/Dot. Choose Simple Scatter and move the "X1effect" variable into the Y Axis box and X1 into the X Axis box.
    • If the "X1effect" variable just involves X1 (e.g., 1 + 3X1 + 4X1²), you can click OK at this point, then double-click on the scatterplot to get into the chart editor. Right-click on Markers and select Properties Window. In the Properties Window dialog box, click on the Variables tab and in the top drop-down box select Path (this will draw the line between the markers on the plot). Finally, click on Apply and then Close.
    • If the "X1effect" variable also involves a qualitative variable (e.g., 1 − 2X1 + 3D2X1, where D2 is an indicator variable), you should Set markers by the qualitative variable before clicking OK and editing the plot.

    See Section 5.5 in Pardoe (2012) for an example.