# Data Desk

These instructions accompany Applied Regression Modeling by Iain Pardoe, 2nd edition published by Wiley in 2012. The numbered items cross-reference with the "computer help" references in the book. These instructions are based on Data Desk 6 for Windows, but they (or something similar) should also work for other versions. Find instructions for other statistical software packages here.

#### Getting started and summarizing univariate data

1. If desired, change Data Desk's default options by selecting Edit > Preferences.
2. To open a Data Desk data file, select File > Open Datafile. You can also use File > Import to open text data files or use Data Desk/XL, an Excel add-in, to export Excel spreadsheets to Data Desk.
3. Data Desk does not appear to offer a way to recall or edit the last dialog box used.
4. Output appears in a folder named Results.
5. You can access help by selecting Help > Data Desk Help.
6. To transform data or compute a new variable, select the variable you want to transform (denoted Y) and select Manip > Transform and the required transformation. Examples are Exponentials > ln(y) for the natural logarithm of Y and Exponentials > y^2 for Y². Alternatively, select Manip > Transform > New Derived Variable, name the new derived variable, click OK, and in the resulting text window type the equation for the new variable (this is particularly useful if the variable is a function of more than one of the existing variables). The new variable should now appear in the same icon window as the original variable, and have an appropriate name, e.g., LY for the natural logarithm of Y (check that it looks correct by showing the numbers); it can now be used just like any other variable.
7. To create indicator (dummy) variables from a qualitative variable, select Manip > Transform > New Derived Variable. Name the new derived variable, click OK, and in the resulting text window write: If TextOf('var') = "cat1" then 1 else 0, where var is the name of the qualitative variable and cat1 is the category for which you want the indicator variable to have the value 1. Check that the correct indicator variable has been created by showing the numbers. Repeat for other indicator variables (if necessary).
8. Data Desk does not appear to offer a way to find percentiles (critical values) for t, F, or chi-squared distributions.
9. Data Desk does not appear to offer a way to find tail areas (p-values) for t, F, or chi-squared distributions.
10. Calculate descriptive statistics for quantitative variables by selecting the quantitative variable (denoted Y) and selecting Calc > Summaries > Reports. Use the HyperView menu (top-left triangle) to select the summaries, such as the Mean, that you would like.
11. Create contingency tables or cross-tabulations for qualitative variables by selecting the first qualitative variable (denoted Y, representing the row categories), shift-selecting the second qualitative variable (denoted X, representing the column categories), and selecting Calc > Contingency Tables. Use the HyperView menu to calculate cell percentages (within rows, columns, or the whole table).
12. If you have quantitative variables and qualitative variables, you can calculate descriptive statistics for cases grouped in different categories by selecting the quantitative variable (denoted Y), shift-selecting the qualitative variable (denoted X), and selecting Calc > Summaries > Reports By Groups. Use the HyperView menu to select the summaries, such as the Mean, that you would like. If you want to group using two qualitative variables, first create a new variable consisting of all category combinations by selecting the two qualitative variables (one can be Y, the other X, it does not matter which) and selecting Manip > Transform > Misc > Concatenate(y,x). Then use the new variable as the qualitative variable (X) in the previous instructions.
13. Data Desk does not appear to offer an automatic way to make a stem-and-leaf plot for a quantitative variable.
14. To make a histogram for a quantitative variable, select the quantitative variable (denoted Y) and select Plot > Histograms.
15. To make a scatterplot with two quantitative variables, select the vertical axis quantitative variable (denoted Y), shift-select the horizontal-axis quantitative variable (denoted X), and select Plot > Scatterplots.
16. All possible scatterplots for more than two variables can be drawn simultaneously (called a scatterplot matrix) by selecting the variables you want plotted (it does not matter which are denoted Y or X) and selecting Plot > Plot Matrix.
17. You can mark or label cases in a scatterplot with different colors/symbols according to categories in a qualitative variable by selecting the qualitative variable and selecting Modify > Colors > Add > by Group or Modify > Symbols > Add > by Group.
18. You can identify individual cases in a scatterplot using labels by opening the variable window containing the labels and selecting the Query tool from the tools palette (fourth one down in the right column). You can then click on a point in the scatterplot and the label for that point will be displayed.
19. To remove one or more observations from a dataset, double-click the response variable in the Data folder, highlight the value(s) that you want to remove, and select Edit > Clear.
20. To make a bar chart for cases in different categories, select the qualitative variable that represents the different categories and select Plot > Bar Charts.
• This will produce a frequency bar chart of the qualitative variable. For frequency bar charts of two qualitative variables use a newly created qualitative variable consisting of all category combinations (as in computer help #12).
• Data Desk does not appear to offer an automatic way to have the bars represent summary functions for a quantitative variable, such as the mean.
21. To make boxplots for cases in different categories, select the quantitative variable (denoted Y), shift-select the qualitative variable (denoted X), and select Plot > Boxplot y by x. For two qualitative variables, use a newly created qualitative variable consisting of all category combinations (as in computer help #12).
22. To make a QQ-plot (also known as a normal probability plot) for a quantitative variable, select the quantitative variable (denoted Y) and select Plot > Normal Prob Plot.
23. To compute a confidence interval for a univariate population mean, select the quantitative variable (denoted Y) and select Calc > Estimate. In the resulting window, select t-Interval for Individual μ's, select Individual (rather than Total), specify the confidence level for the interval, and click Show Results.
24. To do a hypothesis test for a univariate population mean, select the quantitative variable (denoted Y) and select Calc > Test. In the resulting window, select t-Test of Individual μ's, select Individual (rather than Total), specify the significance level (Alpha level) for the test, type the (null) hypothesized value into the "Ho:μ=" box, select the alternative hypothesis (Ha) to be lower-tail ("μ<"), two-tail ("μ≠"), or upper-tail ("μ>"), and click Show Results.
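
Since Data Desk does not appear to provide t-distribution percentiles or tail areas (computer help #8 and #9), they have to come from a table or another tool. The following is a minimal Python sketch, not part of Data Desk or the book, that approximates both using only the standard library. The function names (`t_pdf`, `t_cdf`, `t_percentile`, `two_tail_p_value`) are hypothetical, and the numerical method (Simpson's rule plus bisection) is just one simple choice; it covers the t-distribution only, not F or chi-squared.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t, df):
    """P(T <= t): Simpson's rule on [0, |t|], then symmetry about zero."""
    a = abs(t)
    if a == 0:
        return 0.5
    steps = 2 * max(50, math.ceil(a / 0.02))  # even count, step width <= 0.01
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def t_percentile(p, df):
    """Value t such that P(T <= t) = p, found by bisection."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if p < 0.5:
        return -t_percentile(1 - p, df)  # use symmetry for the lower tail
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # expand the bracket until it contains the answer
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def two_tail_p_value(t_stat, df):
    """Two-tail area beyond +/- |t_stat|."""
    return 2 * (1 - t_cdf(abs(t_stat), df))


if __name__ == "__main__":
    # 97.5th percentile with 28 degrees of freedom (for a 95% interval)
    print(round(t_percentile(0.975, 28), 3))
    # Two-tail p-value for a t-statistic of 2.228 with 10 degrees of freedom
    print(round(two_tail_p_value(2.228, 10), 3))
```

The 97.5th percentile is the multiplier for a 95% two-sided interval, so its output can be checked against a printed t-table.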

#### Simple linear regression

1. To fit a simple linear regression model (i.e., find a least squares line), select the response variable (denoted Y), shift-select the predictor variable (denoted X), and select Calc > Regression. Some of the items in the HyperView menu are addressed below. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), select Calc > Calculation Options > Regression Options... and uncheck Include constant term before fitting the model.
2. To add a regression line or least squares line to a scatterplot, select Add Regression Line from the scatterplot's HyperView menu.
3. Data Desk does not appear to offer an automatic way to find 95% confidence intervals for the regression parameters in a simple or multiple linear regression model. It is possible to calculate these intervals by hand using Data Desk regression output and appropriate percentiles from a t-distribution.
• To find a fitted value or predicted value of Y (the response variable) at a particular value of X (the predictor variable) in a linear regression model, select Compute > Predicted from the regression's HyperView menu. The fitted or predicted values of Y at each of the X-values in the dataset are displayed in a new variable named predicted(*), where the star abbreviates the response variable name.
• You can also obtain a fitted or predicted value of Y at an X-value that is not in the dataset by doing the following. Before fitting the regression model, add the X-value to the dataset by typing the X-value at the bottom of the column of values for the predictor. Then fit the regression model and follow the steps above. Data Desk will ignore the X-value you typed when fitting the model (since there is no corresponding Y-value), so all the regression output (such as the estimated regression parameters) will be the same. But Data Desk will calculate a fitted or predicted value of Y at this new X-value based on the results of the regression. Again, look for it in the new variable named predicted(*).
• This applies more generally to multiple linear regression also.
4. Data Desk does not appear to offer an automatic way to find a confidence interval for the mean of Y at a particular value of X in a simple linear regression model. It is possible to calculate such an interval by hand using Data Desk regression output and an appropriate percentile from a t-distribution. This applies more generally to multiple linear regression also.
5. Data Desk does not appear to offer an automatic way to find a prediction interval for an individual Y-value at a particular X-value in a simple linear regression model. It is possible to calculate such an interval by hand using Data Desk regression output and an appropriate percentile from a t-distribution. This applies more generally to multiple linear regression also.
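
The hand calculations mentioned above (intervals for the regression parameters, for the mean of Y at a particular X, and for an individual Y-value) can be sketched in Python as follows. This is an illustration under stated assumptions, not a Data Desk feature: the function name `simple_regression_intervals`, the five-point dataset, and the choice to recompute everything from the raw data are all hypothetical, and the t-percentile is passed in as a number looked up in a t-table with n − 2 degrees of freedom.

```python
import math


def simple_regression_intervals(x, y, x_new, t_crit):
    """Least squares fit of y on x, plus intervals built from t_crit.

    t_crit is the t-percentile with n - 2 degrees of freedom,
    e.g. the 97.5th percentile for 95% intervals (from a t-table).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx              # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated intercept

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # regression standard error

    se_b1 = s / math.sqrt(sxx)
    se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / sxx)

    fit = b0 + b1 * x_new
    spread = 1 / n + (x_new - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(spread)      # for the mean of Y at x_new
    se_pred = s * math.sqrt(1 + spread)  # for an individual Y at x_new

    def interval(center, se):
        return (center - t_crit * se, center + t_crit * se)

    return {
        "b0": b0, "b1": b1, "s": s, "fit": fit,
        "ci_b0": interval(b0, se_b0),
        "ci_b1": interval(b1, se_b1),
        "ci_mean": interval(fit, se_mean),
        "pred_interval": interval(fit, se_pred),
    }


if __name__ == "__main__":
    # Hypothetical data; 3.182 is the 97.5th t-percentile with 3 df.
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]
    for name, value in simple_regression_intervals(x, y, 3, 3.182).items():
        print(name, value)
```

The prediction interval is always wider than the confidence interval for the mean at the same X-value, because it adds the variation of an individual observation around the line.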

#### Multiple linear regression

1. To fit a multiple linear regression model, select the response variable (denoted Y), shift-select the predictor variables (denoted X), and select Calc > Regression. Some of the items in the HyperView menu are addressed below. In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), select Calc > Calculation Options > Regression Options... and uncheck Include constant term before fitting the model.
2. Data Desk does not appear to offer an automatic way to add a quadratic regression line to a scatterplot.
3. Categories of a qualitative variable can be thought of as defining subsets of the sample. If there are also a quantitative response and a quantitative predictor variable in the dataset, a regression model can be fit to the data to represent separate regression lines for each subset. First use computer help #15 and #17 to make a scatterplot with the response variable on the vertical axis, the quantitative predictor variable on the horizontal axis, and the cases marked with different colors according to the categories in the qualitative predictor variable. To add a regression line for each subset to this scatterplot, select Add Color Regression Lines from the HyperView menu.
4. Data Desk does not appear to offer an automatic way to find the F-statistic and associated p-value for a nested model F-test in multiple linear regression. It is possible to calculate these quantities by hand using Data Desk regression output and appropriate percentiles from an F-distribution.
5. To save residuals in a multiple linear regression model, select Compute > Residuals from the regression's HyperView menu. The residuals are saved as a variable called residuals(*), where the star abbreviates the response variable name; they can now be used just like any other variable, for example, to construct residual plots. To save what Pardoe (2012) calls standardized residuals, select Compute > IStudRes—they will be saved as a variable called IStudRes(*). To save what Pardoe (2012) calls studentized residuals, select Compute > EStudRes—they will be saved as a variable called EStudRes(*).
6. To add a loess fitted line to a scatterplot (useful for checking the zero mean regression assumption in a residual plot), select Smoothing > Add Lowess Smooth from the scatterplot's HyperView menu. Select Smoothing > Smoothing Options to change the value of the Lowess Span %; you can experiment to find a value that captures the major trends in the scatterplot without being overly "wiggly."
7. To save leverages in a multiple linear regression model, select Compute > Leverages from the regression's HyperView menu. The leverages are saved as a variable called leverages(*), where the star abbreviates the response variable name; they can now be used just like any other variable, for example, to construct scatterplots.
8. To save Cook's distances in a multiple linear regression model, select Compute > Cook from the regression's HyperView menu. The Cook's distances are saved as a variable called Cook(*), where the star abbreviates the response variable name; they can now be used just like any other variable, for example, to construct scatterplots.
9. To create a residual plot automatically in a multiple linear regression model, select Scatterplot studentized residual vs predicted from the regression's HyperView menu. This will create a scatterplot of the studentized residuals on the vertical axis versus the predicted values on the horizontal axis. To create residual plots manually, first create studentized residuals (see computer help #35), and then construct scatterplots with these studentized residuals on the vertical axis.
10. To create a correlation matrix of quantitative variables (useful for checking potential multicollinearity problems), select the variables (it does not matter which are denoted Y or X) and select Calc > Correlations > Pearson Product-Moment.
11. Data Desk does not appear to offer an automatic way to find variance inflation factors in multiple linear regression.
12. To draw a predictor effect plot for graphically displaying the effects of transformed quantitative predictors and/or interactions between quantitative and qualitative predictors in multiple linear regression, first create a variable representing the effect, say, "X1effect" (see computer help #6). Then select the "X1effect" variable (denoted Y), shift-select the X1 variable (denoted X), and select Plot > Scatterplots.
• If the "X1effect" variable just involves X1 (e.g., 1 + 3X1 + 4X1²), the resulting plot should be fine, although the effect will be represented by points rather than a line (as in Section 5.5 in Pardoe 2012). If you would prefer a line, select Add Regression Line from the scatterplot's HyperView menu (as in computer help #26).
• If the "X1effect" variable also involves a qualitative variable (e.g., 1 − 2X1 + 3D2X1, where D2 is an indicator variable), you should then select the qualitative variable and select Modify > Colors > Add > by Group (as in computer help #17) and finally select Add Color Regression Lines from the scatterplot's HyperView menu (as in computer help #33).

See Section 5.5 in Pardoe (2012) for an example.
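
Two quantities in this section that Data Desk does not appear to compute automatically, the nested model F-statistic and variance inflation factors, follow from numbers in ordinary regression output. The Python sketch below is illustrative only: the function names and the example numbers are hypothetical. It assumes you have fitted the reduced and complete models separately and read off each residual (error) sum of squares and residual degrees of freedom, and, for a variance inflation factor, that you have fitted a regression with the predictor of interest as the response and the remaining predictors as predictors and read off its R² as a proportion.

```python
def nested_f_statistic(sse_reduced, sse_complete, df_reduced, df_complete):
    """F-statistic comparing a reduced model with a complete model.

    sse_* are residual sums of squares; df_* are residual degrees of
    freedom (sample size minus number of regression parameters).
    """
    if df_reduced <= df_complete:
        raise ValueError("the reduced model must have more residual df")
    numerator = (sse_reduced - sse_complete) / (df_reduced - df_complete)
    denominator = sse_complete / df_complete
    return numerator / denominator


def variance_inflation_factor(r_squared):
    """VIF for one predictor, given the R-squared (as a proportion)
    from regressing that predictor on the other predictors."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)


if __name__ == "__main__":
    # Hypothetical output: the reduced model drops two predictors.
    f_stat = nested_f_statistic(sse_reduced=100.0, sse_complete=60.0,
                                df_reduced=27, df_complete=25)
    print(round(f_stat, 3))  # compare with an F-percentile on 2 and 25 df

    # Hypothetical R-squared of 0.8 from regressing X1 on the other predictors.
    print(round(variance_inflation_factor(0.8), 3))
```

The F-statistic is compared with a percentile of the F-distribution whose numerator degrees of freedom equal the number of parameters dropped and whose denominator degrees of freedom equal the complete model's residual degrees of freedom; that percentile still has to come from a table or other software.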