
R/S-Plus

These instructions accompany Applied Regression Modeling by Iain Pardoe, 2nd edition published by Wiley in 2012. The numbered items cross-reference with the "computer help" references in the book. These instructions are based on the programming interface (command line) of R 1.40 for Mac OS X, but they should also work for other versions of R and also S-PLUS. Find instructions for other statistical software packages here.

Getting started and summarizing univariate data

  1. Change R's default options by selecting R > Preferences (Mac) or Edit > GUI Preferences (Windows).
  2. To open a comma-separated (CSV) data file, type mydata <- read.csv("file.csv"), where file.csv is the name of the data file (with the correct path specified if necessary). One way to then make the dataset available for analysis is to type attach(mydata). You will now be able to analyze each of the variables in the mydata dataset. To see what these variables are, type names(mydata). When you are finished with this dataset, type detach(mydata). We do not go into the other methods for working with R datasets here—see an R companion book such as Fox and Weisberg (2011) for further discussion of this.
  3. To recall a previously entered command, hit the "up" arrow (repeatedly) on your keyboard.
  4. Output appears in the main R Console Window and can be copied and pasted from R to a word processor like OpenOffice Writer or Microsoft Word. Graphs appear in separate windows and can also easily be copied and pasted to other applications.
  5. You can access help by selecting Help > R help. For example, to find out about "boxplots" click Search Engine & Keywords, type boxplots in the search box, and click on one of the resulting hits.
  6. To transform data or compute a new variable, type, for example, logX <- log(X) for the natural logarithm of X and Xsq <- X**2 for X². If you get the error message "unexpected input in ...," this means that there is a syntax error in your expression—a common mistake is to forget the multiplication symbol (*) between a number and a variable (e.g., 2*X represents 2X). Created variables like this will be placed in R's "global environment." To see a list of the variables in the global environment, type ls(). Variables in the global environment take precedence over identically named variables in attached datasets. To avoid any confusion, avoid creating a variable (or other R "object") that has the same name as a variable in an attached dataset.
  7. To create indicator (dummy) variables from a qualitative variable, type, for example, D1 <- ifelse(X=="level", yes=1, no=0), where X is the qualitative variable and level is the name of one of the categories in X. Repeat for other indicator variables (if necessary).
    • To find a percentile (critical value) for a t-distribution, type qt(p, df, lower.tail=F), where p is the one-tail significance level (upper-tail area) and df is the degrees of freedom. For example, qt(0.05, 29, lower.tail=F) returns the 95th percentile of the t-distribution with 29 degrees of freedom (1.699), which is the critical value for an upper-tail test with a 5% significance level. By contrast, qt(0.025, 29, lower.tail=F) returns the 97.5th percentile of the t-distribution with 29 degrees of freedom (2.045), which is the critical value for a two-tail test with a 5% significance level.
    • To find a percentile (critical value) for an F-distribution, type qf(p, df1, df2, lower.tail=F), where p is the significance level (upper-tail area), df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. For example, qf(0.05, 2, 3, lower.tail=F) returns the 95th percentile of the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (9.552).
    • To find a percentile (critical value) for a chi-squared distribution, type qchisq(p, df, lower.tail=F), where p is the significance level (upper-tail area) and df is the degrees of freedom. For example, qchisq(0.05, 2, lower.tail=F) returns the 95th percentile of the chi-squared distribution with 2 degrees of freedom (5.991).
    • To find an upper-tail area (one-tail p-value) for a t-distribution, type pt(t, df, lower.tail=F), where t is the absolute value of the t-statistic and df is the degrees of freedom. For example, pt(2.40, 29, lower.tail=F) returns the upper-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.012), which is the p-value for an upper-tail test. By contrast, 2*pt(2.40, 29, lower.tail=F) returns the two-tail area for a t-statistic of 2.40 from the t-distribution with 29 degrees of freedom (0.023), which is the p-value for a two-tail test.
    • To find an upper-tail area (p-value) for an F-distribution, type pf(f, df1, df2, lower.tail=F), where f is the value of the F-statistic, df1 is the numerator degrees of freedom, and df2 is the denominator degrees of freedom. For example, pf(51.4, 2, 3, lower.tail=F) returns the upper-tail area (p-value) for an F-statistic of 51.4 for the F-distribution with 2 numerator degrees of freedom and 3 denominator degrees of freedom (0.005).
    • To find an upper-tail area (p-value) for a chi-squared distribution, type pchisq(chisq, df, lower.tail=F), where chisq is the value of the chi-squared statistic and df is the degrees of freedom. For example, pchisq(0.38, 2, lower.tail=F) returns the upper-tail area (p-value) for a chi-squared statistic of 0.38 for the chi-squared distribution with 2 degrees of freedom (0.827).
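The quantile and tail-area functions above follow one pattern: q-functions turn an upper-tail area into a critical value, and p-functions turn an observed statistic into a p-value. A minimal sketch collecting the examples from the bullets above:

```r
# Critical values (percentiles) for a given upper-tail area
qt(0.05, 29, lower.tail = FALSE)      # t critical value, 29 df (1.699)
qf(0.05, 2, 3, lower.tail = FALSE)    # F critical value, 2 and 3 df (9.552)
qchisq(0.05, 2, lower.tail = FALSE)   # chi-squared critical value, 2 df (5.991)

# Tail areas (p-values) for observed test statistics
pt(2.40, 29, lower.tail = FALSE)      # upper-tail p-value (0.012)
2 * pt(2.40, 29, lower.tail = FALSE)  # two-tail p-value (0.023)
pf(51.4, 2, 3, lower.tail = FALSE)    # F-test p-value (0.005)
pchisq(0.38, 2, lower.tail = FALSE)   # chi-squared p-value (0.827)
```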
  8. Calculate descriptive statistics for quantitative variables by typing summary(Y), where Y is the quantitative variable. Other, more specific commands include mean(Y), median(Y), sd(Y), min(Y), max(Y), and quantile(Y, probs=c(.25,.5,.75), type=7), where probs is a numeric vector of probabilities with values in [0, 1] and type is an integer between 1 and 9 selecting one of nine quantile (percentile) algorithms (the default algorithm 7 gives slightly different results than algorithm 6 used by SPSS and Minitab).
  9. Create contingency tables or cross-tabulations for qualitative variables by typing table(X1, X2), where X1 and X2 are the qualitative variables. Calculate row sums by typing rowsum <- apply(table(X1, X2), 1, sum). Then the row percentages are 100*table(X1, X2)/rowsum. Calculate column sums by typing colsum <- apply(table(X1, X2), 2, sum). Then the column percentages are 100*t(t(table(X1, X2))/colsum) (the function t() stands for "transpose").
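Base R's prop.table() function computes the same row or column proportions in one step (margin=1 for rows, margin=2 for columns). A minimal sketch with hypothetical qualitative variables X1 and X2:

```r
# Hypothetical qualitative variables
X1 <- c("a", "a", "b", "b", "b", "a")
X2 <- c("x", "y", "x", "x", "y", "y")
tab <- table(X1, X2)

# Row and column percentages via prop.table()
100 * prop.table(tab, margin = 1)  # each row sums to 100
100 * prop.table(tab, margin = 2)  # each column sums to 100
```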
  10. If you have a quantitative variable and a qualitative variable, you can calculate descriptive statistics for cases grouped in different categories by typing, for example, aggregate(Y, by, FUN), where Y is the quantitative variable, by is a list of the qualitative variables (e.g., list(X1=X1, X2=X2)), and FUN is the name of the function for calculating the descriptive statistic (e.g., mean).
  11. To make a stem-and-leaf plot for a quantitative variable, type stem(Y, scale=1), where Y is the quantitative variable and scale controls the plot length.
  12. To make a histogram for a quantitative variable, type hist(Y, freq=TRUE, breaks=c(1,2,3,4)), where Y is the quantitative variable, freq is a logical value (if TRUE, the histogram represents frequencies, if FALSE, it represents probability densities), and breaks specifies how to construct the breakpoints.
  13. To make a scatterplot with two quantitative variables, type plot(X, Y), where X is the horizontal axis variable and Y is the vertical axis variable.
  14. All possible scatterplots for more than two variables can be drawn simultaneously (called a scatterplot matrix) by typing pairs(cbind(Y, X1, X2)), where Y, X1, and X2 are quantitative variables.
  15. You can mark or label cases in a scatterplot with different colors/symbols according to categories in a qualitative variable by using the points() function. For example, in a dataset of 20 observations, suppose X2 contains values 1-4 to represent four categories, and Y and X1 are two quantitative variables. Then the following code produces a scatterplot with numbers (representing the value of X2) marking the points:
    plot(X1, Y, type="n")
    for (i in 1:20) points(X1[i], Y[i], pch=as.character(X2[i]))
  16. You can identify individual cases after drawing a scatterplot by typing identify(X, Y, labels), where X is the horizontal axis variable, Y is the vertical axis variable, and labels is an optional variable giving labels for the points. After typing this command, left-click on points on the scatterplot to make labels appear; when you're finished, right-click (and select Stop if needed).
  17. To remove one or more observations from a dataset, type, for example, mydatax <- mydata[-1,], which would remove observation #1, or mydatax <- mydata[-c(1,3),], which would remove observations #1 and #3. To work with the new dataset, type detach(mydata) and attach(mydatax).
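Negative row indices drop the corresponding rows and keep everything else. A minimal sketch with a hypothetical three-observation dataset:

```r
# Hypothetical dataset with three observations
mydata <- data.frame(Y = c(10, 20, 30), X = c(1, 2, 3))

# Negative indices remove rows: drop observation #1
mydatax <- mydata[-1, ]

# Drop observations #1 and #3
mydatax2 <- mydata[-c(1, 3), ]
```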
  18. To make a bar chart for cases in different categories, use the barplot() function.
    • For frequency bar charts of one qualitative variable, type barplot(table(X1)), where X1 is a qualitative variable.
    • For frequency bar charts of two qualitative variables, type barplot(table(X1, X2), beside=TRUE, legend.text=TRUE), where X1 and X2 are qualitative variables and beside is a logical value for whether the chart is clustered (TRUE) or stacked (FALSE).
    • The bars can also represent various summary functions for a quantitative variable. For example, to produce a bar chart of means, type:
      means <- aggregate(Y, by=list(X1=X1, X2=X2), mean)
      barplot(tapply(means$x, list(means$X1, means$X2), sum), beside=TRUE)
      where X1 and X2 are the qualitative variables and Y is a quantitative variable.
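Equivalently, tapply() can compute the table of group means in a single step, returning a matrix that barplot() accepts directly. A minimal sketch with hypothetical variables:

```r
# Hypothetical data: quantitative Y grouped by two qualitative variables
X1 <- factor(c("a", "a", "b", "b"))
X2 <- factor(c("u", "v", "u", "v"))
Y  <- c(1, 2, 3, 4)

# Matrix of group means; rows index X1 categories, columns index X2
m <- tapply(Y, list(X1, X2), mean)
barplot(m, beside = TRUE, legend.text = rownames(m))
```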
  19. To make boxplots for cases in different categories, use the boxplot() function.
    • For just one qualitative variable, type boxplot(Y ~ X1).
    • For two qualitative variables, type boxplot(Y ~ X1 + X2), where Y is a quantitative variable, and X1 and X2 are qualitative variables. Alternatively, use the bwplot() function in the lattice package, which has syntax bwplot(Y ~ X1 | X2).
  20. To make a QQ-plot (also known as a normal probability plot) for a quantitative variable, type qqnorm(Y), where Y is a quantitative variable. To add a diagonal line to aid interpretation, type qqline(Y).
  21. To compute a confidence interval for a univariate population mean, type t.test(Y, conf.level=0.95), where Y is the variable for which you want to calculate the confidence interval, and conf.level is the confidence level of the interval.
  22. To do a hypothesis test for a univariate population mean, type t.test(Y, mu=value, alternative="two.sided"), where Y is the variable for which you want to do the test, mu is the (null) hypothesized value, and alternative is "two.sided" for two-tailed, "less" for lower-tailed, or "greater" for upper-tailed.
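Both the confidence interval (help #21) and the hypothesis test (help #22) come from the same t.test() call; its return value also stores the interval and p-value as components. A minimal sketch with hypothetical data:

```r
# Hypothetical univariate sample
Y <- c(5.1, 4.8, 5.3, 5.0, 4.9, 5.2)

# 95% confidence interval for the population mean
t.test(Y, conf.level = 0.95)$conf.int

# Two-tailed test of H0: mu = 5
t.test(Y, mu = 5, alternative = "two.sided")$p.value
```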

Simple linear regression

  1. To fit a simple linear regression model (i.e., find a least squares line), type model <- lm(Y ~ X), where Y is the response variable and X is the predictor variable, and then type summary(model). In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type model <- lm(Y ~ X - 1).
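A minimal sketch with hypothetical data, showing the fit and two ways to inspect the estimates:

```r
# Hypothetical data for a simple linear regression
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

model <- lm(Y ~ X)
summary(model)   # coefficients, standard errors, R-squared, F-statistic
coef(model)      # just the intercept and slope estimates
```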
  2. To add a regression line or least squares line to a scatterplot, type plot(X, Y), where X is the predictor variable and Y is the response variable, and then type lines(sort(X), fitted(model)[order(X)]), where model is the name of the fitted model object (see help #25).
  3. To find confidence intervals for the regression parameters in a simple linear regression model, type confint(model, level=0.95), where model is the name of the fitted model object (see help #25) and level is the confidence level required. This applies more generally to multiple linear regression also.
    • To find a fitted value or predicted value of Y (the response variable) at a particular value of X (the predictor variable), type predict(model), where model is the name of the fitted model object (see help #25). This returns the fitted or predicted values of Y at each of the X-values in the dataset.
    • You can also obtain a fitted or predicted value of Y at an X-value that is not in the dataset by typing predict(model, newdata=data.frame(X=a)), where newdata contains a data frame with variables that have the same names as the predictors in the model (X in this case), and a is the particular X-value that we are interested in.
    • This applies more generally to multiple linear regression also.
    • To find a confidence interval for the mean of Y at a particular value of X, type predict(model, interval="confidence"), where model is the name of the fitted model object (see help #25). The confidence intervals for the mean of Y at each of the X-values in the dataset are displayed as two columns headed lwr and upr.
    • You can also obtain a confidence interval for the mean of Y at an X-value that is not in the dataset by typing predict(model, newdata=data.frame(X=a), interval="confidence"), where newdata contains a data frame with variables that have the same names as the predictors in the model (X in this case), and a is the particular X-value that we are interested in.
    • This applies more generally to multiple linear regression also.
    • To find a prediction interval for an individual value of Y at a particular value of X, type predict(model, interval="prediction"), where model is the name of the fitted model object (see help #25). The prediction intervals for individual values of Y at each of the X-values in the dataset are displayed as two columns headed lwr and upr.
    • You can also obtain a prediction interval for an individual value of Y at an X-value that is not in the dataset by typing predict(model, newdata=data.frame(X=a), interval="prediction"), where newdata contains a data frame with variables that have the same names as the predictors in the model (X in this case), and a is the particular X-value that we are interested in.
    • This applies more generally to multiple linear regression also.
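The bullets above all use predict() with different interval arguments; at any given X-value the prediction interval is always wider than the confidence interval, since it accounts for the variability of an individual observation as well as the uncertainty in the estimated mean. A minimal sketch with hypothetical data:

```r
# Hypothetical simple linear regression
X <- c(1, 2, 3, 4, 5)
Y <- c(2, 4, 5, 8, 10)
model <- lm(Y ~ X)

# Fitted value, confidence interval, and prediction interval at X = 2.5
newd <- data.frame(X = 2.5)
predict(model, newdata = newd)                           # point estimate
predict(model, newdata = newd, interval = "confidence")  # for the mean of Y
predict(model, newdata = newd, interval = "prediction")  # for an individual Y
```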

Multiple linear regression

  1. To fit a multiple linear regression model, type model <- lm(Y ~ X1 + X2), where Y is the response variable and X1 and X2 are the predictor variables, and then type summary(model). If you wish to include an interaction, either create an interaction term first using help #6 and include it in the list of model terms or, alternatively, type model <- lm(Y ~ X1 + X2 + X1:X2). In the rare circumstance that you wish to fit a model without an intercept term (regression through the origin), type model <- lm(Y ~ X1 + X2 - 1).
  2. To add a quadratic regression line to a scatterplot, first fit the quadratic regression model, model <- lm(Y ~ X + Xsq), where Y is the response variable, X is the predictor variable, and Xsq is X². To produce a smooth quadratic regression line, it might be necessary to create a "more even" version of X for use in the graph, for example, newX <- seq(min(X), max(X), length=100) and newXsq <- newX**2. Next, type plot(X, Y) and then type lines(newX, predict(model, newdata=data.frame(X=newX, Xsq=newXsq))).
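Putting those steps together, a minimal sketch with hypothetical data that follow a roughly quadratic pattern:

```r
# Hypothetical curved relationship
X <- c(1, 2, 3, 4, 5, 6)
Y <- c(2, 5, 10, 17, 25, 37)
Xsq <- X**2
model <- lm(Y ~ X + Xsq)

# Even grid of X-values so the fitted curve plots smoothly
newX <- seq(min(X), max(X), length = 100)
newXsq <- newX**2
plot(X, Y)
lines(newX, predict(model, newdata = data.frame(X = newX, Xsq = newXsq)))
```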
  3. Categories of a qualitative variable can be thought of as defining subsets of the sample. If there is also a quantitative response and a quantitative predictor variable in the dataset, a regression model can be fit to the data to represent separate regression lines for each subset. For example, in a dataset of 20 observations, suppose X2 contains the values 1-4 to represent four categories, and Y and X1 are two quantitative variables. Create an interaction term, X1X2 <- X1*X2, and fit the model, model <- lm(Y ~ X1 + X2 + X1X2). Then the following code produces a scatterplot with numbers (representing the value of X2) instead of circles marking the points, and four separate regression lines:
    plot(X1, Y, type="n")
    for (i in 1:20) points(X1[i], Y[i], pch=as.character(X2[i]))
    for (i in 1:4) lines(X1[X2==i], fitted(model)[X2==i])
  4. To find the F-statistic and associated p-value for a nested model F-test in multiple linear regression, first fit the reduced model, for example, model1 <- lm(Y ~ X1), and the complete model, for example, model2 <- lm(Y ~ X1 + X2 + X3). Then type anova(model1, model2). The F-statistic is in the second row of the "Analysis of Variance Table" in the column headed F, while the associated p-value is in the column headed Pr(>F).
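A minimal sketch with simulated (hypothetical) data, showing where the F-statistic and p-value appear:

```r
# Simulated data: Y depends on X1 only; X2 and X3 are noise predictors
set.seed(1)
X1 <- rnorm(20)
X2 <- rnorm(20)
X3 <- rnorm(20)
Y <- 1 + 2 * X1 + rnorm(20)

model1 <- lm(Y ~ X1)             # reduced model
model2 <- lm(Y ~ X1 + X2 + X3)   # complete model
anova(model1, model2)            # F-statistic and Pr(>F) in the second row
```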
  5. To save residuals in a multiple linear regression model, type res <- residuals(model), where model is the name of the fitted model object (see help #31). They can now be used just like any other variable, for example, to construct residual plots. To save what Pardoe (2012) calls standardized residuals, type sta <- rstandard(model). To save what Pardoe (2012) calls studentized residuals, type stu <- rstudent(model).
  6. To add a loess fitted line to a scatterplot (useful for checking the zero mean regression assumption in a residual plot), type, for example, scatter.smooth(fitted(model), rstudent(model), span=0.75), where model is the name of the fitted model object (see help #31).
  7. To save leverages in a multiple linear regression model, type lev <- hatvalues(model), where model is the name of the fitted model object (see help #31). They can now be used just like any other variable, for example, to construct scatterplots.
  8. To save Cook's distances in a multiple linear regression model, type cook <- cooks.distance(model), where model is the name of the fitted model object (see help #31). They can now be used just like any other variable, for example, to construct scatterplots.
  9. To create some residual plots automatically in a multiple linear regression model, type plot(model), where model is the name of the fitted model object (see help #31). This produces a plot of residuals versus fitted values, a QQ-plot of the standardized residuals, a plot of the square root of the standardized residuals versus fitted values, and a plot of standardized residuals versus leverages (with Cook's distance thresholds of 0.5 and 1 marked)—hit Return to cycle through the plots. To create residual plots manually, first create studentized residuals (see help #35), and then construct scatterplots with these studentized residuals on the vertical axis.
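Helps #33-37 can be combined into one workflow: extract each diagnostic quantity from the fitted model object, then plot the ones you need. A minimal sketch with simulated (hypothetical) data:

```r
# Simulated data for a multiple linear regression
set.seed(2)
X1 <- rnorm(30)
X2 <- rnorm(30)
Y <- 1 + X1 - X2 + rnorm(30)
model <- lm(Y ~ X1 + X2)

res  <- residuals(model)       # raw residuals
sta  <- rstandard(model)       # standardized residuals
stu  <- rstudent(model)        # studentized residuals
lev  <- hatvalues(model)       # leverages
cook <- cooks.distance(model)  # Cook's distances

# Manual residual plot: studentized residuals versus fitted values
plot(fitted(model), stu)
```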
  10. To create a correlation matrix of quantitative variables (useful for checking potential multicollinearity problems), type cor(cbind(Y, X1, X2)), where Y, X1, and X2 are quantitative variables.
  11. To find variance inflation factors in multiple linear regression, first install and load the car package of Fox and Weisberg (2011). Then type vif(model), where model is the name of the fitted model object (see help #31).
  12. To draw a predictor effect plot for graphically displaying the effects of transformed quantitative predictors and/or interactions between quantitative and qualitative predictors in multiple linear regression, first create a variable representing the effect, say, "X1effect" (see help #6).
    • If the "X1effect" variable just involves X1 (e.g., 1 + 3X1 + 4X12), type plot(sort(X1), X1effect[order(X1)], type="l").
    • If the "X1effect" variable involves a qualitative variable (e.g., 1 − 2X1 + 3D2X1, where D2 is an indicator variable), type the following:
      X1effect0 <- X1effect[D2==0]
      X1effect1 <- X1effect[D2==1]
      plot(X1, X1effect, type="n")
      lines(sort(X1[D2==0]), X1effect0[order(X1[D2==0])])
      lines(sort(X1[D2==1]), X1effect1[order(X1[D2==1])])

    See Section 5.5 in Pardoe (2012) for an example.