Regression: Module #1–Calculations and observations based on a small dataset

Module #1, Interactive Exercise


In this exercise, you will learn how to use the WISE regression applet to deepen your understanding of regression. Using the very small set of data shown below, we will step through relevant regression values and see how they are calculated and how they are represented graphically. Answers are provided for all problems in Module 1. If you have a copy of this handout, go directly to the applet.

Set up the applet:  From the ‘Select a Lesson:’ menu in the lower right hand corner of the applet, choose ‘Regression.’  Remove the checks from all boxes except for the box: Show Regression Line.

X Y
1 2
3 5
5 7
7 6

 

A. Correlation, Slope and Y-intercept. The applet provides these statistics, which are important for regression analysis. Find these terms in the applet and enter each below.

r (correlation) = __________

slope (b) = __________

y-intercept (a) = _________

Check Your Answers

The correlation (r) is .837, the slope of the line (b) is .700, and the intercept (a) is 2.200. These values are taken from the right panel of the applet for Regression.
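The applet's three statistics can be reproduced from the four data points. The sketch below assumes the standard least-squares formulas (slope = sum of cross-products of deviations divided by the sum of squared X deviations; intercept = mean of Y minus slope times mean of X). The variable names are illustrative and are not taken from the applet.

```python
from math import sqrt

# The four (X, Y) pairs from the exercise.
x = [1, 3, 5, 7]
y = [2, 5, 7, 6]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and the sum of cross-products of deviations.
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sp_xy / ss_x               # slope
a = mean_y - b * mean_x        # y-intercept
r = sp_xy / sqrt(ss_x * ss_y)  # correlation

print(f"r = {r:.3f}, b = {b:.3f}, a = {a:.3f}")
# r = 0.837, b = 0.700, a = 2.200
```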

B. The regression equation.  The regression equation is a formula for the straight line that best fits the data. Later we will learn exactly how ‘best fit’ is defined. The regression equation can be used to predict the Y score (called Y´, or Y-prime) for each of our X values. The general form of the regression equation is Y´ = a + bX.

In our example, a=2.2 and b=.700, so the regression equation is Y´ = 2.2 + .700X. Our first X score is 1, which generates a predicted Y score of 2.9, from 2.2 + .7(1).  Calculate the three remaining Y´ values by hand and enter them into the table below.

X    Y    Y´
1    2    2.9
3    5
5    7
7    6

 

If you get stuck

Calculation for the last value is 2.2 + .7×7 = 2.2 + 4.9 = 7.1.

Check Your Answers

4.3, 5.7, and 7.1.
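The hand calculations can be checked with a few lines of code. This is a minimal sketch that plugs the applet's intercept and slope into Y´ = a + bX; the names `a`, `b`, and `y_pred` are our own.

```python
# Intercept and slope reported by the applet for this data set.
a, b = 2.2, 0.7
x = [1, 3, 5, 7]

# Apply Y' = a + bX to each X score.
y_pred = [a + b * xi for xi in x]

# Round to one decimal place to hide floating-point noise.
print([round(v, 1) for v in y_pred])  # [2.9, 4.3, 5.7, 7.1]
```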

C. SS Total (Total Variance).  SS total is the sum of squared deviations of observed Y scores from the mean of Y.  This is an indication of the error we expect if we predict every Y score to be at the mean of Y.  (If X is not available or if X is not useful, then the mean of Y is our best prediction of Y scores.)

To calculate SS Total, take each value of Y, subtract the mean of Y, and square the result; then sum the squared values. A general formula for SS Total is $SS_{Total} = \sum (Y - \bar{Y})^2$, where $\bar{Y}$ is the mean of Y. For these data the mean of Y is 5. For the first case, the squared deviation from the mean is 9. Calculate the values for the last three cases, and sum the values for all four cases in the last column to get SS Total.

X    Y    (Y – mean)     (Y – mean)²
1    2    2 - 5 = -3     (-3)² = 9
3    5
5    7
7    6
                         Sum =

 

Check Your Answers

The squared deviations from the mean for the four cases are 9, 0, 4, and 1, respectively.  SS Total = 14.
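The same column-by-column calculation can be written as a short script. This sketch simply mirrors the table: one list for the deviations from the mean and one for their squares.

```python
y = [2, 5, 7, 6]
mean_y = sum(y) / len(y)                # 5.0

# Deviation of each Y score from the mean of Y, then its square.
deviations = [yi - mean_y for yi in y]  # [-3.0, 0.0, 2.0, 1.0]
squared = [d ** 2 for d in deviations]  # [9.0, 0.0, 4.0, 1.0]

ss_total = sum(squared)
print(ss_total)  # 14.0
```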

D. Deviations from the mean. In the applet, place a check mark in the boxes titled Show SS Total and Show Mean of Y and remove all other checks. The vertical black lines represent the deviations of each case from the mean of Y. Verify that the lengths of these lines correspond to the values in the deviation column (Y minus the mean of Y) of the table in part C. Which case has the largest deviation from the mean?

The largest deviation from the mean is _____ for Case ___.

If you get stuck…

Look at the graph in the applet and at your calculations in the table.

Check Your Answers

The largest deviation from the mean is -3, for Case 1.

E. Contribution to SS Total. Now check the box labeled Show Error as Squares. The sizes of the black squares correspond to the squared deviations from the mean, and the sum of the areas of these squares corresponds to SS Total. Notice how the deviations from the mean for the first and fourth cases are -3 and +1, while the squared deviations are 9 and 1. This shows how points farther from the mean contribute much more to SS Total than points closer to the mean. What is the contribution of the second case to SS Total? Why?

The contribution to SS Total for Case 2 is ______ because (answer below)

Check Your Answers

The contribution of the second case to SS Total is zero, because the Y value of 5 is exactly equal to the mean.

F. Squared deviations. Now calculate the sum of the squared deviations from the mean, $\sum (Y - \bar{Y})^2$. You can do this by adding the values in the squared-deviation column (the last column) of the table in part C.

$\sum (Y - \bar{Y})^2$ = SS Total = ________.   In the applet, SS for Total = ________.

If you get stuck…

You can find this as the sum of the last column in your table. This value is also shown in the applet in the SS column of the Analysis of Variance section.

Check Your Answers

SS Total = 14.

G. SS Total meaning. Explain what SS Total means. How would the plot differ if SS Total was much smaller, say 2.00?  What if SS Total was much larger, say 100?

Check Your Answers

SS Total is the sum of the squared deviations of Y scores from the mean of Y. If SS Total was much smaller, then all of the Y values must be close to the mean. SS Total could be much larger for several reasons: many of the Y values could be somewhat farther from the mean, a few values, or even one value, could be very far from the mean, or we could simply have many more Y values. Note that a single Y value that differed from the mean by 10 points would contribute 100 to SS Total.

There is a close relationship between SS Total and variance. An estimate of the population variance taken from a sample is calculated as the sum of the squared deviations from the mean divided by the degrees of freedom, which is (SS Total) / (n-1) for a single sample.  In our example, this is 14/3 = 4.667. The standard deviation is the square root of the variance, which is 2.16, the value shown in the applet as the std dev for the DV.
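The link between SS Total, variance, and standard deviation can be checked numerically. The sketch below computes the variance as SS Total / (n - 1) and, as a cross-check, compares it with `statistics.variance` from the Python standard library, which uses the same n - 1 denominator.

```python
import statistics

y = [2, 5, 7, 6]
n = len(y)
mean_y = statistics.mean(y)

ss_total = sum((yi - mean_y) ** 2 for yi in y)
variance = ss_total / (n - 1)   # sample variance: SS Total / df
std_dev = variance ** 0.5       # standard deviation

# The library's sample variance should agree with the hand formula.
assert abs(variance - statistics.variance(y)) < 1e-12

print(round(variance, 3), round(std_dev, 2))  # 4.667 2.16
```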

H. SS Error.  SS Error is the sum of squared deviations of observed Y scores from the predicted Y scores when we use information on X to predict Y scores with a regression equation.  SS Error is the part of SS Total that CANNOT be explained by the regression.

Complete the calculations below using the predicted scores (Y´) calculated in part B. The sample mean is 5.0 for every case.

Case    X     Y     Y´     (Y – Y´)            (Y – Y´)²
  1     1     2     2.9    2 - 2.9 = -0.9      (-0.9)² = 0.81
  2     3     5
  3     5     7
  4     7     6
Sum    16    20

 

Check Your Answers

For the second case, Y´ = 4.3, (Y – Y´) = (5 – 4.3) = .7, and (Y – Y´)² = .49.  For the third case, Y´ = 5.7, (Y – Y´) = (7 – 5.7) = 1.3, and (Y – Y´)² = 1.69.  For the fourth case, Y´ = 7.1, (Y – Y´) = (6 – 7.1) = -1.1, and (Y – Y´)² = 1.21.

I. Regression Line and Deviations.  Now place check marks in the boxes titled Show Regression Line and Show SS Error, and remove checks from all other boxes. Deviations of the observed points from their predicted values on the regression line are shown in red.

The largest deviation is for Case ____, and the size of the deviation is ______.

The smallest deviation is for Case ____, and the size of the deviation is ______.

Check Your Answers

The largest deviation is for Case 3, and the size of the deviation is 1.3.

The smallest deviation is for Case 2, and the size of the deviation is .7.

J. Calculating SS Error.  Now check the box titled Show Error as Squares. The sizes of the red squares correspond to the squared deviations. In the table for part H, compare the squared deviations shown in the last column for Cases 2 and 3. Observe how the red boxes for Cases 2 and 3 correspond to these values. The sum of the squared deviations is the sum of the last column in the table.

Record your calculated value here _________.  This is the Sum of Squares Error.

In the applet, under Analysis of Variance, find the value for SS Error.

Check Your Answers

The calculated value for the Sum of Squares Error = SS Error = 4.200.
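SS Error can be verified the same way as SS Total, replacing the mean of Y with the predicted score for each case. This is a small sketch using the applet's intercept and slope; `residuals` is our name for the (Y – Y´) column.

```python
x = [1, 3, 5, 7]
y = [2, 5, 7, 6]
a, b = 2.2, 0.7  # intercept and slope from the applet

# Predicted scores, then observed minus predicted for each case.
y_pred = [a + b * xi for xi in x]
residuals = [yi - yp for yi, yp in zip(y, y_pred)]

ss_error = sum(e ** 2 for e in residuals)

print([round(e, 1) for e in residuals])  # [-0.9, 0.7, 1.3, -1.1]
print(round(ss_error, 3))                # 4.2
```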

K. SS Error Meaning.  Explain in simple English what SS Error means. What would the plot look like if SS Error was very small compared to SS Total? What would the plot look like if SS Error is about as large as SS Total?

Check Your Answers

SS Error is the sum of the squared deviations of observed scores from the predicted scores. If SS Error is very small, every observed score is close to the predicted score, so the plot of every observed score is close to the regression line.

If SS Error is much smaller than SS Total, then the sum of squared deviations around the regression line is much smaller than the sum of squared deviations around the mean. Thus, the regression equation gives much more accurate predictions of scores than simply using the mean as the prediction for all scores. The plot would show a strong linear relationship between X and Y.

If SS Error is about the same size as SS Total, then the regression equation has not improved our prediction of Y scores. The regression line would be close to horizontal at the mean. The plot would not show any indication of a linear relationship between X and Y.

L. SS Predicted. SS Predicted is the part of SS Total that CAN be predicted from the regression.  This corresponds to the sum of squared deviations of predicted values of Y from the mean of Y.

Complete the calculations below using the predicted scores (Y´) calculated for each case in part B and the mean of Y (5).

Case    X     Y     Y´      (Y´ – mean)           (Y´ – mean)²
  1     1     2     2.9     2.9 - 5.0 = -2.1      (-2.1)² = 4.41
  2     3     5
  3     5     7
  4     7     6
Sum    16    20    20.0

 

Check Your Answers

For the second case, the predicted score is 4.3, which is .7 below the mean of 5.0, so the squared deviation of the predicted score from the mean is .49. For the third case, the deviation is +.7, and for the fourth case the deviation is +2.1. The sum of the squared deviations is 9.80.
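The deviations of the predicted scores from the mean can also be computed directly. The sketch below again assumes the applet's intercept and slope; `pred_dev` is an illustrative name for the (Y´ minus mean) column.

```python
x = [1, 3, 5, 7]
y = [2, 5, 7, 6]
a, b = 2.2, 0.7  # intercept and slope from the applet

mean_y = sum(y) / len(y)
y_pred = [a + b * xi for xi in x]

# Deviation of each predicted score from the mean of Y.
pred_dev = [yp - mean_y for yp in y_pred]
ss_predicted = sum(d ** 2 for d in pred_dev)

print([round(d, 1) for d in pred_dev])  # [-2.1, -0.7, 0.7, 2.1]
print(round(ss_predicted, 3))           # 9.8
```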

M. SS Predicted.  Now click the boxes marked Show Mean of Y and Show Regression Line and remove the checks from all other boxes. Check Show SS Predicted to see the deviations of the regression line from the mean, shown in blue.  The blue lines represent the differences between the mean and the predicted scores.  If X were not useful in predicting Y, then the best prediction of Y would be simply the mean of Y for any value of X, and the blue lines would be zero in length. If X is useful in predicting Y, then the predicted values differ from the mean.  The blue lines give an indication of how well X predicts Y.

Click the box marked Show Error as Squares to see the squared deviations of the predicted scores from the mean. Compare these to the red squares for SS Error. (You can click Show SS Error if you would like to be reminded of the size of the red squares.) Is X useful for predicting Y in this plot?  How do you know?

Check Your Answers

Yes, it appears that X is useful in predicting Y in our plot. The blue lines, which indicate predictive ability, are substantial. They are relatively long, compared to the red lines we observed for error deviations, and the blue squares are relatively large compared to the red squares. Thus, it appears that the SS Predicted is substantial.

N. Calculating SS Predicted.  The sum of the squared deviations of the predicted scores from the mean is the sum of the last column in the table in part L.

Record the calculated value here _________.  This is the Sum of Squares Predicted.

In the applet under Analysis of Variance find the value for SS Predicted.

If you get stuck…

Look at the last column in the table in part L.

Check Your Answers

The Sum of Squares Predicted from the Analysis of Variance table in the applet is 9.800, which is also the sum of the last column in the table in part L.

O. SS Predicted Meaning.  Explain what SS Predicted means. What would the plot look like if SS Predicted was very small relative to SS Total?

Check Your Answers

SS Predicted is the sum of the squared deviations of predicted scores from the mean. If the regression model is not at all useful, then the predicted score will be the mean for each case, and SS Predicted will be zero. If the regression model is only slightly helpful, then the predicted scores will be only slightly different from the mean, and SS Predicted will be small relative to SS Total. This plot would show virtually no linear relationship between X and Y, and the regression line would be close to the horizontal line for the mean of Y.

If there is a strong linear relationship in the data, SS Predicted is large relative to SS Error, and the observed data fall close to the regression line.

P. r-squared as proportion of variance explained.  Note that SS Total = SS Predicted + SS Error (14.000 = 9.800 + 4.200). Thus, with the regression model, we split SS Total into two parts, SS Predicted and SS Error. We can compute the proportion of SS Total that is in SS Predicted.  In terms of sums of squares, this is the ratio of SS Predicted to SS Total.

Calculate [SS Predicted / SS Total] = _________ / __________ = ____________.

SS Total is the numerator of the variance of Y (i.e., $\sum (Y - \bar{Y})^2$), so the calculated ratio can be interpreted as the proportion of variance in Y that can be predicted from X using the regression model. A useful fact in regression is that this ratio is equal to the correlation squared (r-squared).  Thus, the correlation squared (r-squared) represents the proportion of variance in Y that can be explained by X, using the regression model.

What does the applet report for the correlation r and r-squared?

r = ______;    r squared = ________

Summarize the relationship between X and Y for this set of data in simple English.

Check Your Answers

[SS Predicted / SS Total] = 9.800 / 14.000 = .700.

The applet reports r = .837 and r squared = .700.

This sample data shows a strong linear relationship, as measured by r=.837. The plot shows this strong positive relationship, with larger values of X generally associated with larger values of Y. In this sample, 70% of the variance in Y can be explained by the linear relationship with X.
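The partition of SS Total and its link to r-squared can be confirmed from the raw data. The sketch below recomputes the slope and intercept with the standard least-squares formulas (an assumption about how the applet arrives at its values), then checks that SS Predicted + SS Error equals SS Total and that SS Predicted / SS Total equals the squared correlation.

```python
from math import sqrt

x = [1, 3, 5, 7]
y = [2, 5, 7, 6]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept.
sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
b = sp_xy / ss_x
a = mean_y - b * mean_x
y_pred = [a + b * xi for xi in x]

# The three sums of squares from parts C, H, and L.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_error = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
ss_predicted = sum((yp - mean_y) ** 2 for yp in y_pred)

r = sp_xy / sqrt(ss_x * ss_total)

print(round(ss_predicted + ss_error, 3))  # 14.0
print(round(ss_predicted / ss_total, 3))  # 0.7
print(round(r ** 2, 3))                   # 0.7
```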

We should note that this is an extremely small sample, and that we would not be able to generalize to the relationship in a population of X and Y values, even if these four cases are a random sample from that population.

