Correlation:

Correlation represents the relationship between two variables.

Correlation is typically used in situations where we have continuously varying independent variables (X).

Example:

Suppose you overhear one of your roommates make a remark that tall people have big feet. Your other roommate, who is 6' 7", says in reply that small people have big mouths. To serve as the mediator in this dispute, as a good social science student, you decide to use your statistical skills to help end the dispute. You sample five students who live in the same apartment complex where you live, and obtain the following results:

Student   Shoe Size   Height (inch)

1         9.0         70

2         6.5         62

3         8.0         66

4         7.5         69

5         8.5         72

Based on this information, you'd like to be able to determine if a person's height and shoe size are related. The idea is that you wish to make predictions from one variable to another.

The first step in answering this question is to construct what is known as a scatter plot.

[Figure: scatter plot of shoe size and height for the five students]

It's apparent from a glance at the scatter plot that tall people do tend to have large feet and short people do tend to have small feet. The relationship in the scatter plot isn't perfect, but the two variables are correlated.

The next problem is to develop a technique for describing the degree of correlation between two variables.

 

The degree of relationship between two variables can be determined by a number of methods, including the method of least squares. The method is illustrated by passing several candidate lines through the data points in our example. If a line passes through a data point, no error is made for that point; if the line misses the data point, an error equal to the vertical distance between the point and the line has been made.

If we use line A as our solution, we make the above errors.

 

If we use Line B as our solution then we make the following errors:

Which is the best line to use?

The best line is that line which provides the least error in prediction. Although line A is the best of the three we've drawn, it may not be the best possible straight line. Calculus is required to determine the formula for the best of all possible lines.
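The comparison of lines by their prediction error can be sketched in Python. The two candidate lines below are made-up stand-ins, since the lines drawn in the figures cannot be recovered from the text, and predicting height from shoe size is an assumed choice of direction. The least-squares line is computed with the standard closed-form formulas, and the errors are squared before summing, as the method of least squares requires.

```python
shoe = [9.0, 6.5, 8.0, 7.5, 8.5]   # X: shoe sizes of the five students
height = [70, 62, 66, 69, 72]      # Y: heights in inches

def sum_squared_errors(a, b, x, y):
    """Sum of squared vertical distances between each point and the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares_line(x, y):
    """Intercept and slope of the line that minimizes the sum of squared errors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b

# Hypothetical candidate lines (intercept, slope), chosen only for illustration.
candidates = {"candidate 1": (40.0, 3.5), "candidate 2": (50.0, 2.0)}
errors = {name: sum_squared_errors(a, b, shoe, height)
          for name, (a, b) in candidates.items()}

a_ols, b_ols = least_squares_line(shoe, height)
errors["least squares"] = sum_squared_errors(a_ols, b_ols, shoe, height)

for name, sse in errors.items():
    print(f"{name}: SSE = {sse:.2f}")
```

Whatever candidate lines are tried, none can have a smaller sum of squared errors than the least-squares line.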

 

This best-fitting straight line is called the regression line and the slope of the regression line (when X and Y are expressed as z scores) is equal to the correlation coefficient.
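The claim that the slope equals the correlation coefficient once both variables are z scores can be checked numerically. The sketch below uses the five students from the example; the helper names are illustrative, and the z scores assume the sample standard deviation (divisor n - 1).

```python
import math

shoe = [9.0, 6.5, 8.0, 7.5, 8.5]   # shoe sizes of the five students
height = [70, 62, 66, 69, 72]      # heights in inches

def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
            / sum((xi - mean_x) ** 2 for xi in x))

def correlation(x, y):
    """Pearson correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

z_slope = slope(z_scores(shoe), z_scores(height))
r = correlation(shoe, height)
print(round(z_slope, 4), round(r, 4))  # the two values agree
```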

Linear Equations:

a: Y-intercept

b: slope

y = a + bx

This is what is termed a deterministic function; y is a transformation of x, and there is no error term. We can use this formula to predict the change in y per unit change in x.

 

For example, the formula Y = 3 + (3/2)X indicates that when X increases by 1, Y increases by 3/2.

i X Y

1 0 3

2 -2 0

b = slope = (Y2 - Y1) / (X2 - X1) = (0 - 3) / (-2 - 0) = 3/2

(Slope is the first derivative.)

a is the Y-intercept because when X = 0 then Y = a (in this case 3). This is the point where the line crosses the Y-axis.

i X Y

1 -1 3

2 0 3

3 1 3

X does not affect Y because a change in X does not change Y from being equal to 3.

Y is always equal to 3.

i X Y

1 0 3

2 2 0

When X increases by 1, Y decreases by 3/2, so the slope is negative (i.e., - 3/2).
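The intercept and slope of each of the three deterministic examples can be recovered from any two of its points. The helper `line_through` below is an illustrative name, not something from the notes; it assumes the two points have different X values.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a + b*x through two points with different x."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # slope: change in y per unit change in x
    a = y1 - b * x1             # intercept: value of y when x = 0
    return a, b

# Two (X, Y) points taken from each of the three tables above.
examples = {
    "positive slope": ((0, 3), (-2, 0)),
    "zero slope": ((-1, 3), (0, 3)),
    "negative slope": ((0, 3), (2, 0)),
}
lines = {name: line_through(p1, p2) for name, (p1, p2) in examples.items()}

for name, (a, b) in lines.items():
    print(f"{name}: a = {a}, b = {b}")
```

All three lines share the intercept a = 3; only the slope differs.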

All three of the preceding examples are deterministic functions. In real-world data, X and Y are almost never perfectly correlated. In other words, the relationship is not deterministic, so the observed points do not fall exactly on a straight line.

In regression analysis we find the best fitting line for a scatterplot of data. This provides a compact summary of the relationship between X and Y.

In regression analysis, we find an average value of Y for a given X. This is known as a conditional mean function.

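The regression model:

$Y_i = a + bX_i + e_i$

The prediction equation or the regression line:

$\hat{Y}_i = a + bX_i$

The error term or residual:

$e_i = Y_i - \hat{Y}_i$

In regression analysis we pick a and b to minimize $\sum e_i^2$, the sum of the squared residuals.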

The regression model summarizes the relationship between X and Y.

Since:

$e_i = Y_i - \hat{Y}_i$

By substitution:

$e_i = Y_i - (a + bX_i)$

Then:

$\sum e_i^2 = \sum (Y_i - a - bX_i)^2$

Ordinary Least Squares (OLS):

We pick a and b to minimize $\sum e_i^2$.

$\sum e_i^2$ is also known as SSE (i.e., the error sum of squares in regression).

OLS minimizes $SSE = \sum (Y_i - a - bX_i)^2$.

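To minimize, set the partial derivatives with respect to a and b equal to 0, and solve for a and b.

$\frac{\partial}{\partial a} \sum (Y_i - a - bX_i)^2 = -2 \sum (Y_i - a - bX_i) = 0$

$\frac{\partial}{\partial b} \sum (Y_i - a - bX_i)^2 = -2 \sum X_i (Y_i - a - bX_i) = 0$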

These are the normal equations, which give the solution to the OLS minimization problem.

There are two normal equations with two unknowns (i.e., a and b).

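The solution is:

$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$

$a = \bar{Y} - b\bar{X}$

One of the implications is that the regression line always passes through the point $(\bar{X}, \bar{Y})$.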
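As a rough numerical check of the OLS solution, the sketch below fits the five students from the shoe-size example, taking shoe size as X and height as Y (an assumed direction). It then verifies the two conditions implied by the normal equations, that the residuals sum to zero and are uncorrelated with X, and that the fitted line passes through the point of means.

```python
shoe = [9.0, 6.5, 8.0, 7.5, 8.5]   # X: shoe sizes
height = [70, 62, 66, 69, 72]      # Y: heights in inches

n = len(shoe)
mean_x = sum(shoe) / n
mean_y = sum(height) / n

# Closed-form OLS solution for slope b and intercept a.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(shoe, height))
     / sum((x - mean_x) ** 2 for x in shoe))
a = mean_y - b * mean_x

# Residuals e_i = Y_i - (a + b * X_i).
residuals = [y - (a + b * x) for x, y in zip(shoe, height)]

# The normal equations require both of these sums to be zero
# (up to floating-point rounding).
sum_e = sum(residuals)
sum_xe = sum(x * e for x, e in zip(shoe, residuals))

# The fitted line evaluated at the mean of X should return the mean of Y.
predicted_at_mean = a + b * mean_x

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"sum of residuals = {sum_e:.2e}, sum of X * residual = {sum_xe:.2e}")
```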

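Bivariate Regression Coefficient:

$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$

Another way of saying this is that $b = \frac{s_{XY}}{s_X^2}$, the covariance of X and Y divided by the variance of X.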

Covariance:

$s_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

Indicates the extent to which X and Y have a linear relationship.

A larger positive $s_{XY}$ indicates a stronger positive linear relationship.

A more negative $s_{XY}$ indicates a stronger negative linear relationship.

If $s_{XY}$ = 0, then there is no linear relationship.

Example:

i   X    Y    $X - \bar{X}$   $Y - \bar{Y}$   $(X - \bar{X})(Y - \bar{Y})$

1   -2   -2   -2   -2   4

2   -1   0    -1   0    0

3   2    4    2    4    8

4   3    2    3    2    6

5   -1   -1   -1   -1   1

6   -1   -3   -1   -3   3

 

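i   X    Y    $X - \bar{X}$   $Y - \bar{Y}$   $(X - \bar{X})(Y - \bar{Y})$

1   2    2    2    2    4

2   -2   2    -2   2    -4

3   -2   -2   -2   -2   4

4   2    -2   2    -2   -4

$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 0$, so the covariance is 0 and there is no linear relationship.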
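The two example tables can be reproduced with a few lines of Python. The function names are illustrative, and dividing by n - 1 (the usual sample covariance) is an assumption, since the divisor is not shown in the text above.

```python
def sum_cross_products(x, y):
    """Sum of (X - mean of X) * (Y - mean of Y) over all observations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

def sample_covariance(x, y):
    """Sample covariance; the n - 1 divisor is an assumed convention."""
    return sum_cross_products(x, y) / (len(x) - 1)

# First table: six observations with a positive relationship.
x1 = [-2, -1, 2, 3, -1, -1]
y1 = [-2, 0, 4, 2, -1, -3]

# Second table: four observations with no linear relationship.
x2 = [2, -2, -2, 2]
y2 = [2, 2, -2, -2]

scp1, cov1 = sum_cross_products(x1, y1), sample_covariance(x1, y1)
scp2, cov2 = sum_cross_products(x2, y2), sample_covariance(x2, y2)

print(scp1, cov1)
print(scp2, cov2)
```

The cross-products in the first table sum to 22, so the covariance is positive; in the second table they cancel exactly.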

The covariance indicates the strength and direction of a linear relationship between X and Y.

A covariance is an indicator of the extent to which X and Y have a linear relationship. It doesn't tell us the exact position of the line in the coordinate system. For that, we must use regression.

The covariance ranges from - infinity to + infinity, and its magnitude depends on the units in which X and Y are measured, not on their means.

It is usually easier to interpret a standardized measure (z score).
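A small sketch of why a standardized measure is easier to interpret: converting the heights in the shoe-size example from inches to centimeters changes the covariance but leaves the correlation untouched. The centimeter conversion and the n - 1 divisor are assumptions made for illustration.

```python
import math

shoe = [9.0, 6.5, 8.0, 7.5, 8.5]
height_in = [70, 62, 66, 69, 72]               # heights in inches
height_cm = [h * 2.54 for h in height_in]      # the same heights in centimeters

def covariance(x, y):
    """Sample covariance with an n - 1 divisor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

cov_in = covariance(shoe, height_in)
cov_cm = covariance(shoe, height_cm)
r_in = correlation(shoe, height_in)
r_cm = correlation(shoe, height_cm)

print(f"covariance: {cov_in:.3f} (inches) vs {cov_cm:.3f} (cm)")
print(f"correlation: {r_in:.4f} (inches) vs {r_cm:.4f} (cm)")
```

The covariance is multiplied by 2.54 along with the heights, while the correlation is the same in both units.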

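Correlation Coefficient:

$r = \frac{s_{XY}}{s_X s_Y}$

where $s_{XY}$ is the covariance of X and Y, and $s_X$ and $s_Y$ are their standard deviations.

$r^2$ (coefficient of determination)

The coefficient of determination is calculated:

$r^2 = \frac{s_{XY}^2}{s_X^2 s_Y^2}$

The correlation coefficient ranges from -1 (perfect inverse association) to +1 (perfect positive association). It equals 0 if there is no linear association.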
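To make the coefficient of determination concrete, the sketch below computes r and r squared for the shoe-size example. It also compares r squared with 1 - SSE/SST, the share of the variation in Y accounted for by the regression line; that identity is a standard result for a line fitted with an intercept and is brought in here as an assumption, since the text above does not state it.

```python
import math

shoe = [9.0, 6.5, 8.0, 7.5, 8.5]   # X: shoe sizes
height = [70, 62, 66, 69, 72]      # Y: heights in inches

n = len(shoe)
mean_x = sum(shoe) / n
mean_y = sum(height) / n

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(shoe, height))
sxx = sum((x - mean_x) ** 2 for x in shoe)
syy = sum((y - mean_y) ** 2 for y in height)   # SST: total variation in Y

r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
r_squared = r ** 2               # coefficient of determination

# Fit the regression line and compute the error sum of squares (SSE).
b = sxy / sxx
a = mean_y - b * mean_x
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(shoe, height))
explained_share = 1 - sse / syy

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}, 1 - SSE/SST = {explained_share:.4f}")
```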

 

In summary:

 

Examples: