Suppose we have two variables, and our goal is to describe the relationship between the two variables.

The question: does one variable cause changes or explain changes in the other variable? This would imply a causal relationship.

Example: in young children, as they get older they gain weight and grow taller. Hence changes in age cause (explain) changes in weight and height.

Often two variables are associated, and yet one variable does not cause changes in the other variable.

Example: high math SAT scores are often associated with high verbal SAT scores, but one does not cause the other.

Suppose we have two quantitative variables X and Y.

We want to describe the relationship between X and Y by writing Y as a linear function of X.

This linear function will then be used to predict values of Y for specified values of X.

X is called the independent or explanatory variable: a measurement variable that has no constraints placed on it and that attempts to explain the observed outcomes of Y.

Example: X = age of child

Y is called the dependent or response variable, and is the measurement variable that measures an outcome of a process that is the effect or consequence of the independent variable.

Example: Y = weight of child

As a child ages they gain weight; hence the process is the aging process, and growing older causes weight gain.

A variable which has an important effect on the relationship between the independent (X) and dependent (Y) variables but which is not included in the list of variables being studied is called a lurking variable.

When a lurking variable exists we see an association between the two variables, but we cannot say that one variable is causing changes in the other.

When a lurking variable exists, then we say that the effect that X is having on Y is confounded with the effect of the lurking variable.

So it appears that X is causing changes in Y, but really the lurking variable is involved in the relationship and hence confounds the results.

We now turn our attention to describing the relationship between the independent variable X and the dependent variable Y. A complete description of this relationship includes specifying the direction, form, and strength of the relationship, and to accurately describe these three things we need both a graph and a numerical descriptor. In the two sections that follow we learn about the scatterplot (a graph) and the correlation coefficient (a numerical descriptor).

The scatterplot is a graphical procedure for displaying the relationship between two quantitative variables:

Label X along the horizontal axis.

Label Y along the vertical axis.

Plot each (X, Y) observation on the plot.

Types of association between X and Y:

1. Two variables are positively associated if small values of X are associated with small values of Y, and large values of X are associated with large values of Y.

There is an upward trend from left to right.

2. Two variables are negatively associated if small values of X are associated with large values of Y, and large values of X are associated with small values of Y.

There is a downward trend from left to right.

Types of trend between X and Y:

1. Linear – points fall close to a straight line.

2. Quadratic – points follow a parabolic pattern.

3. Exponential – points follow a curved pattern, either as exponential growth (upward) or exponential decay (downward).

Strength of the relationship: measures the amount of scatter around the general (linear) trend.

The closer the points fall to a straight line, the stronger the linear relationship between the two variables.

To completely describe the relationship between two variables, in addition to a scatterplot we also need a numerical measure of the relationship.

Such a numerical measure is the correlation coefficient.

A numerical measure of the direction and strength of the linear relationship between two variables.

The correlation coefficient measures the amount of scatter of the observations around the regression line.

Measures the relationship regardless of whether it is causal or not.

The population correlation coefficient is denoted by the Greek letter ρ (read "rho"). This is a parameter.

Sample correlation coefficient is denoted by r. This is a statistic.

1. The correlation coefficient is always between −1 and +1.

2. A negative r indicates a negative association between X and Y.

3. A positive r indicates a positive association between X and Y.

4. r near 0 implies that there is a very weak linear relationship. Hence there is either much scatter in the points, indicating no relationship between X and Y, or the points follow some nonlinear pattern, indicating that there is a nonlinear relationship between X and Y.

5. The strength of the linear relationship increases as r moves away from 0 toward -1 or +1.

r near −1 or +1 indicates a strong linear relationship between X and Y.

r equal to exactly −1 or +1 indicates a perfect linear relationship between X and Y, meaning all points fall exactly on a straight line.

6. The correlation coefficient is affected by extreme values in either the X or Y direction, and hence should be used with caution when extreme values appear in the scatterplot.

The sample correlation coefficient is computed as r = Sxy / √(Sxx · Syy).

*The sign of the correlation coefficient will depend entirely on the sign of Sxy because both Sxx and Syy will always be positive.
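As a quick check of this claim, here is a minimal sketch in plain Python (the age/weight numbers are made up for illustration) that computes Sxx, Syy, Sxy, and r from scratch:

```python
import math

x = [2, 4, 6, 8, 10]          # age in years (made-up data)
y = [26, 46, 66, 80, 110]     # weight in pounds (made-up data)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sxx and Syy are sums of squares and thus always positive,
# so the sign of r is entirely the sign of Sxy.
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))
```

For these data Sxy is positive, so r is positive, matching the upward trend in the points.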

Now our goal is to determine the equation of the line that best models (explains) the relationship between X and Y. This is referred to as the regression line.

Y = intercept + slope(X)

where the intercept is the predicted value of Y when X = 0

and the slope is the amount that Y changes (increases or decreases)

when X is increased by one unit.

Example: weight (in pounds) = 6 + 10 · age (in years)

intercept = 6: when a child is 0 years old, the child is predicted to weigh 6 pounds.

slope = 10: if a child's age increases by one year, then his or her weight is predicted to increase by 10 pounds (weight increases 10 pounds each year).

Once we determine the intercept and the slope, we can use the line Y = intercept + slope(X) to predict values of Y given values of X.

The prediction equation is ŷ = intercept + slope(X)

y = the observed value of the dependent variable

ŷ = the predicted value of the dependent variable

y − ŷ is called a residual, and our goal is to make the residuals as small as possible.

Determine values of the intercept and slope such that the sum of the squared residuals is minimized:

minimize Σ (y − ŷ)²

slope = Sxy / Sxx = r (sy / sx)

intercept = ȳ − slope · x̄

where Sxx and Sxy are defined on pages 106–107 and r is the correlation between X and Y. Again x̄ and ȳ are the means of the X and Y data, respectively, and sx and sy are the standard deviations of the X and Y data, respectively.
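These formulas can be sketched in plain Python (the data are made up); the two equivalent forms of the slope, Sxy/Sxx and r·(sy/sx), are checked against each other:

```python
import math

x = [2, 4, 6, 8, 10]          # made-up age data
y = [26, 46, 66, 80, 110]     # made-up weight data

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = s_xy / s_xx                  # first form: Sxy / Sxx
intercept = y_bar - slope * x_bar    # intercept = ybar - slope * xbar

# second form of the slope: r * (sy / sx) must give the same value
r = s_xy / math.sqrt(s_xx * s_yy)
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))
assert abs(slope - r * s_y / s_x) < 1e-9

print(intercept, slope)
```

For these particular data the fitted line comes out close to the weight = 6 + 10 · age example used throughout these notes.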

We can predict the value of Y for any value of X simply by substituting the value of X into the regression equation.

Example: weight = 6 + 10 * age

At age = 4, we predict weight = 6 + 10(4) = 6 + 40 = 46 pounds.

When predicting, it is important that the value of X at which we want to predict falls within the range of the original X data. The regression line describes the linear relationship between X and Y only for the range of data that we have.

Predicting outside the range of the original X data is called extrapolation and should be avoided.

Example: If the data used to determine the regression equation weight = 6 + 10 * age is only for kids between the ages of 2 and 10 (X between 2 and 10), then predicting the weight of a 35 year old is extrapolation:

weight = 6 + 10(35) = 356 pounds.
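One way to guard against extrapolation in code is to refuse predictions outside the range of the original X data. A hedged sketch using the fitted line and the ages 2–10 range from the example (the helper name is my own):

```python
X_MIN, X_MAX = 2, 10   # range of the original age data in the example

def predict_weight(age):
    """Predict weight (pounds) from age (years) with weight = 6 + 10*age,
    refusing to extrapolate outside the original data range."""
    if not (X_MIN <= age <= X_MAX):
        raise ValueError(
            f"age {age} is outside [{X_MIN}, {X_MAX}]: extrapolation")
    return 6 + 10 * age

print(predict_weight(4))   # within range: 6 + 10(4) = 46 pounds
```

Calling `predict_weight(35)` raises an error instead of returning the meaningless 356-pound prediction.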

The difference between an observed dependent variable (Y) value and a predicted dependent variable value.

residual = y − ŷ

This is the vertical deviation of a data point from the regression line.

The residuals can be used to analyze the quality and usefulness of the regression line

1. Compute the residual for each observation.

2. Create a scatterplot with the independent

variable (X) on the horizontal axis and the

residuals on the vertical axis. This is called a

residual plot.
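Step 1 can be sketched in plain Python (step 2, the plot itself, requires a plotting library and is omitted here). The data are made up; the intercept and slope below are the least-squares fit for these data:

```python
x = [2, 4, 6, 8, 10]          # made-up independent-variable data
y = [26, 46, 66, 80, 110]     # made-up dependent-variable data

intercept, slope = 5.0, 10.1  # least-squares fit for these data

# residual = observed y minus predicted y, for each observation
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# For a least-squares fit, the residuals always sum to (essentially) zero.
print([round(e, 1) for e in residuals])
```

Plotting these residuals against x would give the residual plot described next.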

If the linear model is appropriate, the points in the residual plot are randomly scattered around 0, with no obvious pattern.

A residual plot that reveals a curved (for example, U-shaped) pattern indicates that a linear relationship may not exist; instead the relationship may be quadratic.


Two variables: an outlier is an observation that falls within the range of the data in the horizontal (X) direction but that lies far from the regression line in the vertical direction and hence produces a large residual.

Observations which stand out from the other observations in the horizontal (X) direction are called influential observations.

Influential observations usually have an unusually large influence on the position of the regression line.

The coefficient of determination, r², measures the proportion (fraction) of the total variation in the Y values that can be explained by the X values. So we want the coefficient of determination to be as large as possible.

r² will always be between 0 and 1.

An r² close to 1 implies that X explains most of the variation in Y and hence the regression line does a good job of predicting Y values from X.

An r² close to 0 indicates that the regression line is rather useless and that we should not put too much faith in the predicted values that result.


For example, if the correlation coefficient is r = .60, then the coefficient of determination is r² = (.60)² = .36, and hence the X variable explains approximately 36% of the variation in the Y variable.
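The r-to-r² arithmetic can be checked directly:

```python
r = 0.60
r_squared = r ** 2

# X explains about 36% of the variation in Y
print(round(r_squared, 2))
```

Note that r = −.60 gives the same r²: squaring discards the direction and keeps only the strength.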

Everything to this point has assumed two quantitative variables, an independent (or explanatory) variable X and a dependent (or response) variable Y. We have talked about how to describe the relationship between the two variables (direction, form and strength) and how the scatterplot, correlation coefficient and regression line can be used to help do this.

Now suppose we have two qualitative or categorical variables: the variables vary in name, but not in magnitude, implying that they cannot be ranked.

All we can do is name the categories and count the number of observations falling in each category.

The question remains: is there a relationship between the two variables?

With two variables, we can count the number of observations that fall in each pair of categories. The counts are displayed in a two-way table.

                 Freshman   Sophomore   Junior   Senior
Warning             48          36        15       23
Probation           29          42        12       14
Good standing       71          37        18       62

There exists a marginal distribution for each variable.

A marginal distribution lists the categories of the variable together with the frequency (count) or relative frequency (percentage) of observations in each category.

Example:

Two-Way Table

             Cough   No Cough
Smoker         43       43
Nonsmoker      19       95

Marginal Distribution for Smoking Status

             Frequency   Relative Frequency
Smoker           86        86/200 = 43%
Nonsmoker       114       114/200 = 57%

Marginal Distribution for Coughing Status

             Frequency   Relative Frequency
Cough            62        62/200 = 31%
No Cough        138       138/200 = 69%
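Computing both marginal distributions from the two-way table above can be sketched in plain Python:

```python
# Two-way table of smoking status vs. coughing status (counts from above)
table = {
    ("Smoker", "Cough"): 43, ("Smoker", "No Cough"): 43,
    ("Nonsmoker", "Cough"): 19, ("Nonsmoker", "No Cough"): 95,
}
total = sum(table.values())   # 200 observations in all

smoking = {}    # marginal distribution for smoking status (row totals)
coughing = {}   # marginal distribution for coughing status (column totals)
for (row, col), count in table.items():
    smoking[row] = smoking.get(row, 0) + count
    coughing[col] = coughing.get(col, 0) + count

print(smoking)    # frequencies: Smoker 86, Nonsmoker 114
print({k: f"{v / total:.0%}" for k, v in smoking.items()})
```

The same loop recovers both marginal tables shown above: summing across a row or down a column of the two-way table.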

If the conditional distributions of variable 2 are nearly the same for each category of variable 1, then we say that there is not an association between the two variables.

If there are significant differences in the conditional distributions of variable 2 for the different categories of variable 1, then we say that there is an association between the two variables.
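To compare conditional distributions, we can condition on smoking status and look at the coughing rate within each group; a minimal sketch using the two-way table above:

```python
# Two-way table of smoking status vs. coughing status (counts from above)
table = {
    ("Smoker", "Cough"): 43, ("Smoker", "No Cough"): 43,
    ("Nonsmoker", "Cough"): 19, ("Nonsmoker", "No Cough"): 95,
}

def cough_rate(group):
    """Conditional relative frequency of coughing within one group."""
    row_total = table[(group, "Cough")] + table[(group, "No Cough")]
    return table[(group, "Cough")] / row_total

# 50% of smokers cough vs. about 17% of nonsmokers: the conditional
# distributions differ noticeably, so smoking status and coughing
# status appear to be associated.
print(f"{cough_rate('Smoker'):.0%}, {cough_rate('Nonsmoker'):.0%}")
```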

There are two categorical variables, and we observe a relationship between the two variables. Now we divide the data set up into subgroups, and when we do so the relationship that we observe reverses. This reversing of the relationship is referred to as Simpson’s Paradox.

Simpson's Paradox arises from a lurking variable: the direction of the relationship between two variables when the lurking variable is ignored is the reverse of the direction when the lurking variable is taken into account.

The lurking variable creates subgroups, and failure to take the lurking variable into consideration can lead to misleading conclusions regarding the association between the two variables.

This is an example of Simpson's Paradox. When the lurking variable, the school to which the student applied (business or art), is ignored, the data seem to suggest discrimination against women. However, when the school is considered, the association is reversed and suggests discrimination against men.
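The reversal can be demonstrated with invented admissions counts (the numbers below are mine, chosen only to produce the paradox, not taken from any real data set):

```python
# (admitted, applied) for each school-by-sex subgroup
data = {
    ("Business", "Men"):   (2, 20),     # 10% admitted
    ("Business", "Women"): (16, 100),   # 16% admitted
    ("Art", "Men"):        (80, 100),   # 80% admitted
    ("Art", "Women"):      (18, 20),    # 90% admitted
}

def rate(school, sex):
    admitted, applied = data[(school, sex)]
    return admitted / applied

def overall(sex):
    # Ignore the lurking variable (school) by pooling the counts.
    admitted = sum(a for (s, g), (a, n) in data.items() if g == sex)
    applied = sum(n for (s, g), (a, n) in data.items() if g == sex)
    return admitted / applied

# Within each school, women are admitted at the higher rate...
assert rate("Business", "Women") > rate("Business", "Men")
assert rate("Art", "Women") > rate("Art", "Men")
# ...yet pooling over schools reverses the direction.
assert overall("Men") > overall("Women")
```

The reversal happens because most men apply to the easy-to-enter art school while most women apply to the hard-to-enter business school; school is the lurking variable.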