-
Chapter 7: Scatterplots, Associations, and Correlation
-
Scatterplots: display patterns, trends, relationships, and unusual values
-
Between two quantitative variables
- Association between them?
-
Looking at Scatterplots
-
direction
- positive
- negative
-
form
- straight line relationship
- appears as cloud or swarm of points
- linearity
-
scatter
- lots of random scatter: weak association
- little random scatter: strong association
-
Mechanics
- y-axis
- x-axis
- computer-generated scatterplots often do not show the origin
-
Two Variable Roles
-
explanatory
- x-axis
-
response variable
- y-axis
-
Correlation
-
find z-scores of x- and y-variables
-
multiply each point's z_x and z_y scores together, find the sum
- divide sum by n-1 to get r: r = sum(z_x * z_y) / (n - 1)
- r summarizes direction + strength of association
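The steps above can be sketched in Python. This is a minimal illustration, not a reference implementation; the function name `correlation` and the data values are made up for the example.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r computed from z-scores (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample SDs (these divide by n - 1)
    # Step 1: z-score of every x and every y
    zx = [(x - x_bar) / s_x for x in xs]
    zy = [(y - y_bar) / s_y for y in ys]
    # Step 2: multiply paired z-scores and sum; Step 3: divide by n - 1
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Made-up data, for illustration only
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = correlation(hours, score)
print(f"r = {r:.3f}")
```

A positive r here matches the upward drift in the made-up scores; points lying exactly on an upward line would give r = 1.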
-
Correlation Conditions
-
Correlation measures strength of linear association
- Quantitative Variables condition
- Straight enough Condition
- must be linear!
- Outlier Condition
- outliers distort correlation
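A small sketch of the Outlier Condition, assuming made-up data: five points exactly on a line, then the same points plus one hypothetical outlier. The helper `corr` is an illustrative name; it uses the sums-of-squares form of r, which gives the same value as the z-score recipe.

```python
from statistics import mean

def corr(xs, ys):
    """Correlation r via sums of squares (same value as the z-score formula)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up data: five points exactly on a straight line
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = corr(xs, ys)

# Same data plus one hypothetical outlier at (10, 0)
r_outlier = corr(xs + [10], ys + [0])

print(f"without outlier: r = {r_clean:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

In this contrived case a single point is enough to turn a perfect positive correlation into a negative one.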
-
Between +1 and -1
- the closer to -1 or +1, the stronger the linear association
- no units
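Because r is built from z-scores, changing the units of either variable should leave it unchanged. A quick check under assumed data (the heights and weights below are invented, and `corr` is an illustrative helper):

```python
from statistics import mean

def corr(xs, ys):
    """Correlation r via sums of squares (same value as the z-score formula)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Invented measurements in inches and pounds
height_in = [60, 62, 65, 68, 70]
weight_lb = [110, 120, 140, 155, 160]

# The same measurements converted to centimeters and kilograms
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = corr(height_in, weight_lb)
r_metric = corr(height_cm, weight_kg)
print(f"r (in, lb) = {r_imperial:.4f}")
print(f"r (cm, kg) = {r_metric:.4f}")
```

Both values agree up to floating-point rounding and stay between -1 and +1.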
-
Correlation isn't Association
- Association: vague term describing relationship between two variables
- Correlation: very precise term describing LINEAR relationship between quantitative variables
- Expressed as r
-
Lurking Variables
- hidden variable that stands behind a relationship + affects both variables
-
Chapter 8: Linear Regression
-
Models for Data
-
model relationship w/ line
-
require numbers
- parameters
- linear model: equation of straight line through data
- compare: Normal model is specified w/ mean and standard deviation
- model
-
Residuals
- Linear models not perfect
-
predicted value
- estimate
-
y-hat
- symbol for the predicted value; observed value - y-hat gives the residual
-
residuals: difference between observed value + associated predicted value
- how far off the model's prediction is
- Data = Model + Residual or Residual = Data - Model
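The two identities can be checked numerically. The model `predict` and the observed pairs below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical linear model, for illustration only: y-hat = 1 + 2x
def predict(x):
    return 1 + 2 * x

# Made-up (x, observed y) pairs
observed = [(1, 3.5), (2, 4.0), (3, 7.5)]

for x, y in observed:
    y_hat = predict(x)
    residual = y - y_hat            # Residual = Data - Model
    assert y == y_hat + residual    # Data = Model + Residual
    print(f"x={x}: observed={y}, predicted={y_hat}, residual={residual:+.1f}")

residuals = [y - predict(x) for x, y in observed]
```

A positive residual means the model underestimated that observation; a negative one means it overestimated.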
-
"Best Fit" Means Least Squares
- line of best fit: the line for which the sum of squared residuals is smallest
-
square residuals
- makes them all positive (or zero)
-
add them up
- tells how far off line is
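A sketch of the least squares idea: square each residual, add them up, and compare candidate lines. The data and the helper name `sum_squared_residuals` are assumptions for illustration; the first candidate line happens to be the least squares line for these made-up points.

```python
def sum_squared_residuals(points, intercept, slope):
    """Add up the squared residuals of a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

# Made-up data
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

# Candidate A: the least squares line for these particular points
sse_a = sum_squared_residuals(points, 2.2, 0.6)
# Candidate B: an arbitrary alternative line
sse_b = sum_squared_residuals(points, 1.0, 1.0)
# Candidate C: candidate A with its slope nudged slightly
sse_c = sum_squared_residuals(points, 2.2, 0.61)

print(f"A: {sse_a:.3f}  B: {sse_b:.3f}  C: {sse_c:.3f}")
```

Any line other than candidate A gives a larger total, which is what "best fit" means here.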
-
Correlation and the Line
-
slope: value m
- larger |m|, steeper slope
-
negative
- negative association
-
zero
- horizontal line
-
correlation coefficient, r, is the slope m when x and y are standardized
-
z_y-hat = r * z_x
- moving one SD away from mean in x moves prediction r SDs away from mean in y
-
Size of Predicted Values
-
regression to the mean
- each predicted y tends to be closer to its mean (in SDs) than its corresponding x was
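In standardized units the prediction is the x z-score multiplied by r, so regression to the mean can be seen with a few numbers. The value r = 0.5 and the function name `predicted_z` are assumptions for illustration.

```python
def predicted_z(r, z_x):
    """Predicted z-score of y for a given z-score of x (standardized line)."""
    return r * z_x

r = 0.5  # hypothetical correlation
z_values = [-2.0, -1.0, 0.0, 1.0, 2.0]

for z_x in z_values:
    z_y_hat = predicted_z(r, z_x)
    print(f"x is {z_x:+.1f} SDs from its mean -> predicted y is {z_y_hat:+.1f} SDs from its mean")
```

Since r is never larger than 1 in size, the predicted y is never more SDs from its mean than x was from its own.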
-
regression line
- linear equation that satisfies the least squares criterion
-
Units
-
y-intercept
- b_0, in units of y
- value of y where the line crosses the y-axis (predicted y when x = 0)
-
slope
- b_1, in units of y per unit of x
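One common way to get the slope and intercept from summary statistics is slope = r * s_y / s_x and intercept = mean of y minus slope times mean of x. These formulas are not written out in the notes above, so treat the sketch below as an assumption-labeled illustration; `fit_line` and the data are made up.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Least squares intercept and slope from r, the SDs, and the means."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = r * s_y / s_x        # slope: units of y per unit of x
    b0 = y_bar - b1 * x_bar   # intercept: units of y
    return b0, b1

# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```

The line always passes through the point (mean of x, mean of y), which is how the intercept is obtained.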
-
R-Squared
- gives fraction of data's variance accounted for by model
- 1 minus R-squared is fraction of original variance left in residuals
- given as percentage, typically
- R-squared of 100% is perfect fit w/ no scatter around line
- measures success of regression line
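A sketch of R-squared as "1 minus the fraction of variance left in the residuals". The helper `r_squared`, the data, and the line coefficients are illustrative assumptions; the coefficients are the least squares values for these made-up points.

```python
from statistics import mean

def r_squared(xs, ys, b0, b1):
    """Fraction of the variation in y accounted for by the line b0 + b1*x."""
    y_bar = mean(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total

# Made-up data and its least squares line (intercept 2.2, slope 0.6)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys, 2.2, 0.6)
print(f"R-squared = {r2:.0%}")

# A perfect fit leaves nothing in the residuals
r2_perfect = r_squared(xs, [1 + 2 * x for x in xs], 1, 2)
print(f"perfect fit: R-squared = {r2_perfect:.0%}")
```

For a least squares line with one predictor, this value equals the square of the correlation r.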
-
Examining Residuals
-
check whether linear model appropriate
- plot residuals
-
histogram
- displays multiple modes + y-outliers
-
scatterplot
-
residuals versus predicted values
- reveals bends, groups, model outliers
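A bend can show up in the residuals even without drawing the plot. The sketch below fits a line to deliberately curved, made-up data and lists residuals against predicted values; `fit_line` is an illustrative helper.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares intercept and slope."""
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

# Made-up curved data: y = x squared, so a straight line is the wrong model
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

for p, e in zip(predicted, residuals):
    print(f"predicted={p:6.1f}  residual={e:+.1f}")
```

The residuals run positive, negative, negative, negative, positive: a U-shape that signals the relationship is not straight.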
-
Chapter 9: Regression Wisdom
-
Subset
-
data consist of two or more groups that have been thrown together
- best to fit a different linear model to each group
- often revealed by residual plots
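A sketch of why subsets matter, using two invented groups that follow different lines. Fitting one line to the pooled data hides both patterns; `fit_line` and the group data are assumptions for illustration.

```python
from statistics import mean

def fit_line(points):
    """Least squares intercept and slope for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in points)
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

# Two made-up groups with different behavior
group_a = [(1, 2), (2, 4), (3, 6), (4, 8)]   # rises: y = 2x
group_b = [(1, 9), (2, 8), (3, 7), (4, 6)]   # falls: y = 10 - x

pooled = fit_line(group_a + group_b)
fit_a = fit_line(group_a)
fit_b = fit_line(group_b)

for name, (b0, b1) in [("pooled", pooled), ("group A", fit_a), ("group B", fit_b)]:
    print(f"{name:8s} y-hat = {b0:.2f} + {b1:.2f} x")
```

The pooled slope describes neither group, which is the reason to analyze the groups separately.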
-
Sifting Residuals for Groups
- May need to analyze groups of data in scatterplot separately (if diff. behavior than most of data)
-
Extrapolation
- plugging into the equation new x-values that venture far from the mean of x and were not part of the data behind the linear regression model
- dangerous
-
time as x-variable
- extrapolation becomes attempt to peer into future
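One simple guard is to flag any prediction whose x-value falls outside the range of the data used to fit the line. Using the observed range as the cutoff is an assumption made for this sketch; the function `predict`, the coefficients, and the data are hypothetical.

```python
def predict(x_new, xs_observed, b0, b1):
    """Return (prediction, is_extrapolation) for a fitted line b0 + b1*x."""
    inside = min(xs_observed) <= x_new <= max(xs_observed)
    return b0 + b1 * x_new, not inside

# Hypothetical fitted line and the x-values it was fitted on
xs_observed = [1, 2, 3, 4, 5]
b0, b1 = 2.2, 0.6

for x_new in (3, 50):
    y_hat, risky = predict(x_new, xs_observed, b0, b1)
    note = "extrapolation, treat with caution" if risky else "within the data"
    print(f"x={x_new}: y-hat={y_hat:.1f} ({note})")
```

The equation will happily return a number for x = 50, but nothing in the data says the straight-line pattern continues that far.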
-
Outliers
-
can strongly influence regression
- even single point
-
outlier: any point that stands away from others
-
model outliers
- removing generally increases R-squared
- x-outliers
- y-outliers
-
leverage: points whose x-values are far from the mean of x
- pull line close to them
- sometimes determine slope and intercept
- removing can decrease R-squared
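A sketch of a high-leverage point at work: a small cluster with no association, plus one invented point far from the mean of x. The helper `slope_and_r2` is illustrative, and it computes R-squared as the squared correlation, which holds for a least squares line with one predictor.

```python
from statistics import mean

def slope_and_r2(points):
    """Least squares slope and R-squared for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Made-up cluster with no linear association
cluster = [(1, 2), (2, 1), (3, 2), (2, 3)]
# One hypothetical point far from the mean of x
leverage_point = (10, 10)

slope_with, r2_with = slope_and_r2(cluster + [leverage_point])
slope_without, r2_without = slope_and_r2(cluster)

print(f"with the point:    slope={slope_with:.2f}, R-squared={r2_with:.0%}")
print(f"without the point: slope={slope_without:.2f}, R-squared={r2_without:.0%}")
```

Here the single far-away point sets the slope almost by itself, and removing it drops R-squared to zero.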
-
Influential Points
- can hide in plots of residuals
- easier to see in scatterplots of original data
-
Lurking Variables and Causation
- correlation isn't causation
- lurking variable: not explicitly part of model but affects variables in model
-
Things can go wrong
- check straightness of relationship
- Do not extrapolate
- Be especially wary of extrapolating into the future (time as x-variable)
- Subsets in regression: separate them/analyze separately
- Outliers
- Leverage points
- Lurking Variables
-
Summary statistics
-
less variable than raw data
- can inflate impression of strength of an association
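A sketch of how summarizing can inflate apparent strength: compute r on made-up raw data, then again after replacing the y-values at each x with their mean. The helper `corr` and the data are assumptions for illustration.

```python
from statistics import mean

def corr(xs, ys):
    """Correlation r via sums of squares (same value as the z-score formula)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up raw data: three scattered y-values at each x from 1 to 5
raw = [(x, x + e) for x in range(1, 6) for e in (-2, 0, 2)]
r_raw = corr([x for x, _ in raw], [y for _, y in raw])

# Summarize: one mean y per x-value
groups = {}
for x, y in raw:
    groups.setdefault(x, []).append(y)
xs_mean = sorted(groups)
ys_mean = [mean(groups[x]) for x in xs_mean]
r_means = corr(xs_mean, ys_mean)

print(f"raw data:    r = {r_raw:.2f}")
print(f"group means: r = {r_means:.2f}")
```

Averaging removes the scatter within each group, so the means look far more tightly associated than the individual values are.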