Linear Regression Models

Carson West

Linear regression models are a fundamental tool in AP Statistics for describing the relationship between two quantitative variables when a scatterplot suggests a linear association. The primary goal is to predict the value of one variable (the response variable) from the value of another (the explanatory variable). This topic builds on Representing the Relationship Between Two Quantitative Variables and Correlation.

The Least-Squares Regression Line (LSRL)

The most common line fitted to a set of data is the Least-Squares Regression Line (LSRL). This line minimizes the sum of the squared residuals, where a residual is the vertical distance between an observed data point and the line. The equation of the LSRL is:

$$ \hat{y} = a + bx $$
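Where $ \hat{y} $ is the predicted value of the response variable, $ a $ is the y-intercept, $ b $ is the slope, and $ x $ is the value of the explanatory variable.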

For a deeper dive into its calculation, refer to Least Squares Regression.
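
As a rough illustration of the least-squares idea, the Python sketch below compares the sum of squared residuals for the LSRL with that of a few nearby lines. The data are made up for illustration. The helper name `sse` and the deviation-based formula for the slope are assumptions not taken from these notes.

```python
# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sse(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Least-squares slope and intercept (deviation form of the formulas)
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

lsrl_sse = sse(a, b)

# Nudge the intercept and slope and recompute the sum of squares
other_sse = [
    sse(a + da, b + db)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.2, 0.0, 0.2)
    if (da, db) != (0.0, 0.0)
]

print(f"LSRL: y-hat = {a:.2f} + {b:.2f}x, SSE = {lsrl_sse:.2f}")
print(f"Smallest SSE among the nudged lines = {min(other_sse):.2f}")
```

Every nudged line should give a larger sum of squared residuals than the LSRL, which is what "least squares" means.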

Interpreting Slope and Y-intercept

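Interpreting the slope and y-intercept in context is crucial. The slope $ b $ is the predicted change in the response variable for each one-unit increase in the explanatory variable. The y-intercept $ a $ is the predicted value of the response variable when the explanatory variable equals 0; this interpretation is meaningful only when $ x = 0 $ makes sense in context and is within or near the range of the data.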

Formulas for Slope and Y-intercept

The slope ( $ b $ ) and y-intercept ( $ a $ ) can be calculated using the following formulas:

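$$ b = r \frac{s_y}{s_x} $$
$$ a = \bar{y} - b\bar{x} $$
Where $ r $ is the correlation coefficient, $ s_x $ and $ s_y $ are the sample standard deviations of $ x $ and $ y $ , and $ \bar{x} $ and $ \bar{y} $ are their means. The second formula shows that the LSRL always passes through the point $ (\bar{x}, \bar{y}) $ .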
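
The two formulas can be checked numerically. The sketch below is a minimal example using Python's standard `statistics` module on made-up data; the data values are assumptions, and $ r $ is computed by hand as the sum of products of z-scores divided by $ n - 1 $ .

```python
from statistics import mean, stdev

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)

# Correlation: sum of products of z-scores, divided by n - 1
r = sum(
    ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)
) / (n - 1)

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # y-intercept

print(f"r = {r:.4f}, b = {b:.4f}, a = {a:.4f}")
# prints: r = 0.7746, b = 0.6000, a = 2.2000
```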

Residuals

Residuals are the differences between the observed values of the response variable and the values predicted by the regression line. They represent the “error” in the prediction for each data point.

$$ \text{Residual} = \text{Observed } y - \text{Predicted } y = y - \hat{y} $$
A residual plot is a scatterplot of the residuals against the explanatory variable (or against the predicted values $ \hat{y} $ ). When a linear model is appropriate, the residual plot shows no obvious pattern, only random scatter around zero. A pattern such as a curve indicates that a linear model might not be appropriate; see Analyzing Departures from Linearity.
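
A minimal sketch of the residual calculation, using made-up data. The helper `fit_lsrl` is a hypothetical name introduced here for illustration, not a library function.

```python
# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

a, b = fit_lsrl(x, y)
predicted = [a + b * xi for xi in x]

# residual = observed y - predicted y
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

for xi, e in zip(x, residuals):
    print(f"x = {xi}: residual = {e:+.2f}")
```

Plotting `residuals` against `x` would give the residual plot. For a least-squares line the residuals always sum to zero (up to rounding), so the check for a good fit is the absence of a pattern, not the sum.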

Coefficient of Determination ( $ R^2 $ )

The coefficient of determination, denoted as $ R^2 $ , is a measure that tells us how well the regression line fits the data. It is the square of the correlation coefficient ( $ r $ ).

$$ R^2 = r^2 $$

Interpretation of $ R^2 $

$ R^2 $ represents the proportion (or percentage, if multiplied by 100) of the variation in the response variable ( $ y $ ) that can be explained by the linear relationship with the explanatory variable ( $ x $ ). A higher $ R^2 $ indicates that the model explains more of the variability in the response variable.

Example: If $ R^2 = 0.75 $ , then 75% of the variation in $ y $ can be explained by the linear relationship with $ x $ . The remaining 25% of the variation is not explained by the linear model; it reflects other variables or random variation.
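
The "proportion of variation explained" reading can be made concrete. The sketch below assumes made-up data and computes $ R^2 $ as one minus the ratio of the residual sum of squares to the total sum of squares around $ \bar{y} $ , then compares it with $ r^2 $ . The variable names `sse` and `sst` are illustrative choices, not notation from these notes.

```python
# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b = sxy / sxx          # LSRL slope
a = y_bar - b * x_bar  # LSRL y-intercept

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
sst = syy                                                    # total variation in y

r_squared = 1 - sse / sst
r = sxy / (sxx * syy) ** 0.5  # correlation coefficient

print(f"R^2 = {r_squared:.2f}, r^2 = {r ** 2:.2f}")
# prints: R^2 = 0.60, r^2 = 0.60
```

Here about 60% of the variation in the made-up $ y $ values is explained by the linear relationship with $ x $ .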

Standard Deviation of the Residuals ( $ s_e $ or $ s $ )

The standard deviation of the residuals, often denoted as $ s $ or $ s_e $ , measures the typical distance between the observed $ y $ values and the predicted $ \hat{y} $ values (the typical size of a residual).

$$ s = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}} $$
Where $ n $ is the number of data points.

Interpretation of $ s $

$ s $ estimates the standard deviation of the errors (residuals) and tells us the typical prediction error using the LSRL. A smaller value of $ s $ indicates that the observed $ y $ values generally fall closer to the regression line, meaning the model makes more precise predictions.
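
A short sketch of the formula for $ s $ , again on made-up data. Note the divisor $ n - 2 $ rather than $ n - 1 $ .

```python
from math import sqrt

# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# LSRL slope and y-intercept
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Sum of squared residuals, then divide by n - 2 and take the square root
sum_sq_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sum_sq_resid / (n - 2))

print(f"s = {s:.4f}")
# prints: s = 0.8944
```

In context, predictions from this line are typically off by about 0.89 units of $ y $ .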

Extrapolation

Extrapolation is the use of a regression line to predict values of the response variable ( $ y $ ) for values of the explanatory variable ( $ x $ ) that are outside the range of the observed data. This is generally not recommended because the linear relationship observed within the data range may not continue outside that range. The further you extrapolate, the less reliable your prediction becomes.
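
One way to guard against accidental extrapolation is to check each new $ x $ value against the observed range before trusting the prediction. The sketch below is only an illustration on made-up data; the function name `predict` and its return format are hypothetical.

```python
# Hypothetical data, made up for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

def predict(x_new):
    """Return (predicted y, True if x_new is outside the observed x range)."""
    y_hat = a + b * x_new
    is_extrapolation = x_new < min(x) or x_new > max(x)
    return y_hat, is_extrapolation

for x_new in (3, 10):
    y_hat, flagged = predict(x_new)
    note = " (extrapolation: treat with caution)" if flagged else ""
    print(f"x = {x_new}: y-hat = {y_hat:.1f}{note}")
```

The line still returns a number for $ x = 10 $ , but nothing in the data supports the linear pattern that far beyond the observed values of 1 through 5.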

Outliers and Influential Points

Points that lie far from the overall pattern of the data deserve special attention; see Outliers and Influential Points. An outlier in regression is a point with a large residual (far from the line in the y-direction). A point whose x-value is far from the other x-values has high leverage. An influential point is any point that, if removed, would substantially change the slope, y-intercept, or correlation; high-leverage points are often influential. It is important to identify such points and assess their impact.
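
The effect of an influential point can be seen by fitting the line with and without it. The sketch below uses made-up data in which the last point has an x-value far from the rest; `fit_lsrl` is a hypothetical helper name, not a library function.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data; the last point (15, 2) is far from the others in x
x = [1, 2, 3, 4, 5, 15]
y = [2, 4, 5, 4, 5, 2]

a_with, b_with = fit_lsrl(x, y)
a_without, b_without = fit_lsrl(x[:-1], y[:-1])  # drop the last point

print(f"With the point:    y-hat = {a_with:.2f} + ({b_with:.3f})x")
print(f"Without the point: y-hat = {a_without:.2f} + ({b_without:.3f})x")
```

In this example, removing the single point changes the slope from negative to positive, which is the behavior that makes a point influential.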