Week 7: Regression as a Tool for Answering Questions

A monthly visitation series shows strong summer peaks, a gradual upward trend, and a sharp drop in 2020.

Why might we decompose the series before forecasting?

A. To eliminate randomness completely

B. To separate long-run movement from seasonal patterns

C. To guarantee a more accurate forecast

D. To avoid using a training/test split

Why do we evaluate a forecasting model on a test set instead of the same data used to estimate it?

A. Because forecasting software requires it

B. To reduce computation time

C. To simulate how the model performs on unseen future data

D. To remove seasonality

Explanatory Data Analysis

  • Many questions in economics and business are about explaining relationships between variables.
  • We can generate evidence of these relationships using regression analysis.
  • There are many other methods, but regression is a good starting point.

Motivating Questions

  • Do fuel stations closer to highways charge higher prices?
  • Does advertising increase sales?
  • Do higher wages lead to higher productivity?

Running example:
Corn Yield and Drought

What is the effect of drought on corn yields?


%%{init: {"themeVariables": {"fontSize": "24px"}}}%%
flowchart LR
  G[Goals] ==> P[Problem]
  P ==> Q[Question]
  Q ==> Da[Data]
  Da ==> M[Model]
  M ==> R[Result]
  R ==> D[Decision]

  style M fill:#FFD966,stroke:#333,stroke-width:3px,color:#000


We seek a method that tells us how much corn yields change when drought conditions change.

Regression is a statistical method for understanding relationships between quantitative variables.

Example: Gas prices and highway proximity

Explanation vs. Prediction

  • Regression can be used for both explanation and prediction.
  • Explanation focuses on understanding the relationship between variables (e.g., how does drought affect corn yields?).
  • Prediction focuses on accurately forecasting future values (e.g., what will corn yields be next year?).
  • A good explanation may also provide a good prediction, though this is not guaranteed.

How does regression work?

  • Regression estimates the relationship between a dependent variable and one or more independent variables.
  • It quantifies how much the dependent variable changes when the independent variable(s) change.
  • It does this by “fitting a line” through the data points that best captures the relationship.
  • What do we mean by “best” fit?

Regression demo

  • What is the equation of a line?
  • The data give us our x and y values, which serve as the coordinates of each point.
  • We can visualize the data as points on a scatterplot.
  • We want to find the line that best fits those points.

Go to R

Regression math

  • The regression line is defined by the equation: \[ y = \alpha + \beta x + \epsilon \]

    • \(\alpha\) is the intercept (the value of \(y\) when \(x=0\))
    • \(\beta\) is the slope (the change in \(y\) for a one-unit change in \(x\))
    • \(\epsilon\) is the error term (the difference between the observed and predicted values)

Rearranging gives us: \[ y - \alpha - \beta x = \epsilon \]

The regression line is the one that minimizes the sum of squared errors: \[ \min_{\alpha, \beta} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 \]

Differentiating with respect to \(\alpha\) and \(\beta\) and setting the derivatives to zero gives us the normal equations, which we can solve to find the estimates of \(\alpha\) and \(\beta\).

\[ \begin{cases} \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0 \\ \sum_{i=1}^n x_i (y_i - \alpha - \beta x_i) = 0 \end{cases} \]

We have 2 equations with 2 unknowns. Solving these equations yields the least squares estimates: \[ \hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \] \[ \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \]

\(\hat{\beta}\) is also the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\): \[ \hat{\beta} = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \]
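The closed-form estimates above can be checked numerically. The following is a minimal sketch in plain Python; the five data points are made up for illustration and are not from the course data. It computes \(\hat{\beta}\) and \(\hat{\alpha}\) from the formulas and then confirms that the two normal equations hold (up to rounding) at those estimates.

```python
# Least squares estimates for a simple regression, using the closed-form formulas.
# The data below are hypothetical and chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Numerator and denominator of the slope formula
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Residuals at the estimated line
residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]

# Left-hand sides of the two normal equations; both should be (numerically) zero
normal_eq_1 = sum(residuals)
normal_eq_2 = sum(xi * ei for xi, ei in zip(x, residuals))

print(f"beta_hat = {beta_hat:.3f}, alpha_hat = {alpha_hat:.3f}")
print(f"normal equations: {normal_eq_1:.2e}, {normal_eq_2:.2e}")
```

Dividing both `s_xy` and `s_xx` by \(n - 1\) turns them into the sample covariance and sample variance, so the ratio is unchanged.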

Interpreting Regression Results

Regression Results Overview

  • Regression results provide estimates of the coefficients (intercept and slope) along with their standard errors, t-values, and p-values.
  • The output is generally reported in a table.
  • You need to understand how to read and interpret this table to draw meaningful conclusions from the regression analysis.

Example regression output: Estimate


| Variable | Estimate |
|---|---|
| x | -4.23 |
| Intercept | 4.48 |


This is just an estimate of the true relationship between x and y. We need to understand uncertainty.

Example regression output: Standard Errors


| Variable | Estimate | Std. Error |
|---|---|---|
| x | -4.23 | 0.23 |
| Intercept | 4.48 | 0.27 |


Does 0.23 seem small or large relative to the estimate of -4.23?

The standard error indicates typical sampling variation around the estimate.

Hypothesis Testing

  • We need a way to determine if the observed relationship is statistically significant or if it could have occurred by random chance.
  • We use a standardized test statistic (t-statistic) to test the null hypothesis that the true coefficient is zero (no relationship). \[ t = \frac{\hat{\beta} - 0}{SE(\hat{\beta})} \]

Null hypothesis: \(H_0\): \(\beta = 0\). Test statistic: \(t = (\hat{\beta} - 0)/SE(\hat{\beta})\).

Example regression output: t-statistics


| Variable | Estimate | Std. Error | t-stat |
|---|---|---|---|
| x | -4.23 | 0.23 | -18.39 |
| Intercept | 4.48 | 0.27 | 16.59 |

Example regression output: p-values


| Variable | Estimate | Std. Error | t-stat | p-value |
|---|---|---|---|---|
| x | -4.23 | 0.23 | -18.39 | <0.001 |
| Intercept | 4.48 | 0.27 | 16.59 | <0.001 |
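The t-statistics in the table can be reproduced directly from the estimates and standard errors. The sketch below does that and also computes an approximate two-sided p-value. It assumes a standard normal approximation to the t distribution, which is reasonable only for large samples; regression software would normally use a t distribution with the appropriate degrees of freedom. The function names are illustrative.

```python
from statistics import NormalDist


def t_stat(estimate, std_error, null_value=0.0):
    """Standardized distance between the estimate and the null value."""
    return (estimate - null_value) / std_error


def two_sided_p_normal(t):
    """Approximate two-sided p-value using the standard normal distribution.

    Assumption: the sample is large enough that the t distribution is close
    to the normal. With few observations this understates the p-value.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Estimates and standard errors copied from the example output
rows = {"x": (-4.23, 0.23), "Intercept": (4.48, 0.27)}

for name, (estimate, std_error) in rows.items():
    t = t_stat(estimate, std_error)
    p = two_sided_p_normal(t)
    print(f"{name}: t = {t:.2f}, p below 0.001: {p < 0.001}")
```

Both rows have an absolute t-statistic far above 2, so the approximate p-values are far below 0.001, in line with the table.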

Interpreting Coefficient Estimates

  • The coefficient estimates tell us the expected change in the dependent variable for a one-unit change in the independent variable.
  • In our example, the slope of -4.23 means that for every one-unit increase in x, y is expected to decrease by 4.23 units.
  • If x were miles from the highway and y were the gas price in cents, this would mean that for every additional mile from the highway, gas prices are expected to decrease by 4.23 cents.
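One way to see the "one-unit change" interpretation is to compare predictions at two values of x that differ by exactly one. This small sketch uses the estimates from the example output; the `predict` helper is illustrative only.

```python
# Coefficient estimates copied from the example regression output
ALPHA_HAT = 4.48   # intercept
BETA_HAT = -4.23   # slope


def predict(x):
    """Predicted y from the fitted line: alpha_hat + beta_hat * x."""
    return ALPHA_HAT + BETA_HAT * x


# Predictions one unit apart differ by the slope, wherever we start
change = predict(3.0) - predict(2.0)
print(f"predicted change in y for a one-unit increase in x: {change:.2f}")
```

The starting point does not matter: because the fitted relationship is a line, any one-unit increase in x changes the prediction by the slope.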

Estimating Uncertainty: Standard Errors

  • The estimates are just that: estimates. They are subject to sampling variability.
  • What if we repeated the study with a different sample of gas stations, or in a different time period? Would we get the same estimates?
  • The standard error gives us a measure of how much the estimate would vary across different samples.
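The idea of "repeating the study" can be illustrated with a simulation. The sketch below draws many samples from a hypothetical data-generating process (the true intercept, slope, noise level, and sample size are all assumptions chosen for illustration), estimates the slope in each sample, and reports how much the estimates vary. The spread of those estimates is what a standard error is trying to measure.

```python
import random
import statistics


def ols_slope(x, y):
    """Least squares slope: sum of cross-deviations over sum of squared x deviations."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def simulate_slopes(n_samples=1000, n_obs=50, alpha=4.5, beta=-4.0,
                    noise_sd=1.0, seed=0):
    """Estimate the slope in many independent samples from the same process."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        x = [rng.uniform(0.0, 1.0) for _ in range(n_obs)]
        y = [alpha + beta * xi + rng.gauss(0.0, noise_sd) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes


slopes = simulate_slopes()
print(f"average slope estimate: {statistics.mean(slopes):.2f}")
print(f"spread of estimates (std. dev.): {statistics.stdev(slopes):.2f}")
```

In practice we only have one sample, so the standard error is estimated from that sample rather than by repeating the study.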

Hypothesis Testing

  • We make assumptions about the error term \(\epsilon\) (e.g., it is normally distributed with mean 0 and constant variance) that allow us to perform hypothesis tests on the coefficients.
  • The null hypothesis is typically that the coefficient is equal to zero (no effect).
  • The t-statistic is the estimate divided by its standard error, and the p-value tells us the probability of observing a t-statistic at least this extreme if the null hypothesis were true.
  • In our example, the t-statistic of -18.39 for the slope indicates that the coefficient is highly statistically significant (p < 0.001), which is strong evidence of a relationship between x and y (distance from the highway and gas prices in our illustration).

Dependent variable: Gas Price ($/gallon)

| Variable | Estimate | Std. Error | t-stat | p-value |
|---|---|---|---|---|
| Distance to Highway (miles) | -0.04 | 0.02 | -2.00 | 0.048 |
| Constant | 3.50 | 0.10 | 35.00 | <0.001 |

What does the coefficient on distance to highway mean?

For every additional mile from the highway…

A. gas prices are expected to decrease by 4 cents.

B. gas prices are expected to decrease by 0.02 cents.

C. gas prices are expected to increase by 3.50 dollars.

D. gas prices are expected to decrease by 4 dollars.

Dependent variable: Daily Mountain Bike Trail Visits

| Variable | Estimate | Std. Error | t-stat | p-value |
|---|---|---|---|---|
| Temperature (°F) | 12.5 | 8.3 | 1.51 | 0.130 |
| Constant | -300.0 | 150.0 | -2.00 | 0.048 |

Which of the following is true about the relationship between temperature and daily mountain bike trail visits?

A. Each 1°F increase in temperature causes 12.5 more trail visits per day.

B. A 1°F increase in temperature is associated with 12.5 more visits per day, and this relationship is statistically significant at the 5% level.

C. A 1°F increase in temperature is associated with 12.5 more visits per day, but the estimate is not statistically significant at the 5% level.

D. Temperature has no relationship with trail visits because the p-value is greater than 0.05.

Summary

  • Regression is a powerful tool for quantifying relationships between variables.
  • Coefficient estimates tell us the expected change in the dependent variable for a one-unit change in the independent variable.
  • Standard errors and hypothesis tests help us understand the uncertainty around our estimates.
  • Interpretation of regression results requires careful consideration of the context and the assumptions underlying the model. (Pirates vs. global warming example)

Lab Preview

  • We will use regression to analyze the relationship between drought conditions and corn yields.
  • We will learn how to fit a regression model, interpret the output, and evaluate the model’s assumptions and performance.

Part 2

Dependent variable: Daily Mountain Bike Trail Visits

| Variable | Estimate | Std. Error | t-stat | p-value |
|---|---|---|---|---|
| Temperature (°F) | 12.5 | 8.3 | 1.51 | 0.130 |
| Constant | -300.0 | 150.0 | -2.00 | 0.048 |

The coefficient on temperature means:

A. A 1°F increase in temperature is associated with an increase of 12.5 daily trail visits, on average.

B. A 1°F increase in temperature changes trail visits by 8.3 per day.

C. A 1°F increase in temperature changes trail visits by 0.130 per day.

D. When temperature is 0°F, trail visits increase by 12.5 per day.

Dependent variable: Daily Mountain Bike Trail Visits

| Variable | Estimate | Std. Error | t-stat | p-value |
|---|---|---|---|---|
| Temperature (°F) | 12.5 | 8.3 | 1.51 | 0.130 |
| Constant | -300.0 | 150.0 | -2.00 | 0.048 |

Which of the following is true about the relationship between temperature and daily mountain bike trail visits?

A. Each 1°F increase in temperature causes 12.5 more trail visits per day.

B. A 1°F increase in temperature is associated with 12.5 more visits per day, but the estimate is not statistically significant at the \(\alpha = 0.05\) level.

C. A 1°F increase in temperature is associated with 12.5 more visits per day, and this relationship is statistically significant at the \(\alpha = 0.05\) level.

D. If temperature were 0°F, the model would predict 0 visits.

Dependent variable: Daily Mountain Bike Trail Visits

| Variable | Estimate | Std. Error | t-stat | p-value |
|---|---|---|---|---|
| Temperature (°F) | 12.5 | 8.3 | 1.51 | 0.130 |
| Constant | -300.0 | 150.0 | -2.00 | 0.048 |

The p-value tells you:

A. The probability of getting a test statistic this extreme (or more extreme) if the true coefficient were 0.

B. The probability that the null hypothesis is true.

C. The probability that the true temperature effect is exactly 12.5 visits per day.

D. The probability that temperature causes changes in trail visits.

Recap

  • Regression is a useful tool for quantifying relationships between variables (explanatory analysis).
  • Coefficient estimates tell us the expected change in the dependent variable for a one-unit change in the independent variable.
  • Standard errors and hypothesis tests help us understand the uncertainty around our estimates.

Regression can mislead

Regression finds patterns in the data.

But patterns do not always reflect causal relationships.

A regression coefficient can be misleading when:

  • Important variables are missing

  • The direction of cause and effect is unclear

  • Differences across places or people are ignored

Problem 1: Omitted Variables

Suppose we estimate:

\[\text{Gas Price} = \alpha + \beta \times \text{Distance to Highway} + \epsilon\]

But what if we are missing an important variable that explains variation in gas prices?

  • neighborhood income

  • competition from nearby stations

  • traffic volume

If these factors affect both distance and price, the estimate of \(\beta\) can be biased.

Omitted Variable Example

Estimate a regression of shark attacks on ice cream sales ($1000s):

| Variable | Estimate | Std. Error | t-stat | p-value |
|---|---|---|---|---|
| Ice Cream | 2.1 | 0.42 | 5.00 | <0.001 |


Interpret the coefficient on ice cream sales.

  • What could be missing from this regression that would explain the relationship between ice cream sales and shark attacks?

Omitted Variable Bias

Omitting a variable correlated with both the independent and the dependent variable can bias the estimates of the regression coefficients.
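A simulation can make the bias concrete. In the hypothetical setup below, temperature drives both ice cream sales and shark attacks, while ice cream sales have no effect on shark attacks at all. Every number in the data-generating process is an assumption chosen for illustration. The "short" regression omits temperature; the "long" regression includes it, using the standard two-regressor least squares formula.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


rng = random.Random(1)
n = 2000

# Hypothetical data: temperature affects both variables; ice cream does NOT
# affect shark attacks, so the true coefficient on ice cream is zero.
temperature = [rng.gauss(75.0, 10.0) for _ in range(n)]
ice_cream = [0.5 * t + rng.gauss(0.0, 3.0) for t in temperature]
shark_attacks = [0.2 * t + rng.gauss(0.0, 2.0) for t in temperature]

# Short regression: shark attacks on ice cream only (temperature omitted)
b_short = cov(ice_cream, shark_attacks) / cov(ice_cream, ice_cream)

# Long regression: shark attacks on ice cream AND temperature.
# Coefficient on ice cream from the two-regressor least squares formula.
s11 = cov(ice_cream, ice_cream)
s22 = cov(temperature, temperature)
s12 = cov(ice_cream, temperature)
s1y = cov(ice_cream, shark_attacks)
s2y = cov(temperature, shark_attacks)
b_long = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

print(f"ice cream coefficient, temperature omitted:  {b_short:.3f}")
print(f"ice cream coefficient, temperature included: {b_long:.3f}")
```

With temperature omitted, the ice cream coefficient picks up the effect of temperature and looks clearly positive; once temperature is included, it falls to roughly zero.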

Another Example: Drought and Corn Yields

If you could measure soil quality

Reverse Causality

  • Sometimes the direction of cause and effect is unclear.
  • For example, do higher wages lead to higher productivity, or does higher productivity lead to higher wages?
  • This can lead to biased estimates if we do not account for the possibility of reverse causality.

Exploit timing

  • If we have data over time, we can use the timing of events to help establish causality.
  • For example, if wages increase after a productivity improvement, this supports the idea that productivity is driving wages rather than the other way around.

Measurement Error

  • If the independent variable is measured with random error, the coefficient estimate is biased toward zero (attenuation bias).
  • Suppose our drought variable is measured with error.
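The pull toward zero can be seen in a short simulation. The sketch below generates a hypothetical drought index and corn yields with a known slope, then adds random noise to the drought measure and re-estimates the slope. All parameter values are assumptions for illustration, and the noise is assumed to be unrelated to both true drought and yield.

```python
import random


def ols_slope(x, y):
    """Least squares slope for a simple regression of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


rng = random.Random(2)
n = 5000
true_beta = -2.0  # hypothetical true effect of drought on yield

# True drought and the resulting yields
drought = [rng.gauss(0.0, 1.0) for _ in range(n)]
corn_yield = [10.0 + true_beta * d + rng.gauss(0.0, 1.0) for d in drought]

# Drought as we observe it: true value plus independent measurement noise
drought_measured = [d + rng.gauss(0.0, 1.0) for d in drought]

slope_true = ols_slope(drought, corn_yield)
slope_noisy = ols_slope(drought_measured, corn_yield)

print(f"slope using true drought:     {slope_true:.2f}")
print(f"slope using measured drought: {slope_noisy:.2f}")
```

Here the measurement noise has the same variance as true drought, so the estimated slope shrinks to about half its true size. More noise would shrink it further.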

True relationship

Drought measured with error

Unobserved Differences

Places and people differ in ways that we may not be able to measure.

Corn yield varies across states because of:

  • soil quality

  • farming practices

  • irrigation infrastructure

If we ignore these differences, regression may attribute their effects to drought.

Drought Reduces Yield

What if the variation is driven by differences across states?

How economists address this

Economists try to create comparisons that mimic experiments.

Instead of comparing different states, we compare the same state in different years.

This helps control for things that do not change over time.

Fixed Effects Regression

Fixed effects regression compares each unit to itself over time.

Instead of asking: Do states with more drought have lower yields?

We ask: When drought becomes worse in a state, do yields fall in that same state?

This removes time-invariant differences like:

  • soil quality

  • elevation

Demeaning the data

  • One way to implement fixed effects regression is to demean the data within each state.
  • This means we subtract the state-specific mean from each observation, effectively centering the data around zero for each state.
  • The regression is then run on the demeaned data, which controls for any time-invariant differences across states.

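Starting from the model \( Y_{it} = \alpha_i + \beta \times D_{it} + \epsilon_{it} \), subtracting each state's mean removes the state-specific intercept \(\alpha_i\):

\[ Y_{it} - \bar{Y}_i = \beta \times (D_{it} - \bar{D}_i) + (\epsilon_{it} - \bar{\epsilon}_i) \]

Demeaned data

\[ Y_{it} - \bar{Y}_i = \beta \times (D_{it} - \bar{D}_i) + (\epsilon_{it} - \bar{\epsilon}_i) \]

FE Regression (demeaned data)

\[ Y_{it} - \bar{Y}_i = \beta \times (D_{it} - \bar{D}_i) + (\epsilon_{it} - \bar{\epsilon}_i) \]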
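The within-state demeaning described above can be sketched in a few lines of Python. The panel below is simulated: three hypothetical states with different baseline yields and different typical drought levels, plus a common true drought effect. All numbers and names are assumptions for illustration. The pooled regression compares across states; the fixed effects estimate uses only variation within each state.

```python
import random
from collections import defaultdict


def ols_slope(x, y):
    """Least squares slope for a simple regression of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return s_xy / s_xx


def demean_by_group(groups, values):
    """Subtract each group's own mean from its observations."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for g, v in zip(groups, values):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]


rng = random.Random(3)
true_beta = -1.5  # hypothetical within-state effect of drought on yield

# Hypothetical states: drier states also have lower baseline yields
baseline_yield = {"A": 180.0, "B": 150.0, "C": 120.0}
typical_drought = {"A": 1.0, "B": 2.0, "C": 3.0}

states, drought, corn_yield = [], [], []
for state in baseline_yield:
    for _year in range(40):
        d = typical_drought[state] + rng.gauss(0.0, 0.5)
        y = baseline_yield[state] + true_beta * d + rng.gauss(0.0, 1.0)
        states.append(state)
        drought.append(d)
        corn_yield.append(y)

# Pooled regression ignores state differences
pooled_slope = ols_slope(drought, corn_yield)

# Fixed effects via demeaning: regress demeaned yield on demeaned drought
fe_slope = ols_slope(demean_by_group(states, drought),
                     demean_by_group(states, corn_yield))

print(f"pooled slope:        {pooled_slope:.2f}")
print(f"fixed effects slope: {fe_slope:.2f}")
```

In this setup the pooled slope is far more negative than the true effect because it attributes the baseline yield differences to drought, while the demeaned regression recovers a value close to the assumed effect.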

Discussion

How does the fixed effects regression relate to a series of state-specific regressions?

Summary

  • Regression can be a powerful tool for understanding relationships between variables, but it can also be misleading if we are not careful.
  • Omitted variables, reverse causality, and unobserved differences can all lead to biased estimates.
  • Economists use techniques like fixed effects regression to try to control for these issues and get closer to the true causal relationships.