Week 7: Regression as a Tool for Answering Questions

A monthly visitation series shows strong summer peaks, a gradual upward trend, and a sharp drop in 2020.

Why might we decompose the series before forecasting?

A. To eliminate randomness completely

B. To separate long-run movement from seasonal patterns

C. To guarantee a more accurate forecast

D. To avoid using a training/test split

Why do we evaluate a forecasting model on a test set instead of the same data used to estimate it?

A. Because forecasting software requires it

B. To reduce computation time

C. To simulate how the model performs on unseen future data

D. To remove seasonality

Explanatory Data Analysis

Many questions in economics and business are about explaining relationships between variables.
We can generate evidence of these relationships using regression analysis.
There are many other methods but regression is a good starting point.

Motivating Questions

Do fuel stations closer to highways charge higher prices?
Does advertising increase sales?
Do higher wages lead to higher productivity?

Running example:
Corn Yield and Drought

What is the effect of drought on corn yields?

%%{init: {"themeVariables": {"fontSize": "24px"}}}%%
flowchart LR
  G[Goals] ==> P[Problem]
  P ==> Q[Question]
  Q ==> Da[Data]
  Da ==> M[Model]
  M ==> R[Result]
  R ==> D[Decision]

  style M fill:#FFD966,stroke:#333,stroke-width:3px,color:#000

We seek a method that tells us how much corn yields change when drought conditions change.

Regression is a statistical method for understanding relationships between quantitative variables.

Example: Gas prices and highway proximity

Do fuel stations closer to highways charge higher prices?
Can you tell by looking at a map of gas stations and highways?
https://www.gasbuddy.com/gasprices/wyoming/cheyenne

Explanation vs. Prediction

Regression can be used for both explanation and prediction.
Explanation focuses on understanding the relationship between variables (e.g., how does drought affect corn yields?).
Prediction focuses on accurately forecasting future values (e.g., what will corn yields be next year?).
A good explanation may also provide a good prediction, but this is not guaranteed.

How does regression work?

Regression estimates the relationship between a dependent variable and one or more independent variables.
It quantifies how much the dependent variable changes when the independent variable(s) change.
It does this by “fitting a line” through the data points that best captures the relationship.
What do we mean by “best” fit?

Regression demo

What is the equation of a line?
The data give us our x and y values, which we treat as coordinates.
We can visualize the data as points on a scatterplot.
We want to find the line that best fits those points.

Go to R

Regression math

The regression line is defined by the equation: \[ y = \alpha + \beta x + \epsilon \]
- $\alpha$ is the intercept (the value of $y$ when $x=0$)
- $\beta$ is the slope (the change in $y$ for a one-unit change in $x$)
- $\epsilon$ is the error term (the difference between the observed and predicted values)

Rearranging gives us: \[ y - \alpha - \beta x = \epsilon \]

The regression line is the one that minimizes the sum of squared errors: \[ \min_{\alpha, \beta} \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 \]
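To make "best fit" concrete, here is a minimal Python sketch that scores two candidate lines by their sum of squared errors. The data points and both candidate lines are invented for illustration; they do not come from the gas price or corn yield examples.

```python
# Hypothetical data points, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def sum_squared_errors(alpha, beta, xs, ys):
    """Sum of squared vertical gaps between the points and the line alpha + beta * x."""
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(xs, ys))


# Two hypothetical candidate lines.
sse_a = sum_squared_errors(0.0, 2.0, x, y)  # line A: y = 0 + 2x
sse_b = sum_squared_errors(1.0, 1.5, x, y)  # line B: y = 1 + 1.5x

print(f"SSE for line A: {sse_a:.2f}")
print(f"SSE for line B: {sse_b:.2f}")
print("Better fit by this criterion:", "A" if sse_a < sse_b else "B")
```

Least squares regression searches over every possible intercept and slope, not just two candidates, and keeps the pair with the smallest sum.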

Differentiating with respect to $\alpha$ and $\beta$ and setting the derivatives to zero gives us the normal equations, which we can solve to find the estimates of $\alpha$ and $\beta$.

\[ \begin{cases} \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0 \\ \sum_{i=1}^n x_i (y_i - \alpha - \beta x_i) = 0 \end{cases} \]

We have 2 equations with 2 unknowns. Solving these equations yields the least squares estimates: \[ \hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \] \[ \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \]
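
One way to fill in the algebra between the normal equations and these estimates, using the same symbols as above: dividing the first normal equation by $n$ gives \[ \bar{y} - \alpha - \beta \bar{x} = 0 \quad \Rightarrow \quad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} \] Substituting this expression for $\alpha$ into the second normal equation gives \[ \sum_{i=1}^n x_i \left[ (y_i - \bar{y}) - \hat{\beta} (x_i - \bar{x}) \right] = 0 \quad \Rightarrow \quad \hat{\beta} = \frac{\sum_{i=1}^n x_i (y_i - \bar{y})}{\sum_{i=1}^n x_i (x_i - \bar{x})} \] Because $\sum_{i=1}^n (y_i - \bar{y}) = 0$ and $\sum_{i=1}^n (x_i - \bar{x}) = 0$, replacing $x_i$ with $(x_i - \bar{x})$ in the numerator and the denominator leaves both sums unchanged, which yields the formula for $\hat{\beta}$ above.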

$\hat{\beta}$ is also the sample covariance of $x$ and $y$ divided by the sample variance of $x$: \[ \hat{\beta} = \frac{\text{Cov}(x, y)}{\text{Var}(x)} \]
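
The formulas above can be checked by hand on a small dataset. The following is a minimal Python sketch using invented data (not the course data); it computes the least squares estimates from the closed-form expressions and confirms that the slope equals the sample covariance divided by the sample variance.

```python
from statistics import mean

# Hypothetical data points, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)

# Numerator and denominator of the least squares slope formula.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

beta_hat = sxy / sxx
alpha_hat = y_bar - beta_hat * x_bar

# Same slope written as sample covariance over sample variance.
n = len(x)
cov_xy = sxy / (n - 1)
var_x = sxx / (n - 1)

# The first normal equation says the residuals sum to zero.
residual_sum = sum(yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y))

print(f"slope: {beta_hat:.3f}, intercept: {alpha_hat:.3f}")
print(f"Cov/Var: {cov_xy / var_x:.3f}, sum of residuals: {residual_sum:.2e}")
```

The $(n - 1)$ terms cancel in the ratio, which is why the two ways of writing the slope agree.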

Interpreting Regression Results

Regression Results Overview

The regression results provide estimates of the coefficients (intercept and slope) along with their standard errors, t-values, and p-values.
The output is generally reported in a table.
You need to understand how to read and interpret this table to draw meaningful conclusions from the regression analysis.

Example regression output: Estimate

Variable	Estimate
x	-4.23
Intercept	4.48

This is just an estimate of the true relationship between x and y. We need to understand uncertainty.

Example regression output: Standard Errors

Variable	Estimate	Std. Error
x	-4.23	0.23
Intercept	4.48	0.27

Does 0.23 seem small or large relative to the estimate of -4.23?

The standard error indicates typical sampling variation around the estimate.

Hypothesis Testing

We need a way to determine if the observed relationship is statistically significant or if it could have occurred by random chance.
We use a standardized test statistic (t-statistic) to test the null hypothesis that the true coefficient is zero (no relationship). \[ t = \frac{\hat{\beta} - 0}{SE(\hat{\beta})} \]

Null hypothesis ($H_0$: $\beta = 0$): $t = (\hat{\beta} - 0)/SE(\hat{\beta})$

Example regression output: t-statistics

Variable	Estimate	Std. Error	t-stat
x	-4.23	0.23	-18.39
Intercept	4.48	0.27	16.59

Example regression output: p-values

Variable	Estimate	Std. Error	t-stat	p-value
x	-4.23	0.23	-18.39	<0.001
Intercept	4.48	0.27	16.59	<0.001

Interpreting Coefficient Estimates

The coefficient estimates tell us the expected change in the dependent variable for a one-unit change in the independent variable.
In our example, the slope of -4.23 means that for every one-unit increase in x, y is expected to decrease by 4.23 units.
If x were miles from the highway and y were the gas price in cents, this would mean that for every additional mile from the highway, gas prices are expected to decrease by 4.23 cents.
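
As a small illustration of this reading of the slope, the Python sketch below plugs the example estimates (intercept 4.48, slope -4.23) into the fitted line and compares predictions one unit apart. The function name `predict` is just a label chosen here.

```python
# Estimates taken from the example regression output above.
ALPHA_HAT = 4.48   # intercept
BETA_HAT = -4.23   # slope


def predict(x):
    """Predicted y on the fitted line for a given x."""
    return ALPHA_HAT + BETA_HAT * x


# Moving x from 1 to 2 changes the prediction by exactly the slope.
change = predict(2.0) - predict(1.0)

print(f"prediction at x = 0: {predict(0.0):.2f}")
print(f"change in prediction for a one-unit increase in x: {change:.2f}")
```

The same difference appears for any pair of x values one unit apart, because the fitted relationship is a straight line.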

Estimating Uncertainty: Standard Errors

The estimates are just that: estimates. They are subject to sampling variability.
What if we repeated the study with a different sample of gas stations, or in a different period of time? Would we get the same estimates?
The standard error gives us a measure of how much the estimate would vary across different samples.
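
The idea of repeating the study can be mimicked with a simulation. This is a sketch under stated assumptions: the "true" intercept and slope are set to the example values, the noise level, sample size, and range of x are invented, and the random seed is arbitrary.

```python
import random
from statistics import mean, stdev


def ols_slope(xs, ys):
    """Least squares slope from a regression of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    return sxy / sxx


rng = random.Random(7)                 # arbitrary seed so the run is repeatable
TRUE_ALPHA, TRUE_BETA = 4.48, -4.23    # assumed "true" relationship
N_STATIONS, N_SAMPLES = 50, 500        # hypothetical sample size and number of repeats

slopes = []
for _ in range(N_SAMPLES):
    # Draw a fresh hypothetical sample and re-estimate the slope.
    xs = [rng.uniform(0, 2) for _ in range(N_STATIONS)]
    ys = [TRUE_ALPHA + TRUE_BETA * xi + rng.gauss(0, 1) for xi in xs]
    slopes.append(ols_slope(xs, ys))

print(f"average estimated slope: {mean(slopes):.2f}")
print(f"spread of the estimates (std. dev.): {stdev(slopes):.2f}")
```

The spread of the slope estimates across the repeated samples is what a standard error tries to approximate from a single sample.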

Hypothesis Testing

We make assumptions about the error term $\epsilon$ (e.g., it is normally distributed with mean 0 and constant variance) that allow us to perform hypothesis tests on the coefficients.
The null hypothesis is typically that the coefficient is equal to zero (no effect).
The t-statistic is the estimate divided by its standard error, and the p-value is the probability of observing a t-statistic at least this extreme if the null hypothesis were true.
In our example, the t-statistic of -18.39 for the slope indicates that the coefficient is highly statistically significant (p < 0.001), providing strong evidence of a relationship between distance from the highway and gas prices.
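
The arithmetic behind the t-statistics in the example table can be reproduced in a few lines of Python. The p-value here is a sketch that uses a large-sample normal approximation, which is an assumption made to keep the code in the standard library; regression software typically uses a t distribution, so small differences from reported p-values are expected.

```python
from statistics import NormalDist


def t_statistic(estimate, std_error, null_value=0.0):
    """Distance between the estimate and the null value, in standard errors."""
    return (estimate - null_value) / std_error


def two_sided_p_value_normal(t):
    """Two-sided p-value under a standard normal approximation (an assumption)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Estimates and standard errors from the example regression output.
t_slope = t_statistic(-4.23, 0.23)
t_intercept = t_statistic(4.48, 0.27)
p_slope = two_sided_p_value_normal(t_slope)

print(f"t for slope: {t_slope:.2f}")
print(f"t for intercept: {t_intercept:.2f}")
print("slope p-value below 0.001:", p_slope < 0.001)
```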

Dependent variable: Gas Price ($/gallon)

Variable	Estimate	Std. Error	t-stat	p-value
Distance to Highway (miles)	-0.04	0.02	-2.00	0.048
Constant	3.50	0.10	35.00	<0.001

What does the coefficient on distance to highway mean?

For every additional mile from the highway…

A. gas prices are expected to decrease by 4 cents.

B. gas prices are expected to decrease by 0.02 cents.

C. gas prices are expected to increase by 3.50 dollars.

D. gas prices are expected to decrease by 4 dollars.

Dependent variable: Daily Mountain Bike Trail Visits

Variable	Estimate	Std. Error	t-stat	p-value
Temperature (°F)	12.5	8.3	1.51	0.130
Constant	-300.0	150.0	-2.00	0.048

Which of the following is true about the relationship between temperature and daily mountain bike trail visits?

A. Each 1°F increase in temperature causes 12.5 more trail visits per day.

B. A 1°F increase in temperature is associated with 12.5 more visits per day, and this relationship is statistically significant at the 5% level.

C. A 1°F increase in temperature is associated with 12.5 more visits per day, but the estimate is not statistically significant at the 5% level.

D. Temperature has no relationship with trail visits because the p-value is greater than 0.05.

Summary

Regression is a powerful tool for quantifying relationships between variables.
Coefficient estimates tell us the expected change in the dependent variable for a one-unit change in the independent variable.
Standard errors and hypothesis tests help us understand the uncertainty around our estimates.
Interpretation of regression results requires careful consideration of the context and the assumptions underlying the model. (Pirates vs. global warming example)

Lab Preview

We will use regression to analyze the relationship between drought conditions and corn yields.
We will learn how to fit a regression model, interpret the output, and evaluate the model’s assumptions and performance.

Part 2

Dependent variable: Daily Mountain Bike Trail Visits

Variable	Estimate	Std. Error	t-stat	p-value
Temperature (°F)	12.5	8.3	1.51	0.130
Constant	-300.0	150.0	-2.00	0.048

The coefficient on temperature means:

A. A 1°F increase in temperature is associated with an increase of 12.5 daily trail visits, on average.

B. A 1°F increase in temperature changes trail visits by 8.3 per day.

C. A 1°F increase in temperature changes trail visits by 0.130 per day.

D. When temperature is 0°F, trail visits increase by 12.5 per day.

Dependent variable: Daily Mountain Bike Trail Visits

Variable	Estimate	Std. Error	t-stat	p-value
Temperature (°F)	12.5	8.3	1.51	0.130
Constant	-300.0	150.0	-2.00	0.048

Which of the following is true about the relationship between temperature and daily mountain bike trail visits?

A. Each 1°F increase in temperature causes 12.5 more trail visits per day.

B. A 1°F increase in temperature is associated with 12.5 more visits per day, but the estimate is not statistically significant at the $\alpha = 0.05$ level.

C. A 1°F increase in temperature is associated with 12.5 more visits per day, and this relationship is statistically significant at the $\alpha = 0.05$ level.

D. If temperature were 0°F, the model would predict 0 visits.

Dependent variable: Daily Mountain Bike Trail Visits

Variable	Estimate	Std. Error	t-stat	p-value
Temperature (°F)	12.5	8.3	1.51	0.130
Constant	-300.0	150.0	-2.00	0.048

The p-value tells you:

A. The probability of getting a test statistic this extreme (or more extreme) if the true coefficient were 0.

B. The probability that the null hypothesis is true.

C. The probability that the true temperature effect is exactly 12.5 visits per day.

D. The probability that temperature causes changes in trail visits.

Recap

Regression is a useful tool for quantifying relationships between variables (explanatory analysis).
Coefficient estimates tell us the expected change in the dependent variable for a one-unit change in the independent variable.
Standard errors and hypothesis tests help us understand the uncertainty around our estimates.

Regression can mislead

Regression finds patterns in the data.

But patterns do not always reflect causal relationships.

A regression coefficient can be misleading when:

Important variables are missing
The direction of cause and effect is unclear
Differences across places or people are ignored

Problem 1: Omitted Variables

Suppose we estimate:

\[\text{Gas Price} = \alpha + \beta \times \text{Distance to Highway} + \epsilon\]

But what if we are missing an important variable that explains variation in gas prices?

neighborhood income
competition from nearby stations
traffic volume

If these factors affect both distance and price, the estimate of $\beta$ can be biased.

Omitted Variable Example

Estimate a regression of shark attacks on ice cream sales ($1000s).

Variable	Estimate	Std. Error	t-stat	p-value
Ice Cream	2.1	0.42	5.00	<0.001

Interpret the coefficient on ice cream sales.

What could be missing from this regression that would explain the relationship between ice cream sales and shark attacks?

Omitted Variable Bias

Omitting a variable correlated with both the independent and the dependent variable can bias the estimates of the regression coefficients.
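
A simulation can show the bias directly. In the sketch below every number is invented: a hypothetical third variable (labeled temperature here) drives both ice cream sales and shark attacks, and ice cream has no effect of its own. Controlling for the third variable is done by removing its influence from both series first and then regressing the leftovers, which gives the same slope as including it in the regression.

```python
import random
from statistics import mean


def ols(xs, ys):
    """Return (intercept, slope) from a least squares regression of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys):
    """Part of ys that is left over after a regression on xs."""
    intercept, slope = ols(xs, ys)
    return [yi - intercept - slope * xi for xi, yi in zip(xs, ys)]


rng = random.Random(11)  # arbitrary seed so the run is repeatable
n = 2000

# Hypothetical data: temperature drives both variables; ice cream has no effect on sharks.
temp = [rng.gauss(0, 1) for _ in range(n)]
ice = [2 * t + rng.gauss(0, 1) for t in temp]
sharks = [3 * t + rng.gauss(0, 1) for t in temp]

# Regression that omits temperature.
_, short_slope = ols(ice, sharks)

# Regression that controls for temperature (via residuals).
_, long_slope = ols(residuals(temp, ice), residuals(temp, sharks))

print(f"slope when temperature is omitted: {short_slope:.2f}")
print(f"slope when temperature is controlled for: {long_slope:.2f}")
```

In this setup the slope from the regression without temperature is far from zero even though the simulated effect of ice cream is zero by construction.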

Another Example: Drought and Corn Yields

If you could measure soil quality

Reverse Causality

Sometimes the direction of cause and effect is unclear.
For example, do higher wages lead to higher productivity, or does higher productivity lead to higher wages?
This can lead to biased estimates if we do not account for the possibility of reverse causality.

Exploit timing

If we have data over time, we can use the timing of events to help establish causality.
For example, if wages increase after a productivity improvement, this supports the idea that productivity is driving wages rather than the other way around.

Measurement Error

If the independent variable is measured with error, it can bias the coefficient estimates toward zero.
Suppose our drought variable is measured with error.

True relationship

Drought measured with error
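
The pull toward zero can be seen in a small simulation. This is a sketch with invented numbers: the assumed true effect of drought on yield is -2, and the recorded drought variable equals the true one plus random noise of the same variance.

```python
import random
from statistics import mean


def ols_slope(xs, ys):
    """Least squares slope from a regression of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    return sxy / sxx


rng = random.Random(5)  # arbitrary seed so the run is repeatable
n = 5000
TRUE_BETA = -2.0        # assumed true effect of drought on yield

# Hypothetical true drought and the resulting yields.
drought_true = [rng.gauss(0, 1) for _ in range(n)]
corn_yield = [10 + TRUE_BETA * d + rng.gauss(0, 0.5) for d in drought_true]

# What we actually record: true drought plus measurement noise.
drought_observed = [d + rng.gauss(0, 1) for d in drought_true]

slope_true_x = ols_slope(drought_true, corn_yield)
slope_noisy_x = ols_slope(drought_observed, corn_yield)

print(f"slope using true drought: {slope_true_x:.2f}")
print(f"slope using noisy drought: {slope_noisy_x:.2f}")
```

With noise as variable as the true measure, the estimated slope shrinks to roughly half of the assumed effect; less noise would shrink it less.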

Unobserved Differences

Places and people differ in ways that we may not be able to measure.

Corn yield varies across states because of:

soil quality
farming practices
irrigation infrastructure

If we ignore them, regression may attribute these differences to drought.

Drought Reduces Yield

What if it's variation by state?

How economists address this

Economists try to create comparisons that mimic experiments.

Instead of comparing different states, we compare the same state in different years.

This helps control for things that do not change over time.

Fixed Effects Regression

Fixed effects regression compares each unit to itself over time.

Instead of asking: Do states with more drought have lower yields?

We ask: When drought becomes worse in a state, do yields fall in that same state?

This removes time-invariant differences like:

soil quality
elevation

Demeaning the data

One way to implement fixed effects regression is to demean the data within each state.
This means we subtract the state-specific mean from each observation, effectively centering the data around zero for each state.
The regression is then run on the demeaned data, which controls for any time-invariant differences across states.

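Starting from the model with a state-specific intercept, \[ Y_{it} = \alpha_i + \beta \times D_{it} + \epsilon_{it} \] subtracting each state's average over time removes $\alpha_i$, because it does not change over time: \[ Y_{it} - \bar{Y}_i = \beta \times (D_{it} - \bar{D}_i) + (\epsilon_{it} - \bar{\epsilon}_i) \]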

Demeaned data

\[ Y_{it} - \bar{Y}_i = \beta \times (D_{it} - \bar{D}_i) + (\epsilon_{it} - \bar{\epsilon}_i) \]

FE Regression (demeaned data)

\[ Y_{it} - \bar{Y}_i = \beta \times (D_{it} - \bar{D}_i) + (\epsilon_{it} - \bar{\epsilon}_i) \]
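
The demeaning steps can be sketched in Python on simulated data. Everything here is hypothetical: 50 made-up states observed for 30 years, an unobserved soil quality that raises yields and also happens to be higher in drought-prone states, and an assumed true drought effect of -10.

```python
import random
from statistics import mean


def ols_slope(xs, ys):
    """Least squares slope from a regression of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    return sxy / sxx


rng = random.Random(3)  # arbitrary seed so the run is repeatable
N_STATES, N_YEARS, TRUE_BETA = 50, 30, -10.0

rows = []  # one (state, drought, yield) tuple per state-year
for state in range(N_STATES):
    soil = rng.gauss(0, 1)            # unobserved and fixed over time
    state_effect = 150 + 20 * soil    # plays the role of alpha_i
    for _ in range(N_YEARS):
        drought = soil + rng.gauss(0, 1)  # drought is correlated with soil quality
        corn_yield = state_effect + TRUE_BETA * drought + rng.gauss(0, 2)
        rows.append((state, drought, corn_yield))

# Pooled regression: compares across states and ignores state differences.
pooled_slope = ols_slope([d for _, d, _ in rows], [y for _, _, y in rows])

# Fixed effects: subtract each state's own mean, then regress.
d_demeaned, y_demeaned = [], []
for state in range(N_STATES):
    d_s = [d for s, d, _ in rows if s == state]
    y_s = [y for s, _, y in rows if s == state]
    d_bar, y_bar = mean(d_s), mean(y_s)
    d_demeaned += [d - d_bar for d in d_s]
    y_demeaned += [y - y_bar for y in y_s]

fe_slope = ols_slope(d_demeaned, y_demeaned)

print(f"pooled slope: {pooled_slope:.2f}")
print(f"fixed effects slope: {fe_slope:.2f}")
```

In this setup the pooled slope mixes the drought effect with soil quality, while the slope from the demeaned data lands close to the assumed effect because soil quality is constant within each state.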

Discussion

How does the fixed effects regression relate to a series of state-specific regressions?

Summary

Regression can be a powerful tool for understanding relationships between variables, but it can also be misleading if we are not careful.
Omitted variables, reverse causality, and unobserved differences can all lead to biased estimates.
Economists use techniques like fixed effects regression to try to control for these issues and get closer to the true causal relationships.
