Week 10: Midterm Review

Review Goals

Clarify the exam format and expectations
Revisit the core logic from modules 1-7
Practice interpreting models and communicating insights in plain language

Exam Parameters

The exam is on Friday in lab and lasts about 1 hour. No calculators. You may bring 1 page of notes, and you need a blue book.
Expect 10 multiple choice, 3 multipart analytical questions
Provide an answer to every question. Demonstrate your reasoning, even if you are uncertain about the final answer. Partial credit is possible.
Write clearly. I need to be able to read your answer.
The exam is worth 30% of your grade
The top score will be set to 100%, and every other score will be set relative to it
Covers readings, slides, and labs from modules 1-7

What Strong Answers Do

Answer the question that was actually asked
State the result in plain language
Reference specific evidence from the prompt
Acknowledge uncertainty or limitations when relevant

Big Picture: Course Logic

%%{init: {"themeVariables": {"fontSize": "24px"}}}%%
flowchart LR
  G[Goals] ==> P[Problem]
  P ==> Q[Question]
  Q ==> Da[Data]
  Da ==> M[Model]
  M ==> R[Result]
  R ==> D[Decision]

Module 1: Data + Analyst = Insight

Data are inputs, not conclusions
Analysis matters only if it informs a real problem or decision
Communication is part of analysis, not an optional final step
On the exam, do not stop at “the coefficient is negative”; explain what it means in context

Module 2: Asking the Right Question

Start from the problem and decision, not the dataset
A good question is decision-relevant, answerable, and specific about who, where, when, and compared to what
Know the difference between descriptive, exploratory, predictive, causal, and mechanistic questions
Many weak answers come from answering the wrong type of question

A prompt asks: “What would happen to corn yields if drought intensity increased next season?” What type of question is this?

A. Descriptive

B. Exploratory

C. Predictive

D. Causal (counterfactual)

Module 3: Data Processing Part 1

Processing aligns raw data with the question
Tidy data makes variables, observations, and units explicit
Missing data, outliers, and aggregation are analytical choices with consequences
Different processing choices can change the pattern you see and the claim you can defend
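
A minimal sketch of how processing choices change a summary, written in plain Python. The yield values, the missing report, and the 300-bushel outlier cutoff are all made up for illustration; they are not course data.

```python
from statistics import mean, median

# Hypothetical county corn yields (bushels per acre); None marks a missing report.
# The 410.0 value is a made-up outlier, possibly a data-entry error.
raw_yields = [172.0, 168.0, None, 175.0, 410.0, 170.0]

# Choice 1: drop missing values and keep the outlier.
dropped = [y for y in raw_yields if y is not None]

# Choice 2: also drop the suspected outlier (assumed cutoff: anything above 300).
trimmed = [y for y in dropped if y <= 300]

# Choice 3: treat missing as zero (usually a mistake, shown for contrast).
zero_filled = [0.0 if y is None else y for y in raw_yields]

print("mean, missing dropped:  ", mean(dropped))
print("median, missing dropped:", median(dropped))
print("mean, outlier removed:  ", mean(trimmed))
print("mean, missing as zero:  ", mean(zero_filled))
```

The same six records give a "typical yield" anywhere from about 171 to 219 depending on the choice, which is why the choice needs to be stated and defended.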

Module 4: Data Processing Part 2

Let the target unit of observation define the join
Keys matter: duplicates or mismatched units can silently distort results
inner_join() keeps only rows with a match in both tables, so it can change the sample; left_join() keeps every row of the primary (left) table
Before merging, check uniqueness, coverage, and whether the rows really mean the same thing
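
The pre-merge checks can be sketched as follows. This is plain Python standing in for dplyr's inner_join() and left_join(), not the course's R code; the tables, the column names, and the helper functions `key`, `duplicate_keys`, and `join` are hypothetical.

```python
from collections import Counter

# Hypothetical primary table: one row per county-year (the target unit).
yields = [
    {"county": "Adams", "year": 2020, "yield_bu": 172.0},
    {"county": "Boone", "year": 2020, "yield_bu": 168.0},
    {"county": "Clark", "year": 2020, "yield_bu": 175.0},
]

# Hypothetical secondary table: Boone appears twice, Clark is missing.
drought = [
    {"county": "Adams", "year": 2020, "severity": 1},
    {"county": "Boone", "year": 2020, "severity": 2},
    {"county": "Boone", "year": 2020, "severity": 3},
]

def key(row):
    """Join key: the (county, year) pair."""
    return (row["county"], row["year"])

def duplicate_keys(rows):
    """Keys that appear more than once, which would multiply rows in a join."""
    counts = Counter(key(r) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def join(left, right, keep_unmatched):
    """Join on (county, year). keep_unmatched=True mimics a left join,
    keep_unmatched=False mimics an inner join."""
    out = []
    for lrow in left:
        matches = [r for r in right if key(r) == key(lrow)]
        if matches:
            out.extend({**lrow, **r} for r in matches)
        elif keep_unmatched:
            out.append({**lrow, "severity": None})
    return out

dupes = duplicate_keys(drought)
right_keys = {key(r) for r in drought}
coverage = sum(key(lrow) in right_keys for lrow in yields) / len(yields)

inner = join(yields, drought, keep_unmatched=False)
left = join(yields, drought, keep_unmatched=True)

print("duplicate keys:", dupes)
print("share of primary rows with a match:", round(coverage, 2))
print("rows: primary =", len(yields), "inner =", len(inner), "left =", len(left))
```

In this made-up example the inner join drops Clark, and the duplicated Boone key adds a row to both joins, so neither result has one row per county-year until the duplicate is resolved.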

Module 5: Exploratory Data Analysis

EDA helps you understand structure, distributions, anomalies, and relationships
EDA can help generate hypotheses and refine the question
Summary statistics and visuals should tell a coherent story

Module 6: Forecasting

A forecast is a transparent prediction about future values based on past patterns
Evaluate forecasts on held-out data, not the same data used to fit the model
Benchmarks matter; more complex models are not automatically better
A forecast is more than a point estimate; the uncertainty around that estimate is often more important for decision-making
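
A minimal sketch of held-out evaluation against simple benchmarks, in plain Python. The visitation series is made up, and the three methods (naive, mean, drift) are common textbook benchmarks chosen here for illustration, not necessarily the ones used in lab.

```python
# Hypothetical monthly visitation counts (thousands), made up for illustration.
series = [100, 104, 110, 113, 120, 124, 130, 133, 140, 144, 150, 153]

# Hold out the last 3 months; fit only on the first 9.
train, test = series[:9], series[9:]

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Benchmark 1: naive forecast repeats the last training value.
naive_fc = [train[-1]] * len(test)

# Benchmark 2: mean forecast repeats the training average.
train_mean = sum(train) / len(train)
mean_fc = [train_mean] * len(test)

# Candidate: drift forecast extends the average month-to-month change in training.
drift = (train[-1] - train[0]) / (len(train) - 1)
drift_fc = [train[-1] + drift * h for h in range(1, len(test) + 1)]

for name, fc in [("naive", naive_fc), ("mean", mean_fc), ("drift", drift_fc)]:
    print(f"{name:>5} MAE on held-out months: {mae(test, fc):.2f}")
```

Every error here is computed on months the methods never saw, so the comparison reflects forecasting performance rather than fit to past data.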

Practice: Communicating a Forecast

Suppose the forecast for July visitation is 220,000 with an 80% prediction interval of [190,000, 250,000].

Weak answer: “July visitation will be 220,000.”
Stronger answer: “Expected July visitation is around 220,000, but a reasonable range is 190,000 to 250,000.”
Best decision-oriented answer: “If understaffing is costly, planning should account for outcomes near the upper end of the interval.”
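
To make the decision-oriented answer concrete, the interval can be turned into a rough probability. This sketch assumes forecast errors are approximately normal and the interval is symmetric, which the slide does not state; the 240,000 staffing capacity is a hypothetical number.

```python
from statistics import NormalDist

point = 220_000
lower, upper = 190_000, 250_000   # 80% prediction interval from the example

# Assumption: normal forecast errors, so a central 80% interval
# runs from the 10th to the 90th percentile.
z80 = NormalDist().inv_cdf(0.90)
sigma = (upper - point) / z80     # implied forecast standard deviation

forecast = NormalDist(mu=point, sigma=sigma)

# Chance that visitation exceeds a hypothetical staffing capacity.
capacity = 240_000
p_exceed = 1 - forecast.cdf(capacity)

print(f"implied standard deviation: {sigma:,.0f}")
print(f"P(visitation > {capacity:,}): {p_exceed:.2f}")
```

Under these assumptions, staffing for 240,000 visitors still leaves roughly a one-in-five chance of being understaffed, which is the kind of statement a manager can act on.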

Module 7: Regression Interpretation

Regression quantifies the relationship between an outcome and one or more predictors
A coefficient is the estimated change in the outcome for a one-unit change in the predictor (explanatory variable), holding the other predictors constant
Standard errors measure uncertainty around estimates
t-statistics and p-values summarize how strongly the data are incompatible with a zero relationship
Statistical significance is not the same thing as practical importance

Practice: Interpreting Regression Output

Outcome: Corn yield (bushels per acre)

Variable	Estimate	Std. Error	p-value
Drought severity	-3.2	1.1	0.006
Soil quality	5.0	2.0	0.015
Hail damage	-1.5	2.0	0.45
Intercept	178.5	4.8	<0.001

A strong written answer would say:

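A one-unit increase in drought severity is associated with about 3.2 fewer bushels per acre, holding soil quality and hail damage constant
The estimate is fairly precise: the standard error (1.1) is about a third of the estimate's magnitude
The p-value (0.006) indicates that an estimate this large would be unlikely if the true relationship were zero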
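
The numbers behind those statements can be checked by hand. The sketch below uses the estimates and standard errors from the table above and a normal approximation (z = 1.96) for the 95% interval, which is an assumption because the sample size is not given. The helper names `summarize` and `predict` and the field values passed to `predict` are hypothetical.

```python
# Estimates and standard errors copied from the regression table.
coefs = {
    "Drought severity": (-3.2, 1.1),
    "Soil quality": (5.0, 2.0),
    "Hail damage": (-1.5, 2.0),
    "Intercept": (178.5, 4.8),
}

def summarize(estimate, std_error, z=1.96):
    """Return the t-statistic and an approximate 95% confidence interval."""
    t_stat = estimate / std_error
    ci = (estimate - z * std_error, estimate + z * std_error)
    return t_stat, ci

for name, (est, se) in coefs.items():
    t_stat, (lo, hi) = summarize(est, se)
    excludes_zero = lo > 0 or hi < 0
    print(f"{name}: t = {t_stat:.2f}, approx. 95% CI = [{lo:.1f}, {hi:.1f}], "
          f"excludes zero: {excludes_zero}")

def predict(drought, soil, hail):
    """Predicted yield (bushels per acre) from the table's point estimates."""
    return 178.5 - 3.2 * drought + 5.0 * soil - 1.5 * hail

# Hypothetical field: drought severity 2, soil quality 1, no hail damage.
print("predicted yield:", round(predict(2, 1, 0), 1))
```

The hail damage interval spans zero, which matches its large p-value: the data cannot distinguish that effect from no effect at all.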

When Regression Misleads

Omitted variables can create misleading relationships
Reverse causality makes direction ambiguous
Measurement error can bias estimates
Unobserved differences across places or people can distort pooled comparisons
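
A small simulation can show how an omitted variable creates a misleading relationship. Everything here is made up: in this simulated world irrigation raises both fertilizer use and yield, while fertilizer has no direct effect on yield. The helpers `slope` and `residuals` are hypothetical names, and the adjustment uses the residual-on-residual approach (Frisch-Waugh-Lovell), which gives the same coefficient a multiple regression would.

```python
import random

random.seed(0)

def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after a simple regression on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]

n = 2000
# Simulated data: irrigation drives both variables; fertilizer has NO effect on yield.
irrigation = [random.gauss(0, 1) for _ in range(n)]
fertilizer = [z + random.gauss(0, 1) for z in irrigation]
crop_yield = [2.0 * z + random.gauss(0, 1) for z in irrigation]

# Regression that omits irrigation.
naive_slope = slope(fertilizer, crop_yield)

# Regression that controls for irrigation: remove irrigation from both
# variables, then relate what is left over.
adjusted_slope = slope(
    residuals(irrigation, fertilizer),
    residuals(irrigation, crop_yield),
)

print(f"fertilizer slope, irrigation omitted:    {naive_slope:.2f}")
print(f"fertilizer slope, irrigation controlled: {adjusted_slope:.2f}")
```

The omitted-variable regression reports a sizable positive fertilizer effect that shrinks toward zero once irrigation is controlled for, even though the data were built with no fertilizer effect.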

Study Priorities

Identify the question type and decision context
Explain how data structure affects interpretation
Communicate insights from tables and figures
Distinguish prediction, explanation, and causation
Communicate uncertainty honestly
Make defensible claims without overclaiming

Final Advice

Read: Yusuke Kuwayama, Alexandra Thompson, Richard Bernknopf, Benjamin Zaitchik, and Peter Vail, “Estimating the Impact of Drought on Agriculture Using the U.S. Drought Monitor,” American Journal of Agricultural Economics, Volume 101, Issue 1, January 2019, Pages 193–210.
Read the prompt carefully before writing
Write like you are explaining the result to an intelligent manager or policymaker
If you are uncertain, explain your reasoning and state the limitation