You’ve asked a question.
You’ve found data.
But the data don’t come ready to analyze.
Raw data are messy, incomplete, and often collected for a purpose other than answering your question.
Raw Data → Data Processing → Structured Data → Analysis → Insight
Data processing is the set of decisions that transform raw data into a dataset that can credibly answer your question.
“Garbage In, Garbage Out”
Your question determines:
Data reflect how they were collected. They encode assumptions about what matters, who to measure, and what to record.
Processing involves judgment. Two analysts with the same raw data can make different (reasonable) processing choices.
Choices have consequences. Processing decisions can impact results without leaving obvious traces.
This means data processing deserves the same care, documentation, and transparency as your analysis.
Which one of these does NOT make data tidy?
A. Put each variable in its own column
B. Put each observation in its own row
C. Store measurements for different observational units (e.g., individuals and households) together in the same table
D. Use a column to record the measurement type (e.g., crop)
Tidy data: structure that facilitates analysis
A dataset reports crop yields by county and year:
| county | year | corn | wheat | soy |
|---|---|---|---|---|
| County A | 2020 | 120 | 80 | 45 |
| County B | 2020 | 95 | 70 | 50 |
Problem: Crop type is hidden in column names, not stored as data.
| county | year | crop | yield |
|---|---|---|---|
| County A | 2020 | corn | 120 |
| County A | 2020 | wheat | 80 |
| County A | 2020 | soy | 45 |
| County B | 2020 | corn | 95 |
| County B | 2020 | wheat | 70 |
| County B | 2020 | soy | 50 |
Same information. Different structure. Much easier to analyze.
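The reshaping above can be sketched in a few lines. This is a minimal illustration assuming Python with pandas, which the slides do not prescribe; the variable names `wide`, `tidy`, and `mean_yield_by_crop` are made up for the example.

```python
import pandas as pd

# Wide table from the slide: one column per crop.
wide = pd.DataFrame(
    {
        "county": ["County A", "County B"],
        "year": [2020, 2020],
        "corn": [120, 95],
        "wheat": [80, 70],
        "soy": [45, 50],
    }
)

# melt() moves the crop names out of the column headers and into a "crop"
# column, with the matching numbers collected in a "yield" column.
tidy = wide.melt(
    id_vars=["county", "year"],
    value_vars=["corn", "wheat", "soy"],
    var_name="crop",
    value_name="yield",
)

# With crop stored as data, a per-crop summary is a single groupby.
mean_yield_by_crop = tidy.groupby("crop")["yield"].mean()

print(tidy)
print(mean_yield_by_crop)
```

The rows come out grouped by crop rather than by county, so the order differs from the table above, but the six observations are the same.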
Look at these two data structures. Discuss:
According to the reader, which of the following is NOT a common cause of missing data?
A. Nonresponse or incomplete reporting
B. Measurement failure or data corruption
C. Suppression for confidentiality or privacy
D. Imputation performed by the analyst
Missing data are information, not just holes
A dataset of farm visits to agricultural extension services:
| farm_id | district | visits | reason |
|---|---|---|---|
| 1001 | North | 3 | crop_pest |
| 1002 | North | 2 | soil |
| 1003 | North | — | — |
| 1004 | South | 4 | soil |
| 1005 | South | — | — |
The farm might:
Processing choice: How you interpret missingness will shape your analysis.
If you treat missing as zero visits:
If you exclude those rows:
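To see how far the two choices can diverge, here is a rough sketch that applies both to the farm visits table and compares average visits per district. It assumes Python with pandas; the choice of "mean visits by district" as the summary is an illustration, not something the slides specify.

```python
import numpy as np
import pandas as pd

# Farm visits table from the slide; the dashes are recorded as missing values.
visits = pd.DataFrame(
    {
        "farm_id": [1001, 1002, 1003, 1004, 1005],
        "district": ["North", "North", "North", "South", "South"],
        "visits": [3, 2, np.nan, 4, np.nan],
        "reason": ["crop_pest", "soil", None, "soil", None],
    }
)

# Choice 1: interpret a missing count as "the farm made zero visits".
as_zero = visits.assign(visits=visits["visits"].fillna(0))
mean_as_zero = as_zero.groupby("district")["visits"].mean()

# Choice 2: exclude farms whose visit count is missing.
excluded = visits.dropna(subset=["visits"])
mean_excluded = excluded.groupby("district")["visits"].mean()

# Side-by-side comparison of the same summary under each choice.
comparison = pd.DataFrame(
    {"missing_as_zero": mean_as_zero, "rows_excluded": mean_excluded}
)
print(comparison)
```

In the South district the average is 2.0 visits under the first choice and 4.0 under the second, from identical raw data. Neither number is wrong by construction; which one is defensible depends on why the values are missing.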
According to the reader, which of the following is a common reason for aggregating data?
A. To increase data resolution (e.g., turn annual data into daily)
B. To match the resolution among datasets
C. To eliminate outliers from the dataset
D. To convert categorical variables into numerical ones
Daily sales data from a retail store:
| Date | Sales |
|---|---|
| 2024-01-01 | $4,200 |
| 2024-01-02 | $3,800 |
| 2024-01-03 | $4,100 |
| 2024-01-04 | $3,900 |
| … | … |
| 2024-12-31 | $5,300 |
Process: Aggregate to monthly averages
Clear seasonal pattern emerges.
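A minimal sketch of the monthly aggregation, assuming Python with pandas. It uses only the five rows visible in the table and does not reconstruct the elided days, so the January figure averages four days and December rests on a single day. The column names `month`, `mean_sales`, and `days_observed` are invented for the example.

```python
import pandas as pd

# Only the rows shown in the table; the elided days are deliberately left out.
daily = pd.DataFrame(
    {
        "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-12-31"],
        "Sales": ["$4,200", "$3,800", "$4,100", "$3,900", "$5,300"],
    }
)

# Step 1: parse dates and strip currency formatting so Sales is numeric.
daily["Date"] = pd.to_datetime(daily["Date"])
daily["Sales"] = (
    daily["Sales"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Step 2: aggregate to monthly averages, keeping the number of days behind
# each average so the reader can judge how much data supports it.
daily["month"] = daily["Date"].dt.strftime("%Y-%m")
monthly = daily.groupby("month")["Sales"].agg(
    mean_sales="mean", days_observed="count"
)
print(monthly)
```

Reporting `days_observed` next to each average is one way to keep visible what the aggregation hides: a monthly mean looks equally solid whether it summarizes thirty-one days or one.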
Process: Keep daily data (don’t aggregate)
Clear short-term spike visible.
Process: Different aggregation entirely
Last week’s lab had you work with drought and yield data.
Record what you did, why, and where to find it
Processing decisions should be:
Guided by your question — not by what’s easiest or most traditional
Made explicit — document assumptions and choices
Justified — explain why this processing aligns with your question
Transparent — make it so others can evaluate your decisions
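One lightweight way to record what you did, why, and where to find it is a small structured log kept next to the data. The sketch below is only an illustration in Python: the `ProcessingStep` class, its field names, and the script paths are all hypothetical, not a format required by this course.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProcessingStep:
    what: str   # the processing decision that was made
    why: str    # how the decision serves the research question
    where: str  # where the code for this step lives


# Example entries; the script paths are hypothetical placeholders.
log = [
    ProcessingStep(
        what="Reshaped crop yields from one column per crop to one row per crop",
        why="Crop type must be a variable to compare yields across crops",
        where="scripts/01_reshape.py",
    ),
    ProcessingStep(
        what="Excluded farms with missing visit counts",
        why="Missing values could not be confirmed as zero visits",
        where="scripts/02_missing.py",
    ),
]

# Serialize the log so it can be saved alongside the processed dataset.
log_json = json.dumps([asdict(step) for step in log], indent=2)
print(log_json)
```

A plain text file or a commented script serves the same purpose; what matters is that each decision carries its justification and a pointer to the code.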
Lab Session: Hands-on practice
(If time allows)
Mistake 1: Treating all missing values the same
Mistake 2: Removing outliers without understanding them
Mistake 3: Aggregating without thinking about what’s lost
Mistake 4: Not documenting why you made processing choices