Data Processing
Part 1

You’ve asked a question.

You’ve found data.

But the data don’t come ready to analyze.

Raw data are messy, incomplete, and often built for a purpose different from your question.

The Journey from Raw Data to Insight

Raw Data → Data Processing → Structured Data → Analysis → Insight

Data processing is the set of decisions that transform raw data into a dataset that can credibly answer your question.

Why This Matters

“Garbage In, Garbage Out”

Poor data preparation undermines analysis
No statistical method can compensate for fundamental problems in the data
Processing choices directly influence what conclusions are possible

Roadmap

Understand your data
Make data tidy
Make explicit processing choices (variables, aggregation, missingness, outliers)
Implement and document processing steps

Processing Is About Alignment

Your question determines:

What counts as an observation
- individual farms, counties, states
Which variables/measurements matter
- total yield, yield per acre, percent change
What level of temporal aggregation is appropriate
- daily, monthly, annual
How to handle missing or extreme values
- imputation, exclusion, flagging

What Makes Processing Tricky?

Data reflect how they were collected. They encode assumptions about what matters, who to measure, and what to record.
Processing involves judgment. Two analysts with the same raw data can make different (reasonable) processing choices.
Choices have consequences. Processing decisions can impact results without leaving obvious traces.

This means data processing deserves the same care, documentation, and transparency as your analysis.

Tidy Data

Which one of these does NOT make data tidy?

A. Put each variable in its own column
B. Put each observation in its own row
C. Store measurements for different observational units (e.g., individuals and households) together in the same table
D. Use a column to record the measurement type (e.g., crop)

What Is Tidy Data?

Tidy data: a structure that facilitates analysis

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

Example: Untidy Data

A dataset reports crop yields by county and year:

county	year	corn	wheat	soy
County A	2020	120	80	45
County B	2020	95	70	50

Problem: Crop type is hidden in column names, not stored as data.

The Tidy Version

county	year	crop	yield
County A	2020	corn	120
County A	2020	wheat	80
County A	2020	soy	45
County B	2020	corn	95
County B	2020	wheat	70
County B	2020	soy	50

Same information. Different structure. Much easier to analyze.
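The reshaping from the untidy table to the tidy one can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `to_long` and its parameters are hypothetical, chosen for this example, and the lab may use different tools.

```python
# Wide (untidy) rows: one column per crop, as in the example table.
wide_rows = [
    {"county": "County A", "year": 2020, "corn": 120, "wheat": 80, "soy": 45},
    {"county": "County B", "year": 2020, "corn": 95, "wheat": 70, "soy": 50},
]

ID_COLUMNS = ("county", "year")  # columns that identify an observation


def to_long(rows, id_columns, var_name, value_name):
    """Turn every non-identifier column into its own row."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col in id_columns:
                continue
            # The old column name becomes data (the crop type).
            long_rows.append({**ids, var_name: col, value_name: value})
    return long_rows


tidy_rows = to_long(wide_rows, ID_COLUMNS, var_name="crop", value_name="yield")
for r in tidy_rows:
    print(r["county"], r["year"], r["crop"], r["yield"], sep="\t")
```

Once crop is a column, questions such as "which crop has the highest yield in each county?" become a simple group-and-compare over rows.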

Pair up and discuss: 3 minutes

Look at these two data structures. Discuss:

What analytical questions would be easier or harder with each structure?
When might the untidy structure actually be useful?

Missing Data

According to the reader, which of the following is NOT a common cause of missing data?

A. Nonresponse or incomplete reporting
B. Measurement failure or data corruption
C. Suppression for confidentiality or privacy
D. Imputation performed by the analyst

Missing data

Missing data are information, not just holes

Mechanism: missing at random vs. informative (missingness related to the value that would have been recorded)
Typical causes: nonresponse, measurement limits, deliberate suppression, data-entry errors
Why it matters: changes sample composition, biases estimates

Example: Ag Extension Visits

A dataset of farm visits to agricultural extension services:

farm_id	district	visits	reason
1001	North	3	crop_pest
1002	North	2	soil
1003	North	—	—
1004	South	4	soil
1005	South	—	—
1006	South	—	—

What Does “Missing” Mean?

The farm might:

Have not used the service (zero visits)
Have failed to report (incomplete data)
Have been deliberately excluded (data suppression)
Have been newly registered (not yet in system)

Processing choice: How you interpret missingness will shape your analysis.

Scenario A: “Missing” = Did Not Use Service

If you treat missing as zero visits:

District North: Average 1.7 visits per farm
District South: Average 1.3 visits per farm
Conclusion: North uses extension more

Scenario B: “Missing” = Incomplete Data

If you exclude those rows:

District North: Average 2.5 visits per farm
District South: Average 4.0 visits per farm
Conclusion: South uses extension more

What Would You Do?

Regardless of your choice, document it clearly.
You might also run both scenarios and report how conclusions change.
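Running both scenarios side by side might look like the sketch below. It assumes the South district has a third farm with missing values (given the hypothetical ID 1006 here), because that is what the stated South averages of 1.3 and 4.0 imply. The function name `mean_visits` is also an assumption made for this example.

```python
from statistics import mean

# None marks a missing visit count.
farms = [
    (1001, "North", 3),
    (1002, "North", 2),
    (1003, "North", None),
    (1004, "South", 4),
    (1005, "South", None),
    (1006, "South", None),  # assumed row: implied by the stated South averages
]


def mean_visits(farms, missing_as_zero):
    """Average visits per farm by district under one reading of 'missing'."""
    by_district = {}
    for _farm_id, district, visits in farms:
        if visits is None:
            if not missing_as_zero:
                continue  # Scenario B: drop the row
            visits = 0  # Scenario A: missing means no visits
        by_district.setdefault(district, []).append(visits)
    return {d: round(mean(v), 1) for d, v in by_district.items()}


scenario_a = mean_visits(farms, missing_as_zero=True)
scenario_b = mean_visits(farms, missing_as_zero=False)
print("A (missing = 0):     ", scenario_a)  # {'North': 1.7, 'South': 1.3}
print("B (missing excluded):", scenario_b)  # North 2.5, South 4
```

Reporting both results makes it visible that the ranking of the districts depends entirely on the interpretation of missingness.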

Missing Data Checklist

quantify missingness
explore patterns
decide interpretation (zero vs. nonresponse)
choose handling (exclude, impute, flag)
document choice
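The first two checklist items and the "flag" option can be sketched as follows. The records reuse the extension-visit example, again assuming a third South farm with missing values (hypothetical ID 1006); `missingness_by_group` and the `visits_missing` field are names invented for this illustration.

```python
from collections import Counter

records = [
    {"farm_id": 1001, "district": "North", "visits": 3},
    {"farm_id": 1002, "district": "North", "visits": 2},
    {"farm_id": 1003, "district": "North", "visits": None},
    {"farm_id": 1004, "district": "South", "visits": 4},
    {"farm_id": 1005, "district": "South", "visits": None},
    {"farm_id": 1006, "district": "South", "visits": None},  # assumed row
]


def missingness_by_group(records, group_key, value_key):
    """Share of records in each group whose value is missing (None)."""
    totals, missing = Counter(), Counter()
    for rec in records:
        totals[rec[group_key]] += 1
        if rec[value_key] is None:
            missing[rec[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Quantify and explore: is missingness spread evenly across districts?
shares = missingness_by_group(records, "district", "visits")
for district, share in shares.items():
    print(f"{district}: {share:.0%} missing")

# Flag instead of deleting, so later steps can still choose how to handle it.
flagged = [{**rec, "visits_missing": rec["visits"] is None} for rec in records]
```

An uneven pattern across districts is a hint that missingness may be informative rather than random.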

Data Aggregation

According to the reader, which of the following is a common reason for aggregating data?

A. To increase data resolution (e.g., turn annual data into daily)
B. To match the resolution among datasets
C. To eliminate outliers from the dataset
D. To convert categorical variables into numerical ones

Aggregation

Aggregation combines observations (as a sum, mean, count, or median) across time, place, or group
Why do it: match data resolution to the question, reduce noise, reveal trends
Common choices: temporal (daily→monthly→annual), spatial (farm→county→state), unit-level (individual→household)
Pitfalls: losing meaningful variation, masking short-term effects

Example: Aggregation

Daily sales data from a retail store:

Date	Sales
2024-01-01	$4,200
2024-01-02	$3,800
2024-01-03	$4,100
2024-01-04	$3,900
…	…
2024-12-31	$5,300

Question 1: Do sales change with the season?

Process: Aggregate to monthly averages

January average: $3,950
February average: $4,100
…
December average: $5,200

Clear seasonal pattern emerges.
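Monthly aggregation of daily data can be sketched like this. Only the five rows printed in the example table are included, so the resulting averages will not match the full-year figures quoted above; `monthly_mean` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Only the rows shown in the example table; the elided days are not filled in.
daily_sales = [
    (date(2024, 1, 1), 4200),
    (date(2024, 1, 2), 3800),
    (date(2024, 1, 3), 4100),
    (date(2024, 1, 4), 3900),
    (date(2024, 12, 31), 5300),
]


def monthly_mean(daily):
    """Group daily amounts by (year, month) and average each group."""
    buckets = defaultdict(list)
    for day, amount in daily:
        buckets[(day.year, day.month)].append(amount)
    return {month: mean(values) for month, values in sorted(buckets.items())}


monthly = monthly_mean(daily_sales)
for (year, month), avg in monthly.items():
    print(f"{year}-{month:02d}: ${avg:,.0f}")
```

The raw daily list is left untouched, which keeps the option of re-aggregating at a different level later.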

Question 2: Do sales respond to a flash promotion on day 47?

Process: Keep daily data (don’t aggregate)

Day 46: $3,850
Day 47: $5,100 ← promotion
Day 48: $4,700
Day 49: $4,200

Clear short-term spike visible.

Question 3: Do sales reliably exceed $4,000?

Process: Different aggregation entirely

Days exceeding $4,000: 251/366
Proportion: 69%
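A threshold question aggregates to a proportion rather than a mean. The sketch below applies a hypothetical helper, `share_above`, to the five daily values printed in the table, then re-derives the quoted percentage from the count of 251 days. Note that 2024 is a leap year, so the denominator used here is 366; the rounded share is 69% either way.

```python
from datetime import date

# Only the daily values printed in the example table.
printed_sales = [4200, 3800, 4100, 3900, 5300]


def share_above(values, threshold):
    """Fraction of values strictly greater than the threshold."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(v > threshold for v in values) / len(values)


sample_share = share_above(printed_sales, 4000)
print(f"Printed rows above $4,000: {sample_share:.0%}")

# Re-derive the full-year proportion from the stated count of 251 days.
days_in_2024 = (date(2025, 1, 1) - date(2024, 1, 1)).days
slide_share = 251 / days_in_2024
print(f"Full year: 251/{days_in_2024} = {slide_share:.0%}")
```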

Aggregation Checklist

pick level driven by the question
preserve raw data
document aggregation approach
run sensitivity checks

Pair up and discuss: 3 minutes

Last week’s lab had you work with drought and yield data.

What aggregation level (time and space) would make sense for analyzing drought impacts on yield?
Is that the data you have? If not, what processing would you need to do?

Document Processing Choices

Record what you did, why, and where to find it

What: filters, recodes, aggregations, imputations, and outlier rules
Why: how each choice links to the research question and assumptions
How: scripts, functions, parameters/variables
Where: processing scripts, README
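One lightweight way to record the what, why, how, and where is a small machine-readable log kept next to the scripts. This is only a sketch: the `ProcessingStep` class, the example entries, and the script paths are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ProcessingStep:
    what: str   # the operation performed
    why: str    # link to the research question or assumption
    how: str    # rule or parameters used
    where: str  # script that implements it


steps = [
    ProcessingStep(
        what="Excluded farms with missing visit counts",
        why="Missing is read as incomplete reporting, not zero visits",
        how="keep rows where visits is not missing",
        where="scripts/clean_visits.py",  # hypothetical path
    ),
    ProcessingStep(
        what="Aggregated daily sales to monthly means",
        why="The question is about seasonal change, not daily swings",
        how="group by (year, month), take the mean",
        where="scripts/aggregate_sales.py",  # hypothetical path
    ),
]

# Serialize so the log can be saved with the data or pasted into a README.
log_text = json.dumps([asdict(step) for step in steps], indent=2)
print(log_text)
```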

The Takeaway

Processing decisions should be:

Guided by your question — not by what’s easiest or most traditional
Made explicit — document assumptions and choices
Justified — explain why this processing aligns with your question
Transparent — make it so others can evaluate your decisions

What’s Next?

Lab Session: Hands-on practice

Continue working with drought and yield data
Apply processing principles: aggregation, missingness, tidying
Document your processing choices

Bonus: Common Processing Mistakes

(If time allows)

Mistake 1: Treating all missing values the same

Some represent absence, others represent incomplete reporting
Handle them differently based on what they mean

Mistake 2: Removing outliers without understanding them

Extreme values may be exactly what you need to study
Investigate before you delete
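"Investigate before you delete" can be put into practice by flagging extreme values instead of dropping them. The sketch below uses the daily sales values printed earlier (days 1 to 4 and 46 to 49) and an illustrative rule: flag anything more than 3 median absolute deviations from the median. Both the rule and the cutoff of 3 are assumptions made for this example, not a recommendation from the lecture.

```python
from statistics import median

# Day of year -> sales, taken from the values printed in the aggregation example.
sales_by_day = {
    1: 4200, 2: 3800, 3: 4100, 4: 3900,
    46: 3850, 47: 5100, 48: 4700, 49: 4200,
}


def flag_extremes(values_by_key, k=3):
    """Return keys whose value lies more than k MADs from the median."""
    values = list(values_by_key.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)  # median absolute deviation
    return [key for key, v in values_by_key.items() if abs(v - center) > k * mad]


flagged_days = flag_extremes(sales_by_day)
print("Days to investigate:", flagged_days)  # [47]
```

Here the only flagged day is the promotion itself, which is exactly the observation Question 2 needs. Deleting it as an outlier would erase the effect under study.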

Mistake 3: Aggregating without thinking about what’s lost

Averaging hides variation
Make sure you’re not erasing meaningful differences

Mistake 4: Not documenting why you made processing choices

Your future self will forget
Collaborators need to understand
Reviewers need to evaluate credibility
