Data Processing
Part 1

You’ve asked a question.

You’ve found data.

But the data don’t come ready to analyze.

Raw data are messy, incomplete, and often built for a purpose different from your question.

The Journey from Raw Data to Insight

Raw Data → Data Processing → Structured Data → Analysis → Insight

Data processing is the set of decisions that transform raw data into a dataset that can credibly answer your question.

Why This Matters

“Garbage In, Garbage Out”

  • Poor data preparation undermines analysis
  • No statistical method can compensate for fundamental problems in the data
  • Processing choices directly influence what conclusions are possible

Roadmap

  • Understand your data
  • Make data tidy
  • Make explicit processing choices (variables, aggregation, missingness, outliers)
  • Implement and document processing steps

Processing Is About Alignment

Your question determines:

  • What counts as an observation
    • individual farms, counties, states
  • Which variables/measurements matter
    • total yield, yield per acre, percent change
  • What level of temporal aggregation is appropriate
    • daily, monthly, annual
  • How to handle missing or extreme values
    • imputation, exclusion, flagging

What Makes Processing Tricky?

  1. Data reflect how they were collected. They encode assumptions about what matters, who to measure, and what to record.

  2. Processing involves judgment. Two analysts with the same raw data can make different (reasonable) processing choices.

  3. Choices have consequences. Processing decisions can impact results without leaving obvious traces.

This means data processing deserves the same care, documentation, and transparency as your analysis.

Tidy Data

Which one of these does NOT make data tidy?

A. Put each variable in its own column
B. Put each observation in its own row
C. Store measurements for different observational units (e.g., individuals and households) together in the same table
D. Use a column to record the measurement type (e.g., crop)

What Is Tidy Data?

Tidy data: structure that facilitates analysis

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

Example: Untidy Data

A dataset reports crop yields by county and year:

county    year  corn  wheat  soy
County A  2020   120     80   45
County B  2020    95     70   50

Problem: Crop type is hidden in column names, not stored as data.

The Tidy Version

county    year  crop   yield
County A  2020  corn     120
County A  2020  wheat     80
County A  2020  soy       45
County B  2020  corn      95
County B  2020  wheat     70
County B  2020  soy       50

Same information. Different structure. Much easier to analyze.
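A minimal sketch of the reshaping step, using only the Python standard library. The helper name `to_long` and its parameters are illustrative assumptions, not something the slides prescribe; the data rows are the two counties from the example.

```python
# Wide (untidy) rows copied from the example table.
wide_rows = [
    {"county": "County A", "year": 2020, "corn": 120, "wheat": 80, "soy": 45},
    {"county": "County B", "year": 2020, "corn": 95, "wheat": 70, "soy": 50},
]


def to_long(rows, id_cols, var_name, value_name):
    """Turn every non-id column into its own (var_name, value_name) row."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_cols}
        for col, value in row.items():
            if col in id_cols:
                continue
            # The old column name becomes data in the var_name column.
            long_rows.append({**ids, var_name: col, value_name: value})
    return long_rows


tidy_rows = to_long(
    wide_rows, id_cols=("county", "year"), var_name="crop", value_name="yield"
)

for r in tidy_rows:
    print(r["county"], r["year"], r["crop"], r["yield"])
```

If pandas is available, `DataFrame.melt` with `id_vars`, `var_name`, and `value_name` performs the same reshaping.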

Pair up and discuss: 3 minutes

Look at these two data structures. Discuss:

  1. What analytical questions would be easier or harder with each structure?
  2. When might the untidy structure actually be useful?

Missing Data

According to the reader, which of the following is NOT a common cause of missing data?

A. Nonresponse or incomplete reporting
B. Measurement failure or data corruption
C. Suppression for confidentiality or privacy
D. Imputation performed by the analyst

Missing data

Missing data are information, not just holes

  • Why values are missing: at random vs. informative (missingness related to the value itself)
  • Typical causes: nonresponse, measurement limits, deliberate suppression, data-entry errors
  • Why it matters: changes sample composition, biases estimates

Example: Ag Extension Visits

A dataset of farm visits to agricultural extension services:

farm_id  district  visits  reason
1001     North     3       crop_pest
1002     North     2       soil
1003     North
1004     South     4       soil
1005     South

What Does “Missing” Mean?

The farm might:

  • Have not used the service (zero visits)
  • Have failed to report (incomplete data)
  • Have been deliberately excluded (data suppression)
  • Have been newly registered (not yet in system)

Processing choice: How you interpret missingness will shape your analysis.

Scenario A: “Missing” = Did Not Use Service

If you treat missing as zero visits:

  • District North: Average 1.7 visits per farm
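  • District South: Average 1.3 visits per farm (assumes three South farms, two with missing counts; the two South rows shown in the table would give 2.0)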
  • Conclusion: North uses extension more

Scenario B: “Missing” = Incomplete Data

If you exclude those rows:

  • District North: Average 2.5 visits per farm
  • District South: Average 4.0 visits per farm
  • Conclusion: South uses extension more

What Would You Do?

  • Regardless of your choice, document it clearly.
  • You might also run both scenarios and report how conclusions change.
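One way to run both scenarios side by side, as a rough sketch. The function name `mean_visits` is an assumption. Rows 1001 to 1005 come from the example table; row 1006 is a hypothetical extra South farm with a missing count, added only because the 1.3 average quoted for South in Scenario A cannot be reproduced from the five rows shown.

```python
from collections import defaultdict

# (farm_id, district, visits); None marks a missing visit count.
farms = [
    (1001, "North", 3),
    (1002, "North", 2),
    (1003, "North", None),
    (1004, "South", 4),
    (1005, "South", None),
    (1006, "South", None),  # hypothetical row, NOT in the example table
]


def mean_visits(rows, missing_as_zero):
    """Average visits per farm by district under one reading of 'missing'."""
    by_district = defaultdict(list)
    for _farm_id, district, visits in rows:
        if visits is None:
            if not missing_as_zero:
                continue  # Scenario B: treat as incomplete data and exclude
            visits = 0  # Scenario A: treat as "did not use the service"
        by_district[district].append(visits)
    return {d: round(sum(v) / len(v), 1) for d, v in by_district.items()}


scenario_a = mean_visits(farms, missing_as_zero=True)
scenario_b = mean_visits(farms, missing_as_zero=False)

print("A (missing = zero):    ", scenario_a)
print("B (missing = excluded):", scenario_b)
```

Keeping the interpretation as an explicit argument makes the choice visible in the script and easy to report both ways.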

Missing Data Checklist

  • quantify missingness
  • explore patterns
  • decide interpretation (zero vs. nonresponse)
  • choose handling (exclude, impute, flag)
  • document choice
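The first and fourth checklist items might look like this in practice: a sketch that counts the share of missing visit counts per district and adds a flag column instead of overwriting anything. The names `missing_share_by` and `visits_missing` are illustrative assumptions; the rows are the five shown in the extension-visits example.

```python
from collections import Counter

# Rows from the extension-visits example; None marks a missing value.
records = [
    {"farm_id": 1001, "district": "North", "visits": 3, "reason": "crop_pest"},
    {"farm_id": 1002, "district": "North", "visits": 2, "reason": "soil"},
    {"farm_id": 1003, "district": "North", "visits": None, "reason": None},
    {"farm_id": 1004, "district": "South", "visits": 4, "reason": "soil"},
    {"farm_id": 1005, "district": "South", "visits": None, "reason": None},
]


def missing_share_by(rows, group_col, value_col):
    """Fraction of rows in each group whose value_col is missing."""
    total = Counter()
    missing = Counter()
    for row in rows:
        total[row[group_col]] += 1
        if row[value_col] is None:
            missing[row[group_col]] += 1
    return {group: missing[group] / total[group] for group in total}


share = missing_share_by(records, "district", "visits")
for district, frac in share.items():
    print(f"{district}: {frac:.0%} of visit counts missing")

# Flag rather than delete or fill: the raw value stays untouched,
# and later steps can decide how to treat flagged rows.
flagged = [{**row, "visits_missing": row["visits"] is None} for row in records]
```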

Data Aggregation

According to the reader, which of the following is a common reason for aggregating data?

A. To increase data resolution (e.g., turn annual data into daily)
B. To match the resolution among datasets
C. To eliminate outliers from the dataset
D. To convert categorical variables into numerical ones

Aggregation

  • Aggregation is combining observations (sum/mean/count/median) by time/place/group
  • Why do it: match data resolution to the question, reduce noise, reveal trends
  • Common choices: temporal (daily→monthly→annual), spatial (farm→county→state), unit-level (individual→household)
  • Pitfalls: losing meaningful variation, masking short-term effects

Example: Aggregation

Daily sales data from a retail store:

Date        Sales
2024-01-01  $4,200
2024-01-02  $3,800
2024-01-03  $4,100
2024-01-04  $3,900
...         ...
2024-12-31  $5,300

Question 1: Do sales change with the season?

Process: Aggregate to monthly averages

  • January average: $3,950
  • February average: $4,100
  • December average: $5,200

Clear seasonal pattern emerges.
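A sketch of the monthly aggregation, grouping daily rows by (year, month) and averaging. It uses only the five daily rows shown in the table, so its output will not match the monthly averages on the slide, which presumably come from the full year. The function name `monthly_mean` is an assumption.

```python
from collections import defaultdict
from datetime import date

# Only the daily rows shown in the example table (not the full year).
daily_sales = [
    (date(2024, 1, 1), 4200),
    (date(2024, 1, 2), 3800),
    (date(2024, 1, 3), 4100),
    (date(2024, 1, 4), 3900),
    (date(2024, 12, 31), 5300),
]


def monthly_mean(rows):
    """Average daily sales within each (year, month) bucket."""
    buckets = defaultdict(list)
    for day, amount in rows:
        buckets[(day.year, day.month)].append(amount)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


monthly = monthly_mean(daily_sales)
for (year, month), avg in monthly.items():
    print(f"{year}-{month:02d}: ${avg:,.0f}")
```

The raw `daily_sales` list is left unchanged, so the daily detail stays available for questions that need it.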

Question 2: Do sales respond to a flash promotion on day 47?

Process: Keep daily data (don’t aggregate)

  • Day 46: $3,850
  • Day 47: $5,100 ← promotion
  • Day 48: $4,700
  • Day 49: $4,200

Clear short-term spike visible.
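A small sketch of working at daily resolution around the promotion. Two assumptions are made here: "day 47" is read as the 47th day of 2024, and the day before the promotion is used as a crude single-day baseline. The sales figures are the four quoted above.

```python
from datetime import date, timedelta

# Daily figures quoted on the slide, keyed by day of year.
sales_by_day = {46: 3850, 47: 5100, 48: 4700, 49: 4200}
promo_day = 47

# Calendar date of the promotion, assuming day 1 is 2024-01-01.
promo_date = date(2024, 1, 1) + timedelta(days=promo_day - 1)

# Single-day baseline: sales on the day before the promotion (an assumption).
baseline = sales_by_day[promo_day - 1]

# Relative change versus the baseline for the promotion day and the days after.
lift = {
    day: (amount - baseline) / baseline
    for day, amount in sales_by_day.items()
    if day >= promo_day
}

print("Promotion date:", promo_date.isoformat())
for day, change in sorted(lift.items()):
    print(f"Day {day}: {change:+.1%} vs. day {promo_day - 1}")
```

A monthly average would blend these days with the rest of the month, which is why this question calls for the daily series.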

Question 3: Do sales reliably exceed $4,000?

Process: Different aggregation entirely

  • Days exceeding $4,000: 251/366 (2024 is a leap year)
  • Proportion: 69%
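This kind of aggregation is a count over a threshold rather than an average. A sketch, with `share_above` as an assumed helper name; it is applied here only to the five daily values shown in the table, so it does not reproduce the full-year figure.

```python
def share_above(values, threshold):
    """Return (days above threshold, total days, proportion above)."""
    values = list(values)
    if not values:
        raise ValueError("no values to summarize")
    hits = sum(1 for v in values if v > threshold)
    return hits, len(values), hits / len(values)


# The five daily sales values shown in the example table.
shown_sales = [4200, 3800, 4100, 3900, 5300]

hits, n_days, share = share_above(shown_sales, threshold=4000)
print(f"Days exceeding $4,000: {hits}/{n_days} ({share:.0%})")
```

Note the strict comparison: a day with exactly $4,000 does not count as exceeding it, which is itself a processing choice worth writing down.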

Aggregation Checklist

  • pick level driven by the question
  • preserve raw data
  • document aggregation approach
  • run sensitivity checks

Pair up and discuss: 3 minutes

Last week’s lab had you work with drought and yield data.

  1. What aggregation level (time and space) would make sense for analyzing drought impacts on yield?
  2. Is that the data you have? If not, what processing would you need to do?

Document Processing Choices

Record what you did, why, how, and where to find it

  • What: filters, recodes, aggregations, imputations, and outlier rules
  • Why: how each choice links to the research question and assumptions
  • How: scripts, functions, parameters/variables
  • Where: processing scripts, README
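One possible way to keep such a record is a small machine-readable log written alongside the scripts. This is only a sketch: the function `record_step`, the field names, and the file paths are all hypothetical, and the two entries are illustrative choices based on the examples in these slides.

```python
import json

processing_log = []


def record_step(what, why, how, where):
    """Append one processing decision to the log and return the entry."""
    entry = {"what": what, "why": why, "how": how, "where": where}
    processing_log.append(entry)
    return entry


record_step(
    what="Exclude farms with a missing visit count",
    why="Assumed a blank count means the farm did not report, not zero visits",
    how="Drop rows where visits is missing; no imputation",
    where="scripts/01_clean_visits.py (hypothetical path)",
)
record_step(
    what="Aggregate daily sales to monthly means",
    why="The question is about seasonal change, not day-to-day variation",
    how="Group by (year, month), take the mean of daily sales",
    where="scripts/02_aggregate_sales.py (hypothetical path)",
)

# Serialize so the log can be saved next to the README or processed data.
log_text = json.dumps(processing_log, indent=2)
print(log_text)
```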

The Takeaway

Processing decisions should be:

  1. Guided by your question — not by what’s easiest or most traditional

  2. Made explicit — document assumptions and choices

  3. Justified — explain why this processing aligns with your question

  4. Transparent — make it so others can evaluate your decisions

What’s Next?

Lab Session: Hands-on practice

  • Continue working with drought and yield data
  • Apply processing principles: aggregation, missingness, tidying
  • Document your processing choices

Bonus: Common Processing Mistakes

(If time allows)

Mistake 1: Treating all missing values the same

  • Some represent absence, others represent incomplete reporting
  • Handle them differently based on what they mean

Mistake 2: Removing outliers without understanding them

  • Extreme values may be exactly what you need to study
  • Investigate before you delete
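A sketch of "investigate before you delete": flag extreme values for review and leave every row in place. The slides do not name an outlier rule, so the 1.5 × IQR fence used here is an assumption (one common convention), and the yield values are made up for illustration.

```python
from statistics import quantiles

# Hypothetical yields; the 20 stands in for something like a drought year.
yields = [118, 120, 95, 102, 110, 125, 20]


def flag_outliers(values, k=1.5):
    """Pair each value with True if it falls outside the k * IQR fences."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(v, v < low or v > high) for v in values]


flagged_yields = flag_outliers(yields)

# Nothing is removed; flagged values are listed for a closer look.
to_review = [value for value, is_outlier in flagged_yields if is_outlier]
print("Values to investigate:", to_review)
```

Whether a flagged value is an error or the most interesting observation in the dataset is a judgment the rule cannot make.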

Mistake 3: Aggregating without thinking about what’s lost

  • Averaging hides variation
  • Make sure you’re not erasing meaningful differences

Mistake 4: Not documenting why you made processing choices

  • Your future self will forget
  • Collaborators need to understand
  • Reviewers need to evaluate credibility