Exploring and auditing new data with RMarkdown (in-class tutorial)

Hannes Datta

Before we get started

  • Group work will get started - ensure your teams are properly registered
  • Today's content builds on last week's content (R Bootcamp)
  • Any pending installation issues? Get in touch with Roshini during the coaching session.

Structure of this tutorial

  • Introduction

    • A bit of data preparation theory (“the bigger picture”)
    • How to write source code (e.g., R) files
  • Part 1: RMarkdown for exploring new data

  • Part 2: ggplot2 for plotting [updated this year]

  • After class

Data preparation theory (I) - Research workflow

Main goals in each step of the research workflow

Data preparation theory (II): Components of source code

  • 1) Initializing the script/setup
    • Loading packages
    • Making connections to databases
    • Any other global (“main”) parameters (e.g., sample size, prototype flag, etc.)
  • 2) Input
    • Read data (e.g., structured/unstructured data, remote/local locations, files or databases)
  • 3) Transformation
    • E.g., filtering, aggregation, merging, transformation, deduplication (more later)
  • 4) Output
    • Store (intermediate) data, figures (e.g., auditing, final)
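
The four components can be laid out as a skeleton script. This is only a minimal sketch, not the course template: the toy data frame, the min_visits parameter, and the output file name are invented for illustration.

```r
# 1) Setup: load packages and define global parameters
library(dplyr)
min_visits <- 10  # hypothetical global parameter

# 2) Input: a toy data frame stands in for reading a file or database
raw <- data.frame(
  store  = c("A", "A", "B", "B"),
  visits = c(12, 8, 30, 5)
)

# 3) Transformation: filter rows, then aggregate to the store level
store_totals <- raw %>%
  filter(visits >= min_visits) %>%
  group_by(store) %>%
  summarize(total_visits = sum(visits), .groups = "drop")

# 4) Output: store the (intermediate) result in a temporary folder
out_file <- file.path(tempdir(), "store_totals.csv")
write.csv(store_totals, out_file, row.names = FALSE)
```

Keeping the sections in this order makes it easier to see where a script reads from and where it writes to.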

Data preparation theory (III): A research pipeline

Research pipeline

Today's session

  • Today, we're focusing on step 1 - exploring and plotting data
  • …but, some of the source code will end up in your data preparation for your project.

Today's focus: Step 1 of research workflow

Part 1: RMarkdown

Exploring data using RMarkdown

  • Recall, there are multiple ways to code in R
    • RStudio
    • Your favorite script editor (e.g., VS Code) + running from the terminal
  • But – in all of these cases, the output will be “ugly” (i.e., just some text)
  • RMarkdown is a way to format output nicely – and ideally suited for exploring data
    • mix of code and text (formatted in markdown) - you already know the concept from Jupyter Notebook
    • possibility to compile into HTML, PDF, Word, and many more
  • great for prototyping code, but bad for “production” (this is also the case for Python's Jupyter Notebook)

DO: Creating your first RMarkdown

In today's session, we create our own RMarkdown document.

  1. Select File → New File → R Markdown.
  2. Choose HTML as output, and pick a good file name (e.g., “tutorial”)
  3. Save your .Rmd file.
  4. Convert (“knit”/“compile”) the document to HTML by clicking on “knit to HTML”.

We will continue when you're done.

RMarkdown basics

  • We already know markdown syntax, right?
    • If not: find a cheat sheet online right now or use this one
  • Has everyone compiled the document? Always try this first!
  • Navigating an RMarkdown document

    • Code cells & running code (click, ctrl/cmd+enter)
    • Using inline code
    • Options to run (or not run) code cells when compiling documents
    • that 'little wheel' (upper right)
    • echo, include parameters
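
Besides setting options per cell via the wheel, defaults can be set once for the whole document. A minimal sketch, assuming the knitr package is installed (RMarkdown relies on it); the chosen values are only an illustration.

```r
# Typically placed in the first ("setup") code cell of an .Rmd file.
# echo    = show the source code in the compiled document?
# include = show the cell (code and output) at all? The code still runs.
knitr::opts_chunk$set(echo = FALSE, include = TRUE)

# Inspect the default that now applies to every subsequent cell
knitr::opts_chunk$get("echo")

# Inline code goes into the text (not into a code cell), wrapped in
# single backticks and starting with the letter r, e.g. `r 1 + 1`.
```

Options set on an individual cell override these document-wide defaults.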

Let's get started with exploring and auditing data

We'll make use of the Google Mobility Datasets from The Netherlands.

1) In the tutorial on the site, you can either download the data from my server or get the data directly from Google (try downloading yourself). If downloading yourself, watch out for “conversion” issues.

2) In class (to smoothen things…), please just run the code snippet below:

library(tidyverse)
mobility <- read_csv('https://raw.githubusercontent.com/hannesdatta/course-dprep/refs/heads/master/content/docs/modules/week2/tutorial/2020_NL_Region_Mobility_Report.csv')

3) Exploring the data: use head(), tail(), and View()/glimpse()

  • Think of yourself as a data detective – you just arrived at the (crime) scene, and it's time to scan for clues.
  • Turn on your critical thinking mode: Do you understand what's in the data? Are there weird patterns? What questions pop into your head?
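
A first scan could look as follows. This sketch uses a small made-up data frame (toy) instead of the real mobility data, so it runs without a download.

```r
library(dplyr)

# Toy stand-in for the mobility data (values are made up)
toy <- data.frame(
  date  = as.Date("2020-02-15") + 0:9,
  parks = c(5, 8, NA, 12, 20, 18, 3, -4, 0, 7)
)

head(toy, 3)   # first three rows
tail(toy, 3)   # last three rows
glimpse(toy)   # column names, types, and first values
dim(toy)       # number of rows and columns
```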

Verifying data types

  • glimpse() gives you an overview of the data types and first values
  • alternatively, you can use class(mobility$date) to inspect classes (e.g., dates, numerics, characters)
  • Recall: you can only calculate with dates when they are stored as actual dates! (same with numbers: calculations don't work on characters!)
  • Also important to spot wrong encodings

Build an understanding about the unit of observation/analysis

  • What is it? → The level at which data is collected (e.g., individuals, groups, products, events).
  • Why does it matter? → It shapes how you summarize and interpret data—wrong levels lead to misleading results.
  • Example:
    • A store tracks purchases → Unit = transaction
    • To study customer behavior, we must aggregate by customer (or customer-month, etc.)


  • Key Questions:
    • 👉 What is my current unit of analysis? [DO]
    • 👉 Is this the right level for my research question?
    • 👉 Do I need to/How could I aggregate or transform my data? [DO]
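
The store example can be sketched in a few lines. The transactions data frame and its column names are made up; the point is how aggregation changes the unit of analysis from transaction to customer.

```r
library(dplyr)

# Toy purchase data: one row per transaction (made-up values)
transactions <- data.frame(
  transaction_id = 1:5,
  customer       = c("anna", "anna", "ben", "ben", "ben"),
  amount         = c(10, 20, 5, 5, 15)
)

# Aggregate to the customer level: one row per customer
customers <- transactions %>%
  group_by(customer) %>%
  summarize(
    n_purchases = n(),
    total_spent = sum(amount),
    .groups = "drop"
  )

# Check that `customer` now uniquely identifies the rows
nrow(customers) == n_distinct(customers$customer)
```

The same uniqueness check is a quick way to find out the current unit of analysis of any data set.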

Creating summary statistics

  • We can generate summary statistics using the summary() command.
    • do values make sense?
    • are there any missing values?
  • We can also zoom in on specific columns and create summary stats (e.g., on retail_and_recreation_percent_change_from_baseline) - what does it mean?
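
A short sketch of both uses of summary(), run on a made-up column rather than the real data:

```r
# Toy stand-in for one mobility column (made-up values)
toy <- data.frame(
  retail_and_recreation = c(2, -5, -40, NA, -62, 3)
)

summary(toy)                                    # all columns, incl. NA count
summary(toy$retail_and_recreation)              # zoom in on a single column
range(toy$retail_and_recreation, na.rm = TRUE)  # smallest and largest value
```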

DO: Cleaning and enriching data for exploration

  • Raw data is typically messy and often not at the correct unit of analysis (e.g., city level).
  • Also, we may still want to select columns, or rename them.


Please use the dplyr commands select() (column selection), and rename().

  1. Drop these columns
    • country_region_code
    • metro_area
    • iso_3166_2_code
    • census_fips_code
  2. Rename
    • sub_region_1 to province, and
    • sub_region_2 to city

Solutions

# dropping (each column to drop needs its own minus sign)
mobility <- mobility %>% select(-country_region_code, -metro_area, -iso_3166_2_code, -census_fips_code)
# renaming
mobility <- mobility %>% 
  rename(province = sub_region_1,
        city = sub_region_2,
        retail_and_recreation = retail_and_recreation_percent_change_from_baseline,
        grocery_and_pharmacy = grocery_and_pharmacy_percent_change_from_baseline,
        parks = parks_percent_change_from_baseline,
        transit_stations = transit_stations_percent_change_from_baseline,
        workplaces = workplaces_percent_change_from_baseline,
        residential = residential_percent_change_from_baseline)

Understanding how the data is recorded

We want to understand how the data is coded; let's take a look at unique combinations of country_region, province, and city.

overview <- mobility %>% group_by(country_region, province, city) %>% count()

DO: Run the snippet and explore it using View().

DO: Selecting subsets

  • We want to create two new data sets
    • province only contains the observations for a province in NL (city must be missing, but province not)
    • country only contains observations for NL (city and province must be missing)
  • Use the dplyr verb filter(), the pipe operator (%>%), and the “NA” detector function is.na()

Task: Create the two data sets: province and country.

Solution

Please run this snippet to be able to continue the tutorial if you have not solved it yourselves.

country  <- mobility %>% 
  filter(is.na(city) & is.na(province))

province <- mobility %>% 
  filter(is.na(city) & !is.na(province)) 
# Note: the exclamation mark (!) negates a condition, turning TRUE into FALSE (and the other way around)

Let's talk about... Missing values (`NA`)

  • Why care about missing values?
    • Missing data can bias results and affect analysis
    • Some methods ignore missing values, others break if not handled
    • Solutions depend on why values are missing
  • Ways to explore missing values
    • summary()
    • sum(is.na(columnname))


  • DO: Explore missing values in the country dataset using summary()
    • Which columns?
    • Are they “actual” missings?
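
The counting functions can be tried on a made-up data frame first. The column names below are hypothetical.

```r
# Toy data with missing values in one column (made-up values)
toy <- data.frame(
  parks   = c(5, NA, 12, NA),
  transit = c(1, 2, 3, 4)
)

sum(is.na(toy$parks))    # number of missing values in one column
colSums(is.na(toy))      # number of missing values per column
mean(is.na(toy$parks))   # share of missing values in one column
```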

DO: Replacing missing values

  • mutate: create new or edit existing column
  • ifelse(X, A, B) does action A or B, depending on X
  • is.na(col) checks for missing values in a column
# toy example, does not run (because this particular data doesn't exist)
new_data <- old_data %>% 
  mutate(new_variable = ifelse(is.na(old_variable),
                               mean(old_variable, na.rm = TRUE),
                               old_variable))


Please replace missing values by their mean for retail_and_recreation and other “place” columns in the province dataset.

Solution

province_updated <- province %>%
  mutate(
    retail_and_recreation = ifelse(
      is.na(retail_and_recreation),
      mean(retail_and_recreation, na.rm = TRUE),
      retail_and_recreation
    ),
    grocery_and_pharmacy = ifelse(
      is.na(grocery_and_pharmacy),
      mean(grocery_and_pharmacy, na.rm = TRUE),
      grocery_and_pharmacy
    ),
    parks = ifelse(
      is.na(parks),
      mean(parks, na.rm = TRUE),
      parks
    ),
    transit_stations = ifelse(
      is.na(transit_stations),
      mean(transit_stations, na.rm = TRUE),
      transit_stations
    ),
    workplaces = ifelse(
      is.na(workplaces),
      mean(workplaces, na.rm = TRUE),
      workplaces
    ),
    residential = ifelse(
      is.na(residential),
      mean(residential, na.rm = TRUE),
      residential
    )
  )
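
The solution repeats the same rule six times. A more compact variant, shown here on made-up data, applies it to several columns at once with dplyr's across(); this is an alternative sketch, not the solution used in class.

```r
library(dplyr)

# Toy stand-in for the province data (made-up values)
toy <- data.frame(
  parks      = c(10, NA, 20),
  workplaces = c(NA, -30, -10)
)

# Apply the same "replace NA by the column mean" rule to several columns
toy_updated <- toy %>%
  mutate(across(
    c(parks, workplaces),
    ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)
  ))
```

As in the solution, the mean is taken over all rows; adding group_by(province) before mutate() would compute it per province instead.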

Part 2: `ggplot2`

Visualization using `ggplot2`

  • Why?

    • great plotting tool: ggplot2 (better than built-in functionality)
    • works seamlessly with dplyr pipelines
    • can create really complex charts, real quick, plus export them to PDF/PNG, etc.
  • Steps

    • define dataset: ggplot(data = your_data) or data %>% ggplot()
    • set aesthetics (aes) – what goes where?: aes(x= ..., y = ...)
    • choose geometry/chart type (geom_): geom_line(), geom_point(), geom_col() (bar chart), geom_histogram() (histogram)
    • customize (e.g., titles, colors, themes, labels, etc.) - I use ChatGPT/Google to find tips

DO: First chart

library(ggplot2)
ggplot(data = country) + 
  geom_line(aes(x = date, y = retail_and_recreation), 
    color = "black") +
  labs(title = "Changes in Visits to Retail Areas Over Time",
       x = "Date",
       y = "% Change from Baseline") +
  theme_minimal()

Your tasks:

  1. Change the color of lines to blue
  2. Try out another theme: theme_light() instead of theme_minimal()
  3. Please change the y axis from retail_and_recreation to parks + change title of plot.

Useful to know: saving plots

We can save a plot to a file by calling this command immediately after creating it.

ggsave()
  • Extensions:
    • ggsave(filename = 'plot.pdf') or ggsave(filename = 'plot.png') - pick any format
    • can specify width, height, dpi, bg (background), etc.
    • create directories with dir.create() if needed
    • wipe files using unlink('directory_name/*.*') (be careful)
    • You can use loops as well! It can easily create hundreds of plots!
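
The loop idea can be sketched as follows. The toy data, the plots folder inside a temporary directory, and the file names are assumptions for illustration; PDF is used because it needs no extra graphics libraries.

```r
library(ggplot2)

# Toy data (made-up values)
toy <- data.frame(
  date   = as.Date("2020-03-01") + 0:4,
  retail = c(0, -10, -30, -45, -50),
  parks  = c(5, 10, 20, 15, 30)
)

# Create an output folder inside the temporary directory
out_dir <- file.path(tempdir(), "plots")
dir.create(out_dir, showWarnings = FALSE)

# One plot per column, each saved under its own file name
for (column in c("retail", "parks")) {
  p <- ggplot(toy) +
    geom_line(aes(x = date, y = .data[[column]])) +
    labs(title = column, x = "Date", y = "% Change from Baseline") +
    theme_minimal()

  ggsave(filename = file.path(out_dir, paste0(column, ".pdf")),
         plot = p, width = 6, height = 4)
}

list.files(out_dir)
```

Passing the plot explicitly via plot = p avoids relying on "the last plot created", which is easy to get wrong inside a loop.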

Extension 1: Adding more data points to a chart (manually)

library(ggplot2)
ggplot(data = country) + 
  geom_line(aes(x = date, 
                y = retail_and_recreation, 
                color = 'Retail')) +
  geom_line(aes(x = date, 
                y = parks, 
                color = 'Parks')) +
  scale_color_manual(values = c("Retail" = "red", 
                                "Parks" = "blue")) +
  labs(title = "Time spent at retail vs. parks",
       x = "Date",
       y = "% Change from Baseline",
       color = "Locations") +
  theme_minimal()

DO: Please add a third line to the chart: time spent at work (color: black).

Solution

library(ggplot2)
ggplot(data = country) + 
  geom_line(aes(x = date, 
                y = retail_and_recreation, 
                color = 'Retail')) +
  geom_line(aes(x = date, 
                y = parks, 
                color = 'Parks')) +
  geom_line(aes(x = date, 
                y = workplaces, 
                color = 'Work')) +
  scale_color_manual(values = c("Retail" = "red", 
                                "Parks" = "blue",
                                "Work" = "black")) +
  labs(title = "Time spent at different locations",
       x = "Date",
       y = "% Change from Baseline",
       color = "Locations") +
  theme_minimal()

Extension 2: Create grouping via facet wrap

Sometimes it makes sense to group our plots “automatically”, such as by province!

province_updated %>% ggplot() + 
  geom_line(aes(x = date, 
                y = retail_and_recreation, 
                color = 'Retail')) +
  geom_line(aes(x = date, 
                y = parks, 
                color = 'Parks')) +
  facet_wrap(~province) + # THIS LINE IS NEW 
  scale_color_manual(values = c("Retail" = "red", 
                                "Parks" = "blue")) +
  labs(title = "Time spent at retail vs. parks",
       x = "Date",
       y = "% Change from Baseline",
       color = "Locations") +
  theme_minimal()

Extension 3: Grouping via a column in the data set (I)

To work on this, let's first run this cell.

province_long <- province_updated %>%
  select(
    province, date, 
    retail_and_recreation, grocery_and_pharmacy, 
    parks, transit_stations, workplaces, residential
  ) %>%
  pivot_longer(
    cols = c(
      retail_and_recreation, grocery_and_pharmacy, 
      parks, transit_stations, workplaces, residential
    ),
    names_to = "place_category",
    values_to = "mobility_change"
  )

This converts your “wide” data into a “long” data set (you will learn more about this in week 4).
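
To see what the reshaping does, it helps to try it on a tiny made-up table first (a sketch with hypothetical values):

```r
library(tidyr)

# Wide: one row per date, one column per place (made-up values)
wide <- data.frame(
  date   = as.Date(c("2020-03-01", "2020-03-02")),
  retail = c(-10, -30),
  parks  = c(5, 20)
)

# Long: one row per date-place combination
long <- pivot_longer(
  wide,
  cols      = c(retail, parks),
  names_to  = "place_category",
  values_to = "mobility_change"
)

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
```

The former column names end up as values of place_category, which is what lets ggplot2 color or facet by place.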

Extension 3: Grouping via a column in the data set (II)

province_long %>% filter(province == 'North Brabant') %>%
  ggplot(aes(x=date, 
             y=mobility_change, 
             color=place_category)) +
  geom_line()  + 
  ggtitle('Mobility in North Brabant') + 
  theme_minimal()

The plot now shows all locations without explicitly naming them in the code as before.

Extension 3: Grouping via a column in the data set (III)

We can also add a facet wrap, as in extension 2.

province_long %>% # no filter this time - show all provinces!
  ggplot(aes(x=date, 
             y=mobility_change, 
             color=place_category)) +
  geom_line()  + 
  facet_wrap(~province)+
  theme_minimal()

The plot now shows all locations per province.

Wrapping everything up in a "clean" script

Think of this as something you do at the end of every work cycle.

Clean up your script(s)!

  1. Setup done/packages and other prerequisites loaded?
  2. Adhered to input - transformation - output?
    • input: loading data
    • transformation: filtering, grouping and summarizing, creating new variables, etc.; mostly leads to a “different primary key than the input data” (→ unit of analysis!)
    • output: e.g., saving plots, saving a new data set
  3. Cleaned up any remaining garbage?
  4. File callable from the command prompt?
    • We will need this for automation in subsequent tutorials
    • Try from the command prompt/terminal:
    • Rscript your_r_script.R or
    • R --vanilla < your_r_script.R

Concluding thoughts

Remember the workflow from the start of today's session?

Main goals in each step of the research workflow

We've covered step 1! The concepts and programming knowledge extend to the subsequent steps.

Next steps

  1. Work through RMarkdown & ggplot2 chapters in R for Social Scientists
  2. Work through this week's tutorial (not just “understand”, but actually “do”), see here
  3. Coaching session: Starting up your team projects
    • Finalize team registration (on Canvas and on GitHub Classrooms)
    • Choose data set: IMDb or Yelp
    • Find research question (but note: this course is about engineering a research pipeline, not about rigorously answering the RQ!!!)
    • Start your data exploration (see workplan)