Introduction

In this tutorial, you’ll explore a data set and learn how to understand the context in which it was collected. By the end, you'll be able to create an RMarkdown document and render it as a PDF or HTML file—perfect for sharing insights or starting discussions!

Prerequisites

Ready to start? Here's what we recommend you do before going through this tutorial.

Optional (for starting R users):

Datacamp Intermediate R

The data set

Google’s COVID-19 Community Mobility Reports helped governments and researchers understand how the pandemic affected daily life and the economy. The data showed daily percentage changes in visits to different places, compared to the same day of the week before COVID-19 (January 3 – February 6, 2020).

The data includes six place categories:

Retail & recreation (e.g., malls, restaurants)
Grocery & pharmacy (e.g., supermarkets, drugstores)
Parks
Transit stations (e.g., bus and train stations)
Workplaces
Residences

We will work with the Dutch part of the data (tagged with NL).

As you go through the tutorial, you'll apply the skills you've learned in the prerequisite material - so this is a great chance to put them into practice! 🚀

What You'll Learn in this Tutorial

📊 Working with and Exploring Data in R
- Load data from text files into data frames
- Work with different data types (vectors, matrices, data frames)
- Modify data frames:
  - Add, remove, or rename columns
  - Filter data using indexes, conditions, or by handling missing values
- Use basic programming concepts:
  - if-else statements
  - for loops
  - Functions
🔍 Understanding and Analyzing Data
- Assess data quality and explore patterns
- Perform basic calculations (e.g., find averages)
- Generate summary statistics (summary(), table())
- Create visualizations (bar plots, time series plots)
- Format your findings in RMarkdown (HTML or PDF reports)

Support Needed?

For technical issues outside of scheduled classes, please check the support section on the course website.

Part 1: Loading and Inspecting the Data

Let's Create a New R Project

Create a new R project:
- File → New Project → New Directory → New Project
- Name it data_exploration
- Choose a location and click Create
Add a raw_data folder:
- Option 1 (Code): Run dir.create("raw_data")
- Option 2 (Manual): Click New Folder (📁) in the Files pane

Downloading and Loading the Data Set

Download the Region CSVs archive (ZIP file) from Google.
Unzip the file—it contains many country-specific datasets.
Move 2020_NL_Region_Mobility_Report.csv into your raw_data folder.
Create a new R script in your project folder.

Now, load the data into R using dplyr's read_csv function:

library(dplyr)
mobility <- read_csv("raw_data/2020_NL_Region_Mobility_Report.csv")

## Rows: 110241 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): country_region_code, country_region, sub_region_1, sub_region_2, i...
## dbl  (6): retail_and_recreation_percent_change_from_baseline, grocery_and_ph...
## lgl  (2): metro_area, census_fips_code
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Troubleshooting

If R doesn’t find your file: Click Session → Set working directory → To source file location. Avoid using setwd() — hardcoding directories can cause problems!
Don't manage to download the file directly from Google? You can also use an archived version: mobility <- read_csv('https://raw.githubusercontent.com/hannesdatta/course-dprep/refs/heads/master/content/docs/modules/week2/tutorial/2020_NL_Region_Mobility_Report.csv')

Exercise 1: Data Inspection

Examine the mobility data frame and describe its structure in your own words. Consider the following:

What are the column names?
How are the variables/metrics defined?
What is the unit of analysis (i.e., what uniquely identifies each row in the dataset)?
Is the data sorted in any specific way?
Are all values in each column complete and meaningful?
Are text-based data ("strings") properly encoded, or do they contain unusual characters that make them hard to read?

Tips:

Take a look at the data using the function head(mobility), or View(mobility). If you'd like to view more rows with head, use head(mobility, 100) (for, e.g., the first 100 rows). (tail() gives you the last rows of the data).
The command summary(mobility) generates descriptive statistics for all variables in the data. You can also use this command on individual columns (e.g., summary(mobility$retail_and_recreation_percent_change_from_baseline)).
Character or factor columns are best inspected using the table() command. These will create frequency tables.

head(mobility)

## # A tibble: 6 × 15
##   country_region_code country_region sub_region_1 sub_region_2 metro_area
##   <chr>               <chr>          <chr>        <chr>        <lgl>     
## 1 NL                  Netherlands    <NA>         <NA>         NA        
## 2 NL                  Netherlands    <NA>         <NA>         NA        
## 3 NL                  Netherlands    <NA>         <NA>         NA        
## 4 NL                  Netherlands    <NA>         <NA>         NA        
## 5 NL                  Netherlands    <NA>         <NA>         NA        
## 6 NL                  Netherlands    <NA>         <NA>         NA        
## # ℹ 10 more variables: iso_3166_2_code <chr>, census_fips_code <lgl>,
## #   place_id <chr>, date <date>,
## #   retail_and_recreation_percent_change_from_baseline <dbl>,
## #   grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## #   parks_percent_change_from_baseline <dbl>,
## #   transit_stations_percent_change_from_baseline <dbl>,
## #   workplaces_percent_change_from_baseline <dbl>, …

summary(mobility)

##  country_region_code country_region     sub_region_1       sub_region_2      
##  Length:110241       Length:110241      Length:110241      Length:110241     
##  Class :character    Class :character   Class :character   Class :character  
##  Mode  :character    Mode  :character   Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  metro_area     iso_3166_2_code    census_fips_code   place_id        
##  Mode:logical   Length:110241      Mode:logical     Length:110241     
##  NA's:110241    Class :character   NA's:110241      Class :character  
##                 Mode  :character                    Mode  :character  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##       date            retail_and_recreation_percent_change_from_baseline
##  Min.   :2020-02-15   Min.   :-97.00                                    
##  1st Qu.:2020-05-04   1st Qu.:-30.00                                    
##  Median :2020-07-21   Median :-15.00                                    
##  Mean   :2020-07-24   Mean   :-14.86                                    
##  3rd Qu.:2020-10-17   3rd Qu.: -1.00                                    
##  Max.   :2020-12-31   Max.   :199.00                                    
##                       NA's   :47051                                     
##  grocery_and_pharmacy_percent_change_from_baseline
##  Min.   :-96.00                                   
##  1st Qu.: -8.00                                   
##  Median : -2.00                                   
##  Mean   : -1.32                                   
##  3rd Qu.:  5.00                                   
##  Max.   :222.00                                   
##  NA's   :39935                                    
##  parks_percent_change_from_baseline
##  Min.   :-88.00                    
##  1st Qu.:-10.00                    
##  Median : 16.00                    
##  Mean   : 31.53                    
##  3rd Qu.: 56.00                    
##  Max.   :728.00                    
##  NA's   :87811                     
##  transit_stations_percent_change_from_baseline
##  Min.   :-93.00                               
##  1st Qu.:-52.00                               
##  Median :-40.00                               
##  Mean   :-36.95                               
##  3rd Qu.:-26.00                               
##  Max.   :222.00                               
##  NA's   :47938                                
##  workplaces_percent_change_from_baseline
##  Min.   :-90.00                         
##  1st Qu.:-41.00                         
##  Median :-28.00                         
##  Mean   :-27.31                         
##  3rd Qu.:-16.00                         
##  Max.   : 66.00                         
##  NA's   :6893                           
##  residential_percent_change_from_baseline
##  Min.   :-4.00                           
##  1st Qu.: 6.00                           
##  Median : 8.00                           
##  Mean   : 8.84                           
##  3rd Qu.:12.00                           
##  Max.   :32.00                           
##  NA's   :48247

table(mobility$sub_region_1)

## 
##       Drenthe     Flevoland     Friesland    Gelderland     Groningen 
##          3922          2127          5303         15372          4701 
##       Limburg North Brabant North Holland    Overijssel South Holland 
##          9134         18029         14189          7573         18115 
##       Utrecht       Zeeland 
##          7495          3960

# The dataset shows the mobility changes at a national, regional, and city level in the Netherlands, for multiple categories of places. The records are sorted by date, and are grouped by location (starting with country stats).

# Moreover, three of the columns (`metro_area`, `iso_3166_2_code`, `census_flips_code`) contain only empty data, and two other columns (`sub_region_1`, `sub_region_2`) are blank for a subset of the values. This also holds for the columns related to percentage changes in mobility scores which are missing in some cases (especially on a city-level).

Exercise 2: Filtering

Suppose you want to extract data for the Netherlands as a whole (excluding regional and city-level data). How would you filter the dataset to achieve this?

Once you've created this filtered dataset, generate a summary using summary() and check for missing values. What do you notice?

# hint: when sub_region_1 and sub_region_2 are NA, 
# the data pertains to the whole of The Netherlands.
filtered_df <- mobility %>% filter(is.na(sub_region_1) & is.na(sub_region_2))
summary(filtered_df)

##  country_region_code country_region     sub_region_1       sub_region_2      
##  Length:321          Length:321         Length:321         Length:321        
##  Class :character    Class :character   Class :character   Class :character  
##  Mode  :character    Mode  :character   Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##  metro_area     iso_3166_2_code    census_fips_code   place_id        
##  Mode:logical   Length:321         Mode:logical     Length:321        
##  NA's:321       Class :character   NA's:321         Class :character  
##                 Mode  :character                    Mode  :character  
##                                                                       
##                                                                       
##                                                                       
##       date            retail_and_recreation_percent_change_from_baseline
##  Min.   :2020-02-15   Min.   :-83.0                                     
##  1st Qu.:2020-05-05   1st Qu.:-34.0                                     
##  Median :2020-07-24   Median :-19.0                                     
##  Mean   :2020-07-24   Mean   :-19.5                                     
##  3rd Qu.:2020-10-12   3rd Qu.: -3.0                                     
##  Max.   :2020-12-31   Max.   : 21.0                                     
##  grocery_and_pharmacy_percent_change_from_baseline
##  Min.   :-75.000                                  
##  1st Qu.: -7.000                                  
##  Median : -4.000                                  
##  Mean   : -4.153                                  
##  3rd Qu.:  0.000                                  
##  Max.   : 24.000                                  
##  parks_percent_change_from_baseline
##  Min.   :-59.0                     
##  1st Qu.: 11.0                     
##  Median : 43.0                     
##  Mean   : 66.2                     
##  3rd Qu.:107.0                     
##  Max.   :286.0                     
##  transit_stations_percent_change_from_baseline
##  Min.   :-75.00                               
##  1st Qu.:-50.00                               
##  Median :-44.00                               
##  Mean   :-41.33                               
##  3rd Qu.:-36.00                               
##  Max.   : 10.00                               
##  workplaces_percent_change_from_baseline
##  Min.   :-85.00                         
##  1st Qu.:-39.00                         
##  Median :-29.00                         
##  Mean   :-26.04                         
##  3rd Qu.: -7.00                         
##  Max.   : 24.00                         
##  residential_percent_change_from_baseline
##  Min.   :-1.000                          
##  1st Qu.: 5.000                          
##  Median : 8.000                          
##  Mean   : 8.548                          
##  3rd Qu.:11.000                          
##  Max.   :27.000

# The number of missing values on most variables has drastically decreased. That's good news! It seems like the data is most "complete" at the national level.

Exercise 3: Summary statistics

Let's now start zooming in on some of the metrics in the data to check what they really mean.

3a: The dataset reports percentage changes in visits compared to a baseline (median values from January 3 – February 6, 2020).

Are percentages expressed as 0 to 1 (e.g., 0.50 means +50%) or 0 to 100 (e.g., 50 means +50%)?
Use summary() to check the range of values. What do you find?

# Exploring the range
summary(mobility$retail_and_recreation_percent_change_from_baseline)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -97.00  -30.00  -15.00  -14.86   -1.00  199.00   47051

# --> the range seems to be measured on a scale on which 100 means 100%.

3b: Identify Potential Issues with the Baseline

The baseline period (January–February 2020) was chosen to compare mobility changes. However, this might not always be ideal.

Can you think of situations where this baseline could lead to misleading results?
For example, how might seasonal effects (winter vs. summer) or special events (holidays, lockdown measures) affect the interpretation of mobility changes? Come up with at least one concrete example!

Problems that could arise from the baseline choice:

The data does not account for seasonality (e.g., park visits increase as the weather improves)
In the Netherlands, we experienced a relatively mild winter period throughout 2020 (e.g., the average temperature in January 2020 was the 3rd highest ever). As a result, the time distribution of certain categories (e.g., parks, residence) may not be representative for a typical year.
Probably a lot of other reasons! Try to think about more!

Part 2: Data Cleaning & Transformation

When working with a new dataset, the first step is often cleaning and narrowing it down to focus on what matters. Raw data can contain: - Columns that are irrelevant or unnecessary for our analysis - Subsets of data that we don’t need or want to filter out

Selecting or Dropping Columns

To make the dataset easier to work with, we can select only the columns we need using dplyr's select() function.

mobility_selected_columns = mobility %>% 
  select(country_region_code,
         transit_stations_percent_change_from_baseline)

There's also a way to delete columns: rather than selecting (country_region_code), you can add a - to the column name: -country_region_code.

Exercise 4: Dropping columns

Please delete the columns country_region_code, metro_area, iso_3166_2_code and census_fips_code from the data. How many columns do you end up with?

# Good solution
# ------------- 

# Define columns to remove
cols_to_drop <- c("country_region_code", "metro_area", "iso_3166_2_code", "census_fips_code")

# Drop the selected columns
mobility <- mobility %>% select(-all_of(cols_to_drop))

# Alternatively, you can specify columns directly:
# mobility <- mobility %>% select(-country_region_code, -metro_area, -iso_3166_2_code, -census_fips_code)

# Check the first few rows to confirm changes
head(mobility)

## # A tibble: 6 × 11
##   country_region sub_region_1 sub_region_2 place_id                   date      
##   <chr>          <chr>        <chr>        <chr>                      <date>    
## 1 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-15
## 2 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-16
## 3 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-17
## 4 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-18
## 5 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-19
## 6 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-20
## # ℹ 6 more variables: retail_and_recreation_percent_change_from_baseline <dbl>,
## #   grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## #   parks_percent_change_from_baseline <dbl>,
## #   transit_stations_percent_change_from_baseline <dbl>,
## #   workplaces_percent_change_from_baseline <dbl>,
## #   residential_percent_change_from_baseline <dbl>

# Count the remaining columns
ncol(mobility)

## [1] 11

# Note: never explicitly select columns by their indexes (e.g., index 2:4, index 8:15) - this code is likely to break!

Renaming Columns

Now that we have cleaned up the dataset, let's take a closer look at the remaining column names.

Some of them are quite long or not very clear, which can make analysis harder. Renaming columns can help make the data easier to read and work with.

We continue with our updated mobility data frame. Take a look—do you think any column names could be shortened or made clearer? 🚀

# show column names
colnames(mobility)

##  [1] "country_region"                                    
##  [2] "sub_region_1"                                      
##  [3] "sub_region_2"                                      
##  [4] "place_id"                                          
##  [5] "date"                                              
##  [6] "retail_and_recreation_percent_change_from_baseline"
##  [7] "grocery_and_pharmacy_percent_change_from_baseline" 
##  [8] "parks_percent_change_from_baseline"                
##  [9] "transit_stations_percent_change_from_baseline"     
## [10] "workplaces_percent_change_from_baseline"           
## [11] "residential_percent_change_from_baseline"

A good way to rename specific columns in a data frame is using dplyr's rename() function:

df <- df %>%
  rename(
    new_col_name_1 = old_col_name_1,
    new_col_name_2 = old_col_name_2,
    new_col_name_3 = old_col_name_3
  )

Exercise 5: Renaming columns

Please: - Rename sub_region_1 and sub_region_2 by province and city, respectively. - Change the very long column names (e.g., retail_and_recreation_percent_change_from_baseline) to retail_and_recreation.

# First use the rename function to rename sub_region_1 & sub_region_2. 
mobility_updated <- mobility %>% rename(province = sub_region_1,
                                city = sub_region_2,
                                retail_and_recreation = retail_and_recreation_percent_change_from_baseline,
                                grocery_and_pharmacy = grocery_and_pharmacy_percent_change_from_baseline,
                                parks = parks_percent_change_from_baseline,
                                transit_stations = transit_stations_percent_change_from_baseline,
                                workplaces = workplaces_percent_change_from_baseline,
                                residential = residential_percent_change_from_baseline)
                                

# Tip: See that you had to copy-paste a lot of variable names in constructing this solution?
# A better way to rename all those _percent_change_from_baseline columns is this: do you understand
# what exactly is happening here?

mobility_updated2 <- mobility %>% rename(province = sub_region_1,
                                city = sub_region_2) %>%
                         rename_with(~str_remove(., '_percent_change_from_baseline'))

# This solution works but is really not optimal, as the *format* of the input data may change (and hence, render the renamed columns wrong).

# Let's first take copy of our data so we don't run it on the "real one".

mobility_tmp = mobility

colnames(mobility_tmp) = c("country_region", 
                      "province",
                      "city", 
                      "place_id",
                      "date", 
                      "retail_and_recreation",
                      "grocery_and_pharmacy", 
                      "parks", 
                      "transit_stations", 
                      "workplaces", 
                      "residential")

Data Conversion

At first glance, dates in data sets might look like regular dates (e.g., 2020-02-15). However, R sometimes treats them as character strings rather than actual date objects. You can check this by running:

class(mobility_updated$date)

## [1] "Date"

In this particular case, R has properly recognized them as being in the date format. So all is good.

💡 Why does this matter? Date conversion can sometimes be tricky, especially with different formats across geographic regions (e.g., MM/DD/YYYY vs. DD/MM/YYYY). Here’s how to safely convert a character-encoded date into a proper date format in R: mobility_updated$date <- as.Date(mobility_updated$date).

Exercise 6: Calculating with dates

Now let's start calculating with our dates.

What's the first and last date in our data frame? Tip: you can use min() and max() now that the date column has been converted into date format!

The data runs from February 15, 2020 (min(mobility$date)) to December 31, 2020 (max(mobility$date)).

Observe above that you can not only use R code in code cells, but also in text by enclosing it in special characters:

`r min(mobility$date)`

Adding a New Column

Raw data often contains useful information, but sometimes we need to go beyond what’s given to make meaningful comparisons. By creating "derived metrics", we can summarize trends, spot patterns, and make analysis easier.
Example: Measuring Overall Movement Trends

Instead of looking at each mobility category separately, we can create a single metric to capture general movement patterns. Let’s define avg_mobility, which represents the average movement across different places (e.g., retail_and_recreation, grocery_and_pharmacy, etc.):

Exercise 7: Adding columns

Add a new column, called avg_mobility, to your dataset. Define it as the mean of all of the "place" columns.

Tip: use the rowMeans() function, and ensure NAs are not taken into consideration in calculating the averages.

E.g.,

mutate(rowMeans(cbind(col1, col2, col3)))

mobility_updated <- mobility_updated %>%
  mutate(avg_mobility = rowMeans(cbind(
    retail_and_recreation, 
    grocery_and_pharmacy, 
    parks, 
    transit_stations, 
    workplaces, 
    residential
  ), na.rm = TRUE))

# alternatively, you *could* use the function `rowMeans` - which calculates, PER ROW of the data, a particular mean. That would look like this:
columns <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks', 'transit_stations', 'workplaces', 'residential')
mobility_updated <- mobility_updated %>% mutate(avg_mobility2 = rowMeans(select(., all_of(columns)), na.rm =TRUE))

mobility_updated <- mobility_updated %>% mutate(avg_mobility_wrong = (retail_and_recreation + grocery_and_pharmacy + parks + transit_stations + workplaces + residential)/7)

# While this looks easy to implement (a mean is the sum, divided by the number of data points) - it's incorrect! Sometimes, columns are NA, and hence "drop" out of the data set. Plus: even if these values were ignored, then the average wouldn't be an average of 7 columns, but of fewer!

Creating New (Filtered) Data Sets

So far, we’ve seen that the dataset contains information at three levels: country, province, and city. To make analysis easier, we’ll separate these into three distinct datasets:

One for provinces
One for cities
One for the Netherlands as a whole

To do this, we need to filter the data and store each subset in a new data frame.

💡 How are the levels structured?
We can identify the different levels based on missing values in the province and city columns:

Country-level data has empty values in both province and city.
Province-level data has values in province but empty values in city.
City-level data has values in both province and city.

This is summarized in the table below, where X indicates the presence of data in a column:

Aggregation Level	country_region	province	city
Country	X
Province	X	X
City	X	X	X

Now, let’s filter the dataset accordingly!

Exercise 8: Filtering and Creating New Data Sets

Create three new datasets: country, province, and city from mobility, based on the descriptions above.

To filter missing values (NA), use is.na(column), which checks if a column is empty.

country <- mobility_updated %>% filter(is.na(city) & is.na(province))
province <- mobility_updated %>% filter(is.na(city) & !is.na(province))
city <- mobility_updated %>% filter(!is.na(city))

Exercise 9: Writing a Function

Sometimes, it's useful to write functions to repeat actions efficiently—especially when working with multiple datasets.

Here, we create a function inspect_data() that takes a data frame (df) and:
- Generates descriptive statistics (summary()), - Reports the number of rows (nrow()) and columns (ncol()), and - Displays the date range in the dataset

Start with the code snippet below and expand it to include all the required information in this exercise.

inspect_data <- function(df) {
  cat("Generating descriptive statistics...\n\n")
  cat("\n\n") # Add new line
  print(summary(df))  # Print Basic summary statistics
  
  # Add more here...
}

inspect_data <- function(df) {
  cat("Generating descriptive statistics...\n\n")
  cat("\n\n") # Add new line
  
  cat('Summary statistics\n')
  print(summary(df))
  cat('\n\n')
  
  cat('Number of columns: ')
  cat(ncol(df))
  cat('\n\n')
  
  cat('Number of observations: ')
  cat(nrow(df))
  cat('\n\n')
  
  cat('Range of the data:\n')
  summary(df$date)
  cat('\n\n')

}

inspect_data(country)

## Generating descriptive statistics...
## 
## 
## 
## Summary statistics
##  country_region       province             city             place_id        
##  Length:321         Length:321         Length:321         Length:321        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       date            retail_and_recreation grocery_and_pharmacy
##  Min.   :2020-02-15   Min.   :-83.0         Min.   :-75.000     
##  1st Qu.:2020-05-05   1st Qu.:-34.0         1st Qu.: -7.000     
##  Median :2020-07-24   Median :-19.0         Median : -4.000     
##  Mean   :2020-07-24   Mean   :-19.5         Mean   : -4.153     
##  3rd Qu.:2020-10-12   3rd Qu.: -3.0         3rd Qu.:  0.000     
##  Max.   :2020-12-31   Max.   : 21.0         Max.   : 24.000     
##      parks       transit_stations   workplaces      residential    
##  Min.   :-59.0   Min.   :-75.00   Min.   :-85.00   Min.   :-1.000  
##  1st Qu.: 11.0   1st Qu.:-50.00   1st Qu.:-39.00   1st Qu.: 5.000  
##  Median : 43.0   Median :-44.00   Median :-29.00   Median : 8.000  
##  Mean   : 66.2   Mean   :-41.33   Mean   :-26.04   Mean   : 8.548  
##  3rd Qu.:107.0   3rd Qu.:-36.00   3rd Qu.: -7.00   3rd Qu.:11.000  
##  Max.   :286.0   Max.   : 10.00   Max.   : 24.00   Max.   :27.000  
## 
## 
## Number of columns: 11
## 
## Number of observations: 321
## 
## Range of the data:

inspect_data(province)

## Generating descriptive statistics...
## 
## 
## 
## Summary statistics
##  country_region       province             city             place_id        
##  Length:3852        Length:3852        Length:3852        Length:3852       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       date            retail_and_recreation grocery_and_pharmacy
##  Min.   :2020-02-15   Min.   :-86.00        Min.   :-86.00      
##  1st Qu.:2020-05-05   1st Qu.:-32.00        1st Qu.: -8.00      
##  Median :2020-07-24   Median :-17.00        Median : -3.00      
##  Mean   :2020-07-24   Mean   :-15.74        Mean   : -2.51      
##  3rd Qu.:2020-10-12   3rd Qu.: -1.00        3rd Qu.:  2.00      
##  Max.   :2020-12-31   Max.   :142.00        Max.   : 75.00      
##                       NA's   :19            NA's   :3           
##      parks        transit_stations   workplaces      residential    
##  Min.   :-67.00   Min.   :-82.00   Min.   :-88.00   Min.   :-3.000  
##  1st Qu.: 12.00   1st Qu.:-51.00   1st Qu.:-40.00   1st Qu.: 5.000  
##  Median : 45.00   Median :-40.00   Median :-27.00   Median : 8.000  
##  Mean   : 77.95   Mean   :-36.34   Mean   :-25.04   Mean   : 8.135  
##  3rd Qu.:107.00   3rd Qu.:-26.00   3rd Qu.: -8.00   3rd Qu.:11.000  
##  Max.   :728.00   Max.   : 74.00   Max.   : 31.00   Max.   :28.000  
##  NA's   :287      NA's   :40       NA's   :9                        
## 
## 
## Number of columns: 11
## 
## Number of observations: 3852
## 
## Range of the data:

inspect_data(city)

## Generating descriptive statistics...
## 
## 
## 
## Summary statistics
##  country_region       province             city             place_id        
##  Length:106068      Length:106068      Length:106068      Length:106068     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       date            retail_and_recreation grocery_and_pharmacy
##  Min.   :2020-02-15   Min.   :-97.00        Min.   :-96.00      
##  1st Qu.:2020-05-04   1st Qu.:-30.00        1st Qu.: -8.00      
##  Median :2020-07-21   Median :-15.00        Median : -2.00      
##  Mean   :2020-07-24   Mean   :-14.77        Mean   : -1.24      
##  3rd Qu.:2020-10-17   3rd Qu.: -1.00        3rd Qu.:  5.00      
##  Max.   :2020-12-31   Max.   :199.00        Max.   :222.00      
##                       NA's   :47032         NA's   :39932       
##      parks        transit_stations   workplaces     residential   
##  Min.   :-88.00   Min.   :-93.00   Min.   :-90.0   Min.   :-4.00  
##  1st Qu.:-14.00   1st Qu.:-52.00   1st Qu.:-41.0   1st Qu.: 6.00  
##  Median : 11.00   Median :-40.00   Median :-28.0   Median : 8.00  
##  Mean   : 22.01   Mean   :-36.97   Mean   :-27.4   Mean   : 8.89  
##  3rd Qu.: 46.00   3rd Qu.:-26.00   3rd Qu.:-16.0   3rd Qu.:12.00  
##  Max.   :629.00   Max.   :222.00   Max.   : 66.0   Max.   :32.00  
##  NA's   :87524    NA's   :47898    NA's   :6884    NA's   :48247  
## 
## 
## Number of columns: 11
## 
## Number of observations: 106068
## 
## Range of the data:

Observe the high occurrence of missing levels at lower aggregation level (e.g., cities; especially parks). Further inspection of the documentation tells us that these data gaps are intentional because the data doesn't meet the quality and privacy threshold!

Why Use a Function?

Of course, you could generate all these statistics manually, like this:

summary(country)
nrow(country)
ncol(country)

summary(province)
nrow(province)
ncol(province)

# etc.

However, the goal here isn't just to compute summary statistics—it’s about writing functions to make your code more efficient and reusable. Functions help automate repetitive tasks, making your workflow cleaner and easier to maintain!

Missing Observations

Missing values can impact analysis, so let’s first identify how many are missing before deciding how to handle them. Here's code that shows the percentage missings per variable mentioned in columns, for each of the three data sets. No worries - the code below is quite complex. Try to read it - but we don't expect you to write code like this at this moment!

columns <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks', 'transit_stations', 'workplaces', 'residential')

missing_summary <- bind_rows(
  country %>% summarise(across(all_of(columns), ~ mean(is.na(.)) * 100)) %>% mutate(subset = "country"),
  province %>% summarise(across(all_of(columns), ~ mean(is.na(.)) * 100)) %>% mutate(subset = "province"),
  city %>% summarise(across(all_of(columns), ~ mean(is.na(.)) * 100)) %>% mutate(subset = "city")
) %>% relocate(subset)

missing_summary %>% kable()

subset	retail_and_recreation	grocery_and_pharmacy	parks	transit_stations	workplaces	residential
country	0.0000000	0.0000000	0.000000	0.000000	0.0000000	0.00000
province	0.4932503	0.0778816	7.450675	1.038422	0.2336449	0.00000
city	44.3413659	37.6475469	82.516876	45.157823	6.4901761	45.48686

After seeing the summary, you should realize that some datasets have a lot more missing values than others!

Exercise 10: Replacing Missing Values

Let's assume that we want to replace missing values in the province and city data sets with their averages across the whole data set.

Please implement this for all columns: retail_and_recreation, grocery_and_pharmacy, parks, transit_stations, workplaces, residential.

# Replace missing values in `province` dataset with column means
province <- province %>%
  mutate(
    retail_and_recreation = ifelse(is.na(retail_and_recreation), mean(retail_and_recreation, na.rm = TRUE), retail_and_recreation),
    grocery_and_pharmacy = ifelse(is.na(grocery_and_pharmacy), mean(grocery_and_pharmacy, na.rm = TRUE), grocery_and_pharmacy),
    parks = ifelse(is.na(parks), mean(parks, na.rm = TRUE), parks),
    transit_stations = ifelse(is.na(transit_stations), mean(transit_stations, na.rm = TRUE), transit_stations),
    workplaces = ifelse(is.na(workplaces), mean(workplaces, na.rm = TRUE), workplaces),
    residential = ifelse(is.na(residential), mean(residential, na.rm = TRUE), residential)
  )

# Replace missing values in `city` dataset with column means
city <- city %>%
  mutate(
    retail_and_recreation = ifelse(is.na(retail_and_recreation), mean(retail_and_recreation, na.rm = TRUE), retail_and_recreation),
    grocery_and_pharmacy = ifelse(is.na(grocery_and_pharmacy), mean(grocery_and_pharmacy, na.rm = TRUE), grocery_and_pharmacy),
    parks = ifelse(is.na(parks), mean(parks, na.rm = TRUE), parks),
    transit_stations = ifelse(is.na(transit_stations), mean(transit_stations, na.rm = TRUE), transit_stations),
    workplaces = ifelse(is.na(workplaces), mean(workplaces, na.rm = TRUE), workplaces),
    residential = ifelse(is.na(residential), mean(residential, na.rm = TRUE), residential)
  )

# You could also use the following code - which has a lot of benefits as you don't have to write down all column names over and over again.
# Can you read and understand this code?

# Define the columns to update
columns <- c("retail_and_recreation", "grocery_and_pharmacy", "parks", 
             "transit_stations", "workplaces", "residential")

# Replace missing values in `province` dataset with overall mean
province <- province %>%
  mutate(across(all_of(columns), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Replace missing values in `city` dataset with overall mean
city <- city %>%
  mutate(across(all_of(columns), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

Part 3: Data Exploration and Plotting

Visualizing Data with `ggplot2`

Now that our data is clean and structured, it’s time to visualize it! Instead of using the base R plot() function, we will use ggplot2, a powerful and flexible plotting package.

Why Use `ggplot2`?

ggplot2 is part of the tidyverse and follows a layered approach to building plots. It:

Allows easy customization
Works seamlessly with dplyr pipelines
Supports multiple data visualizations in a single plot

ggplot2 is widely popular, and creates much betters visualizations than the built-in R plots such as plot() or hist().

Basics of `ggplot2`

A ggplot2 plot is built in layers:

1️⃣ Define the dataset: ggplot(data = your_data)
2️⃣ Set aesthetics (aes): aes(x = ..., y = ...)
3️⃣ Choose a geometry (geom_): geom_line(), geom_point(), etc.
4️⃣ Customize (titles, colors, themes, labels, etc.)

Time-Series Plots

One set of data points

We’ll start by plotting percentage changes in park visits over time using geom_line().

library(ggplot2)

ggplot(data = country, aes(x = date, y = parks)) +
  geom_line(color = "black") +
  labs(title = "Changes in Visits to Parks Over Time",
       x = "Date",
       y = "% Change from Baseline") +
  theme_minimal()

Multiple data points

Now, let’s compare two trends in the same plot—time spent at home (residential) vs. at work (workplaces).

ggplot(data = country) +
  geom_line(aes(x = date, y = residential, color = "Residential")) +
  geom_line(aes(x = date, y = workplaces, color = "Workplace")) +
  scale_color_manual(values = c("Residential" = "red", "Workplace" = "blue")) +
  labs(title = "Less Time at Work, More Time at Home",
       x = "Date",
       y = "% Change from Baseline",
       color = "Legend") +
  theme_minimal()

Explanation of the Code

ggplot(data = country) → Uses country as the dataset
aes(x = date, y = parks) → Maps date to the x-axis and parks to the y-axis
geom_line(color = "green") → Draws a green line for parks data
geom_line(aes(y = residential, color = "Residential")) → Adds a second line for residential data
scale_color_manual() → Defines colors for different lines
labs() → Adds titles, labels, and legends
theme_minimal() → Uses a cleaner theme for better readability

Now, try customizing these plots by changing colors, adding more categories, or adjusting labels! 🚀

Exercise 11: Plotting

Please create a time series chart, in which you plot the time spent at home (residential) vs. at work (workplaces) for the province of "North Brabant", using the province data. Remember to adjust the title of the plot!

Now, let’s compare two trends in the same plot—time spent at home (residential) vs. at work (workplaces).

province %>% filter(province == 'North Brabant') %>% ggplot() +
  geom_line(aes(x = date, y = residential, color = "Residential")) +
  geom_line(aes(x = date, y = workplaces, color = "Workplace")) +
  scale_color_manual(values = c("Residential" = "red", "Workplace" = "blue")) +
  labs(title = "Time at Home vs. at Work in North Brabant",
       x = "Date",
       y = "% Change from Baseline",
       color = "Legend") +
  theme_minimal()

Bar charts

Bar charts are great for comparing categories in a dataset. Instead of showing how values change over time (like line charts), bar charts visualize differences between groups at a single point in time.

When to Use a Bar Chart?

To compare values across different categories
To show aggregated statistics (e.g., averages, sums, or counts)
To highlight differences in groups

Basics of ggplot2 Bar Charts

A bar chart in ggplot2 typically follows this structure:

1️⃣ Define the dataset → ggplot(data = your_data)
2️⃣ Set aesthetics (aes()) → aes(x = category, y = value)
3️⃣ Choose a bar geometry → geom_col() or geom_bar()
4️⃣ Customize (titles, colors, labels, etc.)

Example: Visits to Different Places

Let's create a bar chart showing the average mobility change across different locations using the country data set.

library(ggplot2)
library(dplyr)

# Pivot the country dataset to long format
country_long <- country %>%
  pivot_longer(cols = c(retail_and_recreation, grocery_and_pharmacy, parks, 
                        transit_stations, workplaces, residential),
               names_to = "place_category",
               values_to = "mobility_change")

# View the transformed data
head(country_long)

## # A tibble: 6 × 7
##   country_region province city  place_id               date       place_category
##   <chr>          <chr>    <chr> <chr>                  <date>     <chr>         
## 1 Netherlands    <NA>     <NA>  ChIJu-SH28MJxkcRnwq9_… 2020-02-15 retail_and_re…
## 2 Netherlands    <NA>     <NA>  ChIJu-SH28MJxkcRnwq9_… 2020-02-15 grocery_and_p…
## 3 Netherlands    <NA>     <NA>  ChIJu-SH28MJxkcRnwq9_… 2020-02-15 parks         
## 4 Netherlands    <NA>     <NA>  ChIJu-SH28MJxkcRnwq9_… 2020-02-15 transit_stati…
## 5 Netherlands    <NA>     <NA>  ChIJu-SH28MJxkcRnwq9_… 2020-02-15 workplaces    
## 6 Netherlands    <NA>     <NA>  ChIJu-SH28MJxkcRnwq9_… 2020-02-15 residential   
## # ℹ 1 more variable: mobility_change <dbl>

# Aggregate data
country_summary <- country_long %>% group_by(place_category) %>% summarize(avg_mobility_change = mean(mobility_change, na.rm=T))

# Create bar chart
country_summary %>% ggplot(aes(x = place_category, y = avg_mobility_change, fill = place_category)) +
  geom_col() +
  labs(title = "Average Mobility Change by Place",
       x = "Place Category", y = "% Change from Baseline") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate labels for readability

Exercise 12: Creating a Bar Chart

In this exercise, we will create a grouped bar chart to compare the average mobility change for different provinces. Instead of focusing on just one province, we will compare multiple provinces side by side.

To help you get started, we already create the "correct" data set you can use for plotting, called province_summary.

# Pivot the province dataset to long format
province_long <- province %>%
  pivot_longer(cols = c(retail_and_recreation, grocery_and_pharmacy, parks, 
                        transit_stations, workplaces, residential),
               names_to = "place_category",
               values_to = "mobility_change")
# Aggregate data
province_summary <- province_long %>% group_by(province, place_category) %>% summarize(avg_mobility_change = mean(mobility_change, na.rm=T))

## `summarise()` has grouped output by 'province'. You can override using the
## `.groups` argument.

Tips:

Use ggplot2 with geom_col() to compare categories across provinces.
Use facet_wrap(~province) to create separate charts for each province.
Adjust colors, titles, and labels for clarity.

province_summary %>% ggplot(aes(x = place_category, y = avg_mobility_change, fill = place_category)) +
  geom_col(position = "dodge") +
  facet_wrap(~province) +
  labs(title = "Average Mobility Change by Place Category Across Provinces",
       x = "Place Category",
       y = "% Change from Baseline",
       fill = "Place Category") +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

On a final note: you can conveniently save your plots using the function ggsave(), directly after plotting.

ggsave('plot.pdf')

## Saving 7 x 5 in image

This snippet will create plot.pdf in your current working directory!

Conclusion

Congratulations! You've worked through a full data exploration and visualization pipeline in R. Here are the key takeaways from this tutorial:

Data Handling & Cleaning

Learned how to load and inspect data using read_csv(), summary(), and head().
Used filtering (filter()) and selecting (select()) to focus on relevant data.
Managed missing values, replacing them with averages when necessary.
Created new columns (mutate()) to derive meaningful metrics like avg_mobility.

Writing Efficient & Reusable Code

Use functions (like the one we have written - inspect_data()) to automate repetitive tasks.
Worked with tidy data principles, including pivot_longer() for reshaping datasets.

Data Visualization with ggplot2

Created time-series plots (geom_line()) to track changes over time.
Built bar charts (geom_col()) to compare mobility changes across locations.
Customized plots with titles, labels, colors, and themes for better readability.

Next Steps? 🚀

Experiment with other datasets—the skills you've learned apply everywhere!
Try to explore your data set using RMarkdown, and actually render it as a PDF or HTML file (click)
Keep practicing and tweaking your code—real-world data is rarely perfect, and exploring, cleaning, and visualizing data is an iterative process!

Well done, and happy coding! 🎉

Data exploration with RMarkdown and ggplot2 (dPrep)

Last updated: 4 February 2025

Introduction

Prerequisites

The data set

What You'll Learn in this Tutorial

Support Needed?

Part 1: Loading and Inspecting the Data

Let's Create a New R Project

Downloading and Loading the Data Set

Exercise 1: Data Inspection

Exercise 2: Filtering

Exercise 3: Summary statistics

Part 2: Data Cleaning & Transformation

Selecting or Dropping Columns

Exercise 4: Dropping columns

Renaming Columns

Exercise 5: Renaming columns

Data Conversion

Exercise 6: Calculating with dates

Adding a New Column

Exercise 7: Adding columns

Creating New (Filtered) Data Sets

Exercise 8: Filtering and Creating New Data Sets

Exercise 9: Writing a Function

Why Use a Function?

Missing Observations

Exercise 10: Replacing Missing Values

Part 3: Data Exploration and Plotting

Visualizing Data with `ggplot2`

Why Use `ggplot2`?

Basics of `ggplot2`

Time-Series Plots

Explanation of the Code

Exercise 11: Plotting

Bar charts

Exercise 12: Creating a Bar Chart

Conclusion

Data exploration with RMarkdown and ggplot2 (dPrep)

Last updated: 4 February 2025

Introduction

Prerequisites

The data set

What You'll Learn in this Tutorial

Support Needed?

Part 1: Loading and Inspecting the Data

Let's Create a New R Project

Downloading and Loading the Data Set

Exercise 1: Data Inspection

Exercise 2: Filtering

Exercise 3: Summary statistics

Part 2: Data Cleaning & Transformation

Selecting or Dropping Columns

Exercise 4: Dropping columns

Renaming Columns

Exercise 5: Renaming columns

Data Conversion

Exercise 6: Calculating with dates

Adding a New Column

Exercise 7: Adding columns

Creating New (Filtered) Data Sets

Exercise 8: Filtering and Creating New Data Sets

Exercise 9: Writing a Function

Why Use a Function?

Missing Observations

Exercise 10: Replacing Missing Values

Part 3: Data Exploration and Plotting

Visualizing Data with ggplot2

Why Use ggplot2?

Basics of ggplot2

Time-Series Plots

Explanation of the Code

Exercise 11: Plotting

Bar charts

Exercise 12: Creating a Bar Chart

Conclusion

Visualizing Data with `ggplot2`

Why Use `ggplot2`?

Basics of `ggplot2`