In this tutorial, you’ll explore a data set and learn how to understand the context in which it was collected. By the end, you'll be able to create an RMarkdown document and render it as a PDF or HTML file—perfect for sharing insights or starting discussions!
Ready to start? Here's what we recommend you do before going through this tutorial.
Optional (for starting R users):
Google’s COVID-19 Community Mobility Reports helped governments and researchers understand how the pandemic affected daily life and the economy. The data showed daily percentage changes in visits to different places, compared to the same day of the week before COVID-19 (January 3 – February 6, 2020).
The data includes six place categories:
We will work with the Dutch part of the data (tagged with `NL`).
As you go through the tutorial, you'll apply the skills you've learned in the prerequisite material - so this is a great chance to put them into practice! 🚀
- `if-else` statements
- `for` loops
- Descriptive statistics (e.g., `summary()`, `table()`)

For technical issues outside of scheduled classes, please check the support section on the course website.
First, create a `raw_data` folder:

dir.create("raw_data")

Then, download `2020_NL_Region_Mobility_Report.csv` into your `raw_data` folder. Now, load the data into R using readr's `read_csv` function:
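A minimal sketch of the loading step (assuming the file was saved under `raw_data/` in your project folder):

```r
library(tidyverse) # loads readr (for read_csv) and dplyr

mobility <- read_csv('raw_data/2020_NL_Region_Mobility_Report.csv')
```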
## Rows: 110241 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): country_region_code, country_region, sub_region_1, sub_region_2, i...
## dbl (6): retail_and_recreation_percent_change_from_baseline, grocery_and_ph...
## lgl (2): metro_area, census_fips_code
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Troubleshooting

If R can't find the file, double-check your working directory, but avoid setwd() — hardcoding directories can cause problems! Alternatively, you can read the file directly from the web:

mobility <- read_csv('https://raw.githubusercontent.com/hannesdatta/course-dprep/refs/heads/master/content/docs/modules/week2/tutorial/2020_NL_Region_Mobility_Report.csv')
Examine the mobility
data frame and describe its
structure in your own words. Consider the following:
Tips:

- Take a look at the data using `head(mobility)` or `View(mobility)`. If you'd like to view more rows with `head`, use `head(mobility, 100)` (e.g., for the first 100 rows). (`tail()` gives you the last rows of the data.)
- The command `summary(mobility)` generates descriptive statistics for all variables in the data. You can also use this command on individual columns (e.g., `summary(mobility$retail_and_recreation_percent_change_from_baseline)`).
- Character or factor columns are best inspected using the `table()` command, which creates frequency tables.
## # A tibble: 6 × 15
## country_region_code country_region sub_region_1 sub_region_2 metro_area
## <chr> <chr> <chr> <chr> <lgl>
## 1 NL Netherlands <NA> <NA> NA
## 2 NL Netherlands <NA> <NA> NA
## 3 NL Netherlands <NA> <NA> NA
## 4 NL Netherlands <NA> <NA> NA
## 5 NL Netherlands <NA> <NA> NA
## 6 NL Netherlands <NA> <NA> NA
## # ℹ 10 more variables: iso_3166_2_code <chr>, census_fips_code <lgl>,
## # place_id <chr>, date <date>,
## # retail_and_recreation_percent_change_from_baseline <dbl>,
## # grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## # parks_percent_change_from_baseline <dbl>,
## # transit_stations_percent_change_from_baseline <dbl>,
## # workplaces_percent_change_from_baseline <dbl>, …
## country_region_code country_region sub_region_1 sub_region_2
## Length:110241 Length:110241 Length:110241 Length:110241
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## metro_area iso_3166_2_code census_fips_code place_id
## Mode:logical Length:110241 Mode:logical Length:110241
## NA's:110241 Class :character NA's:110241 Class :character
## Mode :character Mode :character
##
##
##
##
## date retail_and_recreation_percent_change_from_baseline
## Min. :2020-02-15 Min. :-97.00
## 1st Qu.:2020-05-04 1st Qu.:-30.00
## Median :2020-07-21 Median :-15.00
## Mean :2020-07-24 Mean :-14.86
## 3rd Qu.:2020-10-17 3rd Qu.: -1.00
## Max. :2020-12-31 Max. :199.00
## NA's :47051
## grocery_and_pharmacy_percent_change_from_baseline
## Min. :-96.00
## 1st Qu.: -8.00
## Median : -2.00
## Mean : -1.32
## 3rd Qu.: 5.00
## Max. :222.00
## NA's :39935
## parks_percent_change_from_baseline
## Min. :-88.00
## 1st Qu.:-10.00
## Median : 16.00
## Mean : 31.53
## 3rd Qu.: 56.00
## Max. :728.00
## NA's :87811
## transit_stations_percent_change_from_baseline
## Min. :-93.00
## 1st Qu.:-52.00
## Median :-40.00
## Mean :-36.95
## 3rd Qu.:-26.00
## Max. :222.00
## NA's :47938
## workplaces_percent_change_from_baseline
## Min. :-90.00
## 1st Qu.:-41.00
## Median :-28.00
## Mean :-27.31
## 3rd Qu.:-16.00
## Max. : 66.00
## NA's :6893
## residential_percent_change_from_baseline
## Min. :-4.00
## 1st Qu.: 6.00
## Median : 8.00
## Mean : 8.84
## 3rd Qu.:12.00
## Max. :32.00
## NA's :48247
##
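The frequency table of provinces below can be reproduced with `table()`, e.g. (a sketch, assuming the loaded `mobility` data frame):

```r
table(mobility$sub_region_1)
```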
## Drenthe Flevoland Friesland Gelderland Groningen
## 3922 2127 5303 15372 4701
## Limburg North Brabant North Holland Overijssel South Holland
## 9134 18029 14189 7573 18115
## Utrecht Zeeland
## 7495 3960
# The dataset shows mobility changes at the national, regional, and city level in the Netherlands, for multiple categories of places. The records are sorted by date and grouped by location (starting with country-level stats).
# Moreover, three of the columns (`metro_area`, `iso_3166_2_code`, `census_fips_code`) contain only empty data, and two other columns (`sub_region_1`, `sub_region_2`) are blank for a subset of the rows. The same holds for the columns with percentage changes in mobility scores, which are missing in some cases (especially at the city level).
Suppose you want to extract data for the Netherlands as a whole (excluding regional and city-level data). How would you filter the dataset to achieve this?
Once you've created this filtered dataset, generate a
summary using summary()
and check for
missing values. What do you notice?
# hint: when sub_region_1 and sub_region_2 are NA,
# the data pertains to the whole of The Netherlands.
filtered_df <- mobility %>% filter(is.na(sub_region_1) & is.na(sub_region_2))
summary(filtered_df)
## country_region_code country_region sub_region_1 sub_region_2
## Length:321 Length:321 Length:321 Length:321
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## metro_area iso_3166_2_code census_fips_code place_id
## Mode:logical Length:321 Mode:logical Length:321
## NA's:321 Class :character NA's:321 Class :character
## Mode :character Mode :character
##
##
##
## date retail_and_recreation_percent_change_from_baseline
## Min. :2020-02-15 Min. :-83.0
## 1st Qu.:2020-05-05 1st Qu.:-34.0
## Median :2020-07-24 Median :-19.0
## Mean :2020-07-24 Mean :-19.5
## 3rd Qu.:2020-10-12 3rd Qu.: -3.0
## Max. :2020-12-31 Max. : 21.0
## grocery_and_pharmacy_percent_change_from_baseline
## Min. :-75.000
## 1st Qu.: -7.000
## Median : -4.000
## Mean : -4.153
## 3rd Qu.: 0.000
## Max. : 24.000
## parks_percent_change_from_baseline
## Min. :-59.0
## 1st Qu.: 11.0
## Median : 43.0
## Mean : 66.2
## 3rd Qu.:107.0
## Max. :286.0
## transit_stations_percent_change_from_baseline
## Min. :-75.00
## 1st Qu.:-50.00
## Median :-44.00
## Mean :-41.33
## 3rd Qu.:-36.00
## Max. : 10.00
## workplaces_percent_change_from_baseline
## Min. :-85.00
## 1st Qu.:-39.00
## Median :-29.00
## Mean :-26.04
## 3rd Qu.: -7.00
## Max. : 24.00
## residential_percent_change_from_baseline
## Min. :-1.000
## 1st Qu.: 5.000
## Median : 8.000
## Mean : 8.548
## 3rd Qu.:11.000
## Max. :27.000
Let's now start zooming in on some of the metrics in the data to check what they really mean.
3a: The dataset reports percentage changes in visits compared to a baseline (median values from January 3 – February 6, 2020).
Are values on a scale from 0 to 1 (e.g., `0.50` means +50%) or from 0 to 100 (e.g., `50` means +50%)? Use `summary()` to check the range of values. What do you find?

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -97.00 -30.00 -15.00 -14.86 -1.00 199.00 47051
3b: Identify Potential Issues with the Baseline
The baseline period (January–February 2020) was chosen to compare mobility changes. However, this might not always be ideal.
Problems that could arise from the baseline choice:
When working with a new dataset, the first step is often cleaning and narrowing it down to focus on what matters. Raw data can contain:

- Columns that are irrelevant or unnecessary for our analysis
- Subsets of data that we don't need or want to filter out
To make the dataset easier to work with, we can select only the
columns we need using dplyr
's select()
function.
mobility_selected_columns <- mobility %>%
select(country_region_code,
transit_stations_percent_change_from_baseline)
There's also a way to delete columns: rather than selecting a column (`country_region_code`), you can add a `-` to the column name: `-country_region_code`.
Please delete the columns `country_region_code`, `metro_area`, `iso_3166_2_code`, and `census_fips_code` from the data. How many columns do you end up with?
# Good solution
# -------------
# Define columns to remove
cols_to_drop <- c("country_region_code", "metro_area", "iso_3166_2_code", "census_fips_code")
# Drop the selected columns
mobility <- mobility %>% select(-all_of(cols_to_drop))
# Alternatively, you can specify columns directly:
# mobility <- mobility %>% select(-country_region_code, -metro_area, -iso_3166_2_code, -census_fips_code)
# Check the first few rows to confirm changes
head(mobility)
## # A tibble: 6 × 11
## country_region sub_region_1 sub_region_2 place_id date
## <chr> <chr> <chr> <chr> <date>
## 1 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-15
## 2 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-16
## 3 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-17
## 4 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-18
## 5 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-19
## 6 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-20
## # ℹ 6 more variables: retail_and_recreation_percent_change_from_baseline <dbl>,
## # grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## # parks_percent_change_from_baseline <dbl>,
## # transit_stations_percent_change_from_baseline <dbl>,
## # workplaces_percent_change_from_baseline <dbl>,
## # residential_percent_change_from_baseline <dbl>
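The column count reported below can be obtained with `ncol()`:

```r
ncol(mobility)
```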
## [1] 11
Now that we have cleaned up the dataset, let's take a closer look at the remaining column names.
Some of them are quite long or not very clear, which can make analysis harder. Renaming columns can help make the data easier to read and work with.
We continue with our updated mobility
data frame. Take a
look—do you think any column names could be shortened or made clearer?
🚀
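The list of column names below can be produced with `colnames()`:

```r
colnames(mobility)
```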
## [1] "country_region"
## [2] "sub_region_1"
## [3] "sub_region_2"
## [4] "place_id"
## [5] "date"
## [6] "retail_and_recreation_percent_change_from_baseline"
## [7] "grocery_and_pharmacy_percent_change_from_baseline"
## [8] "parks_percent_change_from_baseline"
## [9] "transit_stations_percent_change_from_baseline"
## [10] "workplaces_percent_change_from_baseline"
## [11] "residential_percent_change_from_baseline"
A good way to rename specific columns in a data frame is using
dplyr
's rename()
function:
df <- df %>%
rename(
new_col_name_1 = old_col_name_1,
new_col_name_2 = old_col_name_2,
new_col_name_3 = old_col_name_3
)
Please:

- Rename `sub_region_1` and `sub_region_2` to `province` and `city`, respectively.
- Change the very long column names (e.g., `retail_and_recreation_percent_change_from_baseline`) to shorter ones (e.g., `retail_and_recreation`).
# First use the rename function to rename sub_region_1 & sub_region_2.
mobility_updated <- mobility %>% rename(province = sub_region_1,
city = sub_region_2,
retail_and_recreation = retail_and_recreation_percent_change_from_baseline,
grocery_and_pharmacy = grocery_and_pharmacy_percent_change_from_baseline,
parks = parks_percent_change_from_baseline,
transit_stations = transit_stations_percent_change_from_baseline,
workplaces = workplaces_percent_change_from_baseline,
residential = residential_percent_change_from_baseline)
# Tip: See that you had to copy-paste a lot of variable names in constructing this solution?
# A better way to rename all those _percent_change_from_baseline columns is this: do you understand
# what exactly is happening here?
mobility_updated2 <- mobility %>% rename(province = sub_region_1,
city = sub_region_2) %>%
rename_with(~str_remove(., '_percent_change_from_baseline'))
# This solution works but is really not optimal, as the *format* of the input data may change (and hence render the renamed columns wrong).
# Let's first make a copy of our data so we don't run it on the "real" one.
mobility_tmp <- mobility
colnames(mobility_tmp) = c("country_region",
"province",
"city",
"place_id",
"date",
"retail_and_recreation",
"grocery_and_pharmacy",
"parks",
"transit_stations",
"workplaces",
"residential")
At first glance, dates in data sets might look like regular
dates (e.g., 2020-02-15
). However, R sometimes treats them
as character strings rather than actual date objects.
You can check this by running:
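A sketch of the check (assuming the renamed `mobility_updated` data frame from above):

```r
class(mobility_updated$date)
```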
## [1] "Date"
In this particular case, R has properly recognized them as being in
the date
format. So all is good.
💡 Why does this matter? Date conversion can
sometimes be tricky, especially with different formats across geographic
regions (e.g., MM/DD/YYYY vs. DD/MM/YYYY). Here’s how to safely convert
a character-encoded date into a proper date format in R:
mobility_updated$date <- as.Date(mobility_updated$date)
Now let's start calculating with our dates.
What's the first and last date in our data frame? Tip: you can use
min()
and max()
now that the date column has
been converted into date format!
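A possible solution sketch, assuming the `mobility_updated` data frame:

```r
# Earliest and latest observation dates
min(mobility_updated$date)
max(mobility_updated$date)

# Or both at once:
range(mobility_updated$date)
```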
Raw data often contains useful information, but sometimes we need to
go beyond what’s given to make meaningful comparisons.
By creating "derived metrics", we can summarize trends,
spot patterns, and make analysis easier.
Example: Measuring Overall Movement Trends
Instead of looking at each mobility category separately, we can
create a single metric to capture general movement
patterns. Let’s define avg_mobility
, which represents the
average movement across different places (e.g.,
retail_and_recreation
, grocery_and_pharmacy
,
etc.):
Add a new column, called avg_mobility
, to your dataset.
Define it as the mean of all of the "place" columns.
Tip: use the `rowMeans()` function, and ensure `NA`s are not taken into consideration in calculating the averages.

E.g.,
mobility_updated <- mobility_updated %>%
mutate(avg_mobility = rowMeans(cbind(
retail_and_recreation,
grocery_and_pharmacy,
parks,
transit_stations,
workplaces,
residential
), na.rm = TRUE))
# Alternatively, you could avoid typing out all column names inside `cbind()` by selecting the columns by name. `rowMeans()` still calculates, PER ROW of the data, the mean across those columns. That would look like this:
columns <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks', 'transit_stations', 'workplaces', 'residential')
mobility_updated <- mobility_updated %>% mutate(avg_mobility2 = rowMeans(select(., all_of(columns)), na.rm =TRUE))
mobility_updated <- mobility_updated %>% mutate(avg_mobility_wrong = (retail_and_recreation + grocery_and_pharmacy + parks + transit_stations + workplaces + residential)/6)
# While this looks easy to implement (a mean is the sum, divided by the number of data points) - it's incorrect! If any column is NA, the whole sum becomes NA, and the row "drops" out of the calculation. Plus: even if these missing values were ignored, the average wouldn't always be an average of six columns, but of fewer!
So far, we’ve seen that the dataset contains information at three levels: country, province, and city. To make analysis easier, we’ll separate these into three distinct datasets:
To do this, we need to filter the data and store each subset in a new data frame.
💡 How are the levels structured?
We can identify the different levels based on missing values in the
province
and city
columns:
- Country-level records have missing values in both `province` and `city`.
- Province-level records have a value in `province` but empty values in `city`.
- City-level records have values in both `province` and `city`.

This is summarized in the table below, where X indicates the presence of data in a column:
| Aggregation Level | country_region | province | city |
|---|---|---|---|
| Country | X | | |
| Province | X | X | |
| City | X | X | X |
Now, let’s filter the dataset accordingly!
Create three new datasets, `country`, `province`, and `city`, from `mobility`, based on the descriptions above.

To filter missing values (`NA`), use `is.na(column)`, which checks if a column is empty.
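One possible solution, following the table above (a sketch; it assumes you continue with the renamed `mobility_updated` data frame, whose `province` and `city` columns are used in the summaries later in this tutorial):

```r
# Country level: both province and city are missing
country <- mobility_updated %>% filter(is.na(province) & is.na(city))

# Province level: province is present, city is missing
province <- mobility_updated %>% filter(!is.na(province) & is.na(city))

# City level: both province and city are present
city <- mobility_updated %>% filter(!is.na(province) & !is.na(city))
```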
Sometimes, it's useful to write functions to repeat actions efficiently—especially when working with multiple datasets.
Here, we create a function `inspect_data()` that takes a data frame (`df`) and:

- Generates descriptive statistics (`summary()`),
- Reports the number of rows (`nrow()`) and columns (`ncol()`), and
- Displays the date range in the dataset.
Start with the code snippet below and expand it to include all the required information in this exercise.
inspect_data <- function(df) {
cat("Generating descriptive statistics...\n\n")
cat("\n\n") # Add new line
print(summary(df)) # Print Basic summary statistics
# Add more here...
}
inspect_data <- function(df) {
cat("Generating descriptive statistics...\n\n")
cat("\n\n") # Add new line
cat('Summary statistics\n')
print(summary(df))
cat('\n\n')
cat('Number of columns: ')
cat(ncol(df))
cat('\n\n')
cat('Number of observations: ')
cat(nrow(df))
cat('\n\n')
cat('Range of the data:\n')
print(summary(df$date)) # print() is needed here: inside a function, the result of summary() is not displayed automatically
cat('\n\n')
}
inspect_data(country)
## Generating descriptive statistics...
##
##
##
## Summary statistics
## country_region province city place_id
## Length:321 Length:321 Length:321 Length:321
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## date retail_and_recreation grocery_and_pharmacy
## Min. :2020-02-15 Min. :-83.0 Min. :-75.000
## 1st Qu.:2020-05-05 1st Qu.:-34.0 1st Qu.: -7.000
## Median :2020-07-24 Median :-19.0 Median : -4.000
## Mean :2020-07-24 Mean :-19.5 Mean : -4.153
## 3rd Qu.:2020-10-12 3rd Qu.: -3.0 3rd Qu.: 0.000
## Max. :2020-12-31 Max. : 21.0 Max. : 24.000
## parks transit_stations workplaces residential
## Min. :-59.0 Min. :-75.00 Min. :-85.00 Min. :-1.000
## 1st Qu.: 11.0 1st Qu.:-50.00 1st Qu.:-39.00 1st Qu.: 5.000
## Median : 43.0 Median :-44.00 Median :-29.00 Median : 8.000
## Mean : 66.2 Mean :-41.33 Mean :-26.04 Mean : 8.548
## 3rd Qu.:107.0 3rd Qu.:-36.00 3rd Qu.: -7.00 3rd Qu.:11.000
## Max. :286.0 Max. : 10.00 Max. : 24.00 Max. :27.000
##
##
## Number of columns: 11
##
## Number of observations: 321
##
## Range of the data:
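The next block of output presumably stems from running the same function on the province-level data, e.g.:

```r
inspect_data(province)
```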
## Generating descriptive statistics...
##
##
##
## Summary statistics
## country_region province city place_id
## Length:3852 Length:3852 Length:3852 Length:3852
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## date retail_and_recreation grocery_and_pharmacy
## Min. :2020-02-15 Min. :-86.00 Min. :-86.00
## 1st Qu.:2020-05-05 1st Qu.:-32.00 1st Qu.: -8.00
## Median :2020-07-24 Median :-17.00 Median : -3.00
## Mean :2020-07-24 Mean :-15.74 Mean : -2.51
## 3rd Qu.:2020-10-12 3rd Qu.: -1.00 3rd Qu.: 2.00
## Max. :2020-12-31 Max. :142.00 Max. : 75.00
## NA's :19 NA's :3
## parks transit_stations workplaces residential
## Min. :-67.00 Min. :-82.00 Min. :-88.00 Min. :-3.000
## 1st Qu.: 12.00 1st Qu.:-51.00 1st Qu.:-40.00 1st Qu.: 5.000
## Median : 45.00 Median :-40.00 Median :-27.00 Median : 8.000
## Mean : 77.95 Mean :-36.34 Mean :-25.04 Mean : 8.135
## 3rd Qu.:107.00 3rd Qu.:-26.00 3rd Qu.: -8.00 3rd Qu.:11.000
## Max. :728.00 Max. : 74.00 Max. : 31.00 Max. :28.000
## NA's :287 NA's :40 NA's :9
##
##
## Number of columns: 11
##
## Number of observations: 3852
##
## Range of the data:
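The final block of output presumably stems from running the function on the city-level data, e.g.:

```r
inspect_data(city)
```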
## Generating descriptive statistics...
##
##
##
## Summary statistics
## country_region province city place_id
## Length:106068 Length:106068 Length:106068 Length:106068
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## date retail_and_recreation grocery_and_pharmacy
## Min. :2020-02-15 Min. :-97.00 Min. :-96.00
## 1st Qu.:2020-05-04 1st Qu.:-30.00 1st Qu.: -8.00
## Median :2020-07-21 Median :-15.00 Median : -2.00
## Mean :2020-07-24 Mean :-14.77 Mean : -1.24
## 3rd Qu.:2020-10-17 3rd Qu.: -1.00 3rd Qu.: 5.00
## Max. :2020-12-31 Max. :199.00 Max. :222.00
## NA's :47032 NA's :39932
## parks transit_stations workplaces residential
## Min. :-88.00 Min. :-93.00 Min. :-90.0 Min. :-4.00
## 1st Qu.:-14.00 1st Qu.:-52.00 1st Qu.:-41.0 1st Qu.: 6.00
## Median : 11.00 Median :-40.00 Median :-28.0 Median : 8.00
## Mean : 22.01 Mean :-36.97 Mean :-27.4 Mean : 8.89
## 3rd Qu.: 46.00 3rd Qu.:-26.00 3rd Qu.:-16.0 3rd Qu.:12.00
## Max. :629.00 Max. :222.00 Max. : 66.0 Max. :32.00
## NA's :87524 NA's :47898 NA's :6884 NA's :48247
##
##
## Number of columns: 11
##
## Number of observations: 106068
##
## Range of the data:
Observe the high occurrence of missing values at lower aggregation levels (e.g., cities; especially for parks). Further inspection of the documentation tells us that these data gaps are intentional: the data is omitted when it doesn't meet Google's quality and privacy thresholds!
Of course, you could also generate all these statistics manually, by calling summary(), nrow(), and ncol() on each dataset separately.
However, the goal here isn't just to compute summary statistics—it’s about writing functions to make your code more efficient and reusable. Functions help automate repetitive tasks, making your workflow cleaner and easier to maintain!
Missing values can impact analysis, so let's first identify how many are missing before deciding how to handle them. Here's code that shows the percentage of missing values per variable mentioned in `columns`, for each of the three data sets. No worries - the code below is quite complex. Try to read it, but we don't expect you to write code like this at this moment!
columns <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks', 'transit_stations', 'workplaces', 'residential')
missing_summary <- bind_rows(
country %>% summarise(across(all_of(columns), ~ mean(is.na(.)) * 100)) %>% mutate(subset = "country"),
province %>% summarise(across(all_of(columns), ~ mean(is.na(.)) * 100)) %>% mutate(subset = "province"),
city %>% summarise(across(all_of(columns), ~ mean(is.na(.)) * 100)) %>% mutate(subset = "city")
) %>% relocate(subset)
missing_summary %>% kable() # kable() comes from the knitr package: library(knitr)
| subset | retail_and_recreation | grocery_and_pharmacy | parks | transit_stations | workplaces | residential |
|---|---|---|---|---|---|---|
| country | 0.0000000 | 0.0000000 | 0.000000 | 0.000000 | 0.0000000 | 0.00000 |
| province | 0.4932503 | 0.0778816 | 7.450675 | 1.038422 | 0.2336449 | 0.00000 |
| city | 44.3413659 | 37.6475469 | 82.516876 | 45.157823 | 6.4901761 | 45.48686 |
After seeing the summary, you should realize that some datasets have a lot more missing values than others!
Let's assume that we want to replace missing values in the `province` and `city` data sets with their averages across the whole data set.

Please implement this for all columns: `retail_and_recreation`, `grocery_and_pharmacy`, `parks`, `transit_stations`, `workplaces`, `residential`.
# Replace missing values in `province` dataset with column means
province <- province %>%
mutate(
retail_and_recreation = ifelse(is.na(retail_and_recreation), mean(retail_and_recreation, na.rm = TRUE), retail_and_recreation),
grocery_and_pharmacy = ifelse(is.na(grocery_and_pharmacy), mean(grocery_and_pharmacy, na.rm = TRUE), grocery_and_pharmacy),
parks = ifelse(is.na(parks), mean(parks, na.rm = TRUE), parks),
transit_stations = ifelse(is.na(transit_stations), mean(transit_stations, na.rm = TRUE), transit_stations),
workplaces = ifelse(is.na(workplaces), mean(workplaces, na.rm = TRUE), workplaces),
residential = ifelse(is.na(residential), mean(residential, na.rm = TRUE), residential)
)
# Replace missing values in `city` dataset with column means
city <- city %>%
mutate(
retail_and_recreation = ifelse(is.na(retail_and_recreation), mean(retail_and_recreation, na.rm = TRUE), retail_and_recreation),
grocery_and_pharmacy = ifelse(is.na(grocery_and_pharmacy), mean(grocery_and_pharmacy, na.rm = TRUE), grocery_and_pharmacy),
parks = ifelse(is.na(parks), mean(parks, na.rm = TRUE), parks),
transit_stations = ifelse(is.na(transit_stations), mean(transit_stations, na.rm = TRUE), transit_stations),
workplaces = ifelse(is.na(workplaces), mean(workplaces, na.rm = TRUE), workplaces),
residential = ifelse(is.na(residential), mean(residential, na.rm = TRUE), residential)
)
# You could also use the following code - which has a lot of benefits as you don't have to write down all column names over and over again.
# Can you read and understand this code?
# Define the columns to update
columns <- c("retail_and_recreation", "grocery_and_pharmacy", "parks",
"transit_stations", "workplaces", "residential")
# Replace missing values in `province` dataset with overall mean
province <- province %>%
mutate(across(all_of(columns), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Replace missing values in `city` dataset with overall mean
city <- city %>%
mutate(across(all_of(columns), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
Data Visualization with ggplot2
Now that our data is clean and structured, it’s time to visualize it!
Instead of using the base R plot()
function, we will use
ggplot2
, a powerful and flexible plotting
package.
Why `ggplot2`?

`ggplot2` is part of the tidyverse and follows a layered approach to building plots, and it integrates seamlessly with `dplyr` pipelines.

`ggplot2` is widely popular, and creates much better visualizations than the built-in R plots such as `plot()` or `hist()`.
The Basics of `ggplot2`

A `ggplot2` plot is built in layers:

- The data: `ggplot(data = your_data)`
- The aesthetics: `aes(x = ..., y = ...)`
- The geometry (`geom_`): `geom_line()`, `geom_point()`, etc.

One set of data points

We'll start by plotting percentage changes in park visits over time using `geom_line()`.
library(ggplot2)
ggplot(data = country, aes(x = date, y = parks)) +
geom_line(color = "black") +
labs(title = "Changes in Visits to Parks Over Time",
x = "Date",
y = "% Change from Baseline") +
theme_minimal()
Multiple data points
Now, let’s compare two trends in the same plot—time
spent at home (residential
) vs. at
work (workplaces
).
ggplot(data = country) +
geom_line(aes(x = date, y = residential, color = "Residential")) +
geom_line(aes(x = date, y = workplaces, color = "Workplace")) +
scale_color_manual(values = c("Residential" = "red", "Workplace" = "blue")) +
labs(title = "Less Time at Work, More Time at Home",
x = "Date",
y = "% Change from Baseline",
color = "Legend") +
theme_minimal()
- `ggplot(data = country)` → Uses `country` as the dataset
- `aes(x = date, y = parks)` → Maps date to the x-axis and parks to the y-axis
- `geom_line(color = "black")` → Draws a black line for the parks data
- `geom_line(aes(y = residential, color = "Residential"))` → Adds a second line for residential data
- `scale_color_manual()` → Defines colors for different lines
- `labs()` → Adds titles, labels, and legends
- `theme_minimal()` → Uses a cleaner theme for better readability

Now, try customizing these plots by changing colors, adding more categories, or adjusting labels! 🚀
Please create a time series chart, in which you plot the time spent
at home (residential
) vs. at
work (workplaces
) for the province of
"North Brabant", using the province
data. Remember to
adjust the title of the plot!
province %>% filter(province == 'North Brabant') %>% ggplot() +
geom_line(aes(x = date, y = residential, color = "Residential")) +
geom_line(aes(x = date, y = workplaces, color = "Workplace")) +
scale_color_manual(values = c("Residential" = "red", "Workplace" = "blue")) +
labs(title = "Time at Home vs. at Work in North Brabant",
x = "Date",
y = "% Change from Baseline",
color = "Legend") +
theme_minimal()
Bar charts are great for comparing categories in a dataset. Instead of showing how values change over time (like line charts), bar charts visualize differences between groups at a single point in time.
When to Use a Bar Chart?
Basics of `ggplot2` Bar Charts

A bar chart in `ggplot2` typically follows this structure:
1️⃣ Define the dataset → `ggplot(data = your_data)`
2️⃣ Set aesthetics (`aes()`) → `aes(x = category, y = value)`
3️⃣ Choose a bar geometry → `geom_col()` or `geom_bar()`
4️⃣ Customize (titles, colors, labels, etc.)
Example: Visits to Different Places
Let's create a bar chart showing the average mobility
change across different locations using the
country
data set.
library(ggplot2)
library(dplyr)
library(tidyr) # needed for pivot_longer()
# Pivot the country dataset to long format
country_long <- country %>%
pivot_longer(cols = c(retail_and_recreation, grocery_and_pharmacy, parks,
transit_stations, workplaces, residential),
names_to = "place_category",
values_to = "mobility_change")
# View the transformed data
head(country_long)
## # A tibble: 6 × 7
## country_region province city place_id date place_category
## <chr> <chr> <chr> <chr> <date> <chr>
## 1 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_… 2020-02-15 retail_and_re…
## 2 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_… 2020-02-15 grocery_and_p…
## 3 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_… 2020-02-15 parks
## 4 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_… 2020-02-15 transit_stati…
## 5 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_… 2020-02-15 workplaces
## 6 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_… 2020-02-15 residential
## # ℹ 1 more variable: mobility_change <dbl>
# Aggregate data
country_summary <- country_long %>% group_by(place_category) %>% summarize(avg_mobility_change = mean(mobility_change, na.rm=T))
# Create bar chart
country_summary %>% ggplot(aes(x = place_category, y = avg_mobility_change, fill = place_category)) +
geom_col() +
labs(title = "Average Mobility Change by Place",
x = "Place Category", y = "% Change from Baseline") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate labels for readability
In this exercise, we will create a grouped bar chart to compare the average mobility change for different provinces. Instead of focusing on just one province, we will compare multiple provinces side by side.
To help you get started, we have already created the "correct" data set you can use for plotting, called `province_summary`.
# Pivot the province dataset to long format
province_long <- province %>%
pivot_longer(cols = c(retail_and_recreation, grocery_and_pharmacy, parks,
transit_stations, workplaces, residential),
names_to = "place_category",
values_to = "mobility_change")
# Aggregate data
province_summary <- province_long %>% group_by(province, place_category) %>% summarize(avg_mobility_change = mean(mobility_change, na.rm=T))
## `summarise()` has grouped output by 'province'. You can override using the
## `.groups` argument.
Tips:

- Use `ggplot2` with `geom_col()` to compare categories across provinces.
- Use `facet_wrap(~province)` to create separate charts for each province.

province_summary %>% ggplot(aes(x = place_category, y = avg_mobility_change, fill = place_category)) +
geom_col(position = "dodge") +
facet_wrap(~province) +
labs(title = "Average Mobility Change by Place Category Across Provinces",
x = "Place Category",
y = "% Change from Baseline",
fill = "Place Category") +
theme_minimal() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
On a final note: you can conveniently save your plots using the
function ggsave()
, directly after plotting.
## Saving 7 x 5 in image
This snippet will create plot.pdf
in your current
working directory!
Congratulations! You've worked through a full data exploration and visualization pipeline in R. Here are the key takeaways from this tutorial:
Data Handling & Cleaning

- Inspecting data with `read_csv()`, `summary()`, and `head()`.
- Filtering (`filter()`) and selecting (`select()`) to focus on relevant data.
- Creating new columns (`mutate()`) to derive meaningful metrics like `avg_mobility`.

Writing Efficient & Reusable Code

- Writing functions (e.g., `inspect_data()`) to automate repetitive tasks.
- Using `pivot_longer()` for reshaping datasets.

Data Visualization with ggplot2

- Line charts (`geom_line()`) to track changes over time.
- Bar charts (`geom_col()`) to compare mobility changes across locations.

Next Steps? 🚀
Well done, and happy coding! 🎉