Hannes Datta
Introduction
Part 1: RMarkdown for exploring new data
Part 2: ggplot2 for plotting [updated this year]
After class
In today's session, we create our own RMarkdown document.
.Rmd file
We will continue when you're done.
Navigating an RMarkdown document
echo, include parameters
We'll make use of the Google Mobility Datasets from The Netherlands.
1) In the tutorial on the site, you can either download the data from my server or get the data directly from Google (try downloading yourself). If downloading yourself, watch out for “conversion” issues.
2) In class (to smoothen things…), please just run the code snippet below:
library(tidyverse)
mobility <- read_csv('https://raw.githubusercontent.com/hannesdatta/course-dprep/refs/heads/master/content/docs/modules/week2/tutorial/2020_NL_Region_Mobility_Report.csv')
3) Exploring the data: use head(), tail(), and View()/glimpse()
glimpse() gives you an overview of data types and first values
class(mobility$date) to inspect classes (e.g., dates, numerics, characters)
summary() command.
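The exploration functions listed above can be tried on a small table first. The sketch below uses a hypothetical three-row tibble (`toy`, with made-up values) in place of the real mobility data, so it runs without a download.

```r
library(tidyverse)

# Hypothetical toy data standing in for the mobility data (values are made up)
toy <- tibble(
  date = as.Date(c("2020-02-15", "2020-02-16", "2020-02-17")),
  sub_region_1 = c(NA, "North Brabant", "Utrecht"),
  parks_percent_change_from_baseline = c(5, NA, -12)
)

first_rows <- head(toy, 2)   # first two rows
last_row   <- tail(toy, 1)   # last row
glimpse(toy)                 # column types and first values
date_class <- class(toy$date)
date_class                   # "Date"
summary(toy)                 # per-column summaries, including NA counts
```

View() is left out here because it opens an interactive viewer and is best run from RStudio.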
retail_and_recreation_percent_change_from_baseline: what does it mean?
Please use the dplyr commands select() (column selection) and rename().
Drop: country_region_code, metro_area, iso_3166_2_code, census_fips_code
Rename: sub_region_1 to province, and sub_region_2 to city
# dropping
mobility <- mobility %>% select(-c(country_region_code, metro_area, iso_3166_2_code, census_fips_code))
# renaming
mobility <- mobility %>%
rename(province = sub_region_1,
city = sub_region_2,
retail_and_recreation = retail_and_recreation_percent_change_from_baseline,
grocery_and_pharmacy = grocery_and_pharmacy_percent_change_from_baseline,
parks = parks_percent_change_from_baseline,
transit_stations = transit_stations_percent_change_from_baseline,
workplaces = workplaces_percent_change_from_baseline,
residential = residential_percent_change_from_baseline)
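When dropping several columns with select(), a minus sign applies only to the name it directly precedes, so the columns to drop are best wrapped in -c(...). The sketch below illustrates this on a hypothetical toy tibble; the column names a, b, and c are placeholders and not part of the mobility data.

```r
library(tidyverse)

# Hypothetical toy data; a, b, c are placeholder column names
toy <- tibble(a = 1:2, b = 3:4, c = 5:6, sub_region_1 = c("x", "y"))

# Only `a` is negated here, so `b` stays in the result
only_a_dropped <- toy %>% select(-a, b)
names(only_a_dropped)

# Wrapping the names in -c(...) drops all of them; rename() uses new = old
cleaned <- toy %>%
  select(-c(a, b)) %>%
  rename(province = sub_region_1)
names(cleaned)  # "c" "province"
```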
We want to understand how the data is coded; let's take a look at unique combinations of country_region, province, and city.
overview <- mobility %>% group_by(country_region, province, city) %>% count()
DO: Run the snippet and explore it using View().
province only contains the observations for a province in NL (city must be missing, but province not)
country only contains observations for NL (city and province must be missing)
dplyr verbs filter(), the piping character (%>%), and the “NA” detector function is.na()
Task: Create the two data sets: province and country.
Please run this snippet to be able to continue the tutorial if you have not solved it yourself.
country <- mobility %>%
filter(is.na(city) & is.na(province))
province <- mobility %>%
filter(is.na(city) & !is.na(province))
# Note: the exclamation mark (!) negates a TRUE into a FALSE (and the other way around)
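To see how the two filters split the rows, it may help to run them on a hypothetical three-row tibble with one country-level, one province-level, and one city-level observation. The values and the city name are made up for illustration.

```r
library(tidyverse)

# Hypothetical toy data: one row per aggregation level
toy <- tibble(
  province = c(NA, "Utrecht", "Utrecht"),
  city     = c(NA, NA, "Amersfoort"),
  parks    = c(10, 20, 30)
)

# Country level: both city and province are missing
country_toy <- toy %>% filter(is.na(city) & is.na(province))

# Province level: city is missing, province is not
province_toy <- toy %>% filter(is.na(city) & !is.na(province))

# City level (not used in the tutorial, shown for completeness)
city_toy <- toy %>% filter(!is.na(city))

nrow(country_toy)   # 1
nrow(province_toy)  # 1
nrow(city_toy)      # 1
```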
summary()
sum(is.na(columnname))
country dataset using summary()
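As a small illustration of counting missing values, the sketch below applies sum(is.na()) to one column and colSums(is.na()) to all columns at once. The tibble `toy` and its values are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data with a few missing values
toy <- tibble(
  parks      = c(1, NA, 3, NA),
  workplaces = c(NA, 2, 3, 4)
)

# Missing values in a single column
n_missing_parks <- sum(is.na(toy$parks))
n_missing_parks  # 2

# Missing values per column: is.na() on a data frame returns a logical matrix
na_counts <- colSums(is.na(toy))
na_counts        # parks: 2, workplaces: 1
```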
mutate: create new or edit existing column
ifelse(X, A, B) does action A or B, depending on X
is.na(col) checks for missing values in a column
# toybox example, does not run (because this particular data doesn't exist)
new_data <- old_data %>%
  mutate(new_variable = ifelse(is.na(old_variable),
                               mean(old_variable, na.rm = TRUE),
                               old_variable))
Please replace missing values with their mean for retail_and_recreation and other “place” columns in the province dataset.
province_updated <- province %>%
mutate(
retail_and_recreation = ifelse(
is.na(retail_and_recreation),
mean(retail_and_recreation, na.rm = TRUE),
retail_and_recreation
),
grocery_and_pharmacy = ifelse(
is.na(grocery_and_pharmacy),
mean(grocery_and_pharmacy, na.rm = TRUE),
grocery_and_pharmacy
),
parks = ifelse(
is.na(parks),
mean(parks, na.rm = TRUE),
parks
),
transit_stations = ifelse(
is.na(transit_stations),
mean(transit_stations, na.rm = TRUE),
transit_stations
),
workplaces = ifelse(
is.na(workplaces),
mean(workplaces, na.rm = TRUE),
workplaces
),
residential = ifelse(
is.na(residential),
mean(residential, na.rm = TRUE),
residential
)
)
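The same replacement can be written more compactly with across(), which applies one function to several columns. This is a sketch of an alternative, not the tutorial's official solution. It runs on a hypothetical toy tibble, and the vector `place_cols` is a helper name introduced here for illustration.

```r
library(tidyverse)

# Hypothetical toy data with two "place" columns
toy <- tibble(
  province   = c("A", "A", "B", "B"),
  parks      = c(10, NA, 30, 20),
  workplaces = c(NA, -10, -20, -30)
)

# Assumed helper: the columns to impute
place_cols <- c("parks", "workplaces")

# Replace each NA by the mean of its column
# (the overall mean across all rows, as in the longer version above)
toy_updated <- toy %>%
  mutate(across(all_of(place_cols),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

toy_updated$parks[2]       # 20
toy_updated$workplaces[1]  # -20
```

For the real data, `place_cols` would list the six renamed place columns. Adding group_by(province) before mutate() would use province-specific means instead, which is a different modelling choice.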
Why?
ggplot2 (better than built-in functionality)
dplyr pipelines
Steps
ggplot(data = your_data) or data %>% ggplot()
aes(x = ..., y = ...)
geom_ layers: geom_line(), geom_point(), geom_col() (bar chart), geom_histogram() (histogram)
library(ggplot2)
ggplot(data = country) +
geom_line(aes(x = date, y = retail_and_recreation),
color = "black") +
labs(title = "Changes in Visits to Retail Areas Over Time",
x = "Date",
y = "% Change from Baseline") +
theme_minimal()
Your tasks:
theme_light() instead of theme_minimal()
retail_and_recreation to parks + change title of plot.
We can save plots to a file by calling this command immediately after creating a plot.
ggsave()
ggsave(filename = 'plot.pdf') or ggsave(filename = 'plot.png') - pick any format
width, height, dpi, bg (background), etc.
dir.create() if needed
unlink('directory_name/*.*') (be careful)
library(ggplot2)
ggplot(data = country) +
geom_line(aes(x = date,
y = retail_and_recreation,
color = 'Retail')) +
geom_line(aes(x = date,
y = parks,
color = 'Parks')) +
scale_color_manual(values = c("Retail" = "red",
"Parks" = "blue")) +
labs(title = "Time spent at retail vs. parks",
x = "Date",
y = "% Change from Baseline",
color = "Locations") +
theme_minimal()
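A plot like the one above can be written to disk with ggsave(). The sketch below stores a plot in an object and saves it to a folder created with dir.create(). The toy data, the object name `p`, and the output folder (placed under tempdir() so nothing is written to the project) are assumptions made for this example.

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  date  = as.Date("2020-03-01") + 0:4,
  parks = c(0, 5, -10, -20, -15)
)

# Store the plot in an object so it can be passed to ggsave() explicitly
p <- ggplot(data = toy) +
  geom_line(aes(x = date, y = parks)) +
  theme_minimal()

# Assumed output folder; created only if it does not exist yet
out_dir <- file.path(tempdir(), "plots")
dir.create(out_dir, showWarnings = FALSE)

# The file extension picks the format; use .png for a raster image
out_file <- file.path(out_dir, "parks.pdf")
ggsave(filename = out_file, plot = p, width = 8, height = 5, bg = "white")
```

Without the plot argument, ggsave() saves the last plot that was displayed, which is why it can be called immediately after creating a plot.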
DO: Please add a third series: time spent at work (color: black).
library(ggplot2)
ggplot(data = country) +
geom_line(aes(x = date,
y = retail_and_recreation,
color = 'Retail')) +
geom_line(aes(x = date,
y = parks,
color = 'Parks')) +
geom_line(aes(x = date,
y = workplaces,
color = 'Work')) +
scale_color_manual(values = c("Retail" = "red",
"Parks" = "blue",
"Work" = "black")) +
labs(title = "Time spent at different locations",
x = "Date",
y = "% Change from Baseline",
color = "Locations") +
theme_minimal()
Sometimes it makes sense to group our plots “automatically”, such as by province!
province_updated %>% ggplot() +
geom_line(aes(x = date,
y = retail_and_recreation,
color = 'Retail')) +
geom_line(aes(x = date,
y = parks,
color = 'Parks')) +
facet_wrap(~province) + # THIS LINE IS NEW
scale_color_manual(values = c("Retail" = "red",
"Parks" = "blue")) +
labs(title = "Time spent at retail vs. parks",
x = "Date",
y = "% Change from Baseline",
color = "Locations") +
theme_minimal()
To work on this, let's first run this cell.
province_long <- province_updated %>%
select(
province, date,
retail_and_recreation, grocery_and_pharmacy,
parks, transit_stations, workplaces, residential
) %>%
pivot_longer(
cols = c(
retail_and_recreation, grocery_and_pharmacy,
parks, transit_stations, workplaces, residential
),
names_to = "place_category",
values_to = "mobility_change"
)
This converts your “wide” data into a “long” data set (you will learn more about this in week 4).
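To make the wide-to-long conversion concrete, here is a minimal sketch on a hypothetical two-row tibble (values are made up). Each wide row becomes one long row per place column.

```r
library(tidyverse)

# Hypothetical wide data: one row per province, one column per place
wide <- tibble(
  province   = c("Utrecht", "Zeeland"),
  parks      = c(10, 20),
  workplaces = c(-30, -40)
)

# Long data: column names move into place_category, values into mobility_change
long <- wide %>%
  pivot_longer(
    cols = c(parks, workplaces),
    names_to = "place_category",
    values_to = "mobility_change"
  )

nrow(long)   # 4 (2 provinces x 2 place columns)
names(long)  # "province" "place_category" "mobility_change"
```

The long layout is what lets a single color = place_category mapping draw one line per place.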
province_long %>% filter(province == 'North Brabant') %>%
ggplot(aes(x=date,
y=mobility_change,
color=place_category)) +
geom_line() +
ggtitle('Mobility in North Brabant') +
theme_minimal()
The plot now shows all locations without explicitly naming them in the code as before.
We can also add a facet wrap, as in extension 2.
province_long %>% # no filter this time - show all provinces!
ggplot(aes(x=date,
y=mobility_change,
color=place_category)) +
geom_line() +
facet_wrap(~province)+
theme_minimal()
The plot now shows all locations per province.
Think of this as something you do at the end of every work cycle.
Clean up your script(s)!
Rscript your_r_script.R or R --vanilla < your_r_script.R
Remember the workflow from the start of today's session?
We've covered step 1! The concepts and programming knowledge extend to subsequent parts.