In this tutorial, we'll walk you through the steps of exploring a data set and obtaining an understanding of the setting in which the data was collected. After completing the tutorial, you'll be able to produce an RMarkdown document, which you can render as a PDF or HTML document (e.g., to show others, initiate a discussion about the data, etc.).
We've chosen a data set used by Sim et al. 2022, in which the authors estimate the effect that COVID-19 had on music consumption on Spotify!
Along the way, you'll practice the skills you have been equipped with in the Datacamp tutorials, such as summary(), table(), and plot().

For technical issues outside of scheduled classes, please check the support section on the course website.
Google's COVID-19 Community Mobility Reports are used extensively by governments and researchers to get an understanding of the pandemic's effect on social life and the economy. In the data, Google provides information on daily percentage changes in the number of visits by place category, compared with the corresponding day of the week during the 5-week (pre-Covid) period between January 3 and February 6, 2020. The data consists of many observations that cover six categories of places: retail & recreation, grocery & pharmacy, parks, transit stations, workplaces, and residential.
With the help of readr's read_csv() function (readr is loaded as part of the tidyverse), we can now import the mobility data for the Netherlands into R. Download the Region CSVs archive from Google (you have to unzip it first!), create a new R script in the same directory, and read in 2020_NL_Region_Mobility_Report.csv.
Tip: If R doesn't find your data, set your working directory properly (we typically click on SESSION --> SET WORKING DIRECTORY --> TO SOURCE FILE LOCATION). Hardcoding directories in your code (e.g., with setwd()) is not good practice.
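If you do want a path in your code, prefer a path that is relative to your script over an absolute setwd(). A minimal sketch (the data subfolder here is an illustrative assumption, not part of the tutorial's setup):

```r
# file.path() joins path components portably, without hardcoding absolute paths;
# "data" is a hypothetical subfolder sitting next to your script.
data_file <- file.path("data", "2020_NL_Region_Mobility_Report.csv")
data_file
# [1] "data/2020_NL_Region_Mobility_Report.csv"
```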
library(tidyverse) # provides read_csv() (from readr) and the dplyr verbs used below
mobility <- read_csv("2020_NL_Region_Mobility_Report.csv")
## Rows: 110241 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): country_region_code, country_region, sub_region_1, sub_region_2, i...
## dbl (6): retail_and_recreation_percent_change_from_baseline, grocery_and_ph...
## lgl (2): metro_area, census_fips_code
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Alternatively, you can use the data set stored on the course website.
# Note that read_csv() always assumes comma-separated values; it has no sep argument.
# mobility <- read_csv('https://media.githubusercontent.com/media/hannesdatta/course-dprep/master/content/docs/modules/week3/tutorial/2020_NL_Region_Mobility_Report.csv')
Exercise 1
Inspect the mobility data frame, and describe the structure of the data set in your own words.
- Which column names does the data have?
- How are variables/metrics operationalized?
- What is the unit of analysis at which the data is provided ("what uniquely identifies a row in this dataset?")?
- How is the data sorted (if at all)?
- Are the values contained in each column complete? Do these values all make sense?
- Are "strings" properly encoded (i.e., do they contain any weird characters, or do they seem appropriately rendered so you can read them)?
Tips:
- Take a look at the data using head(mobility) or View(mobility). If you'd like to view more rows with head(), use head(mobility, 100) (for, e.g., the first 100 rows); tail() gives you the last rows of the data.
- The command summary(mobility) generates descriptive statistics for all variables in the data. You can also use this command on individual columns (e.g., summary(mobility$retail_and_recreation_percent_change_from_baseline)).
- Character or factor columns are best inspected using the table() command.
head(mobility)
## # A tibble: 6 × 15
## country_region_code country_region sub_region_1 sub_region_2 metro_area
## <chr> <chr> <chr> <chr> <lgl>
## 1 NL Netherlands <NA> <NA> NA
## 2 NL Netherlands <NA> <NA> NA
## 3 NL Netherlands <NA> <NA> NA
## 4 NL Netherlands <NA> <NA> NA
## 5 NL Netherlands <NA> <NA> NA
## 6 NL Netherlands <NA> <NA> NA
## # ℹ 10 more variables: iso_3166_2_code <chr>, census_fips_code <lgl>,
## # place_id <chr>, date <date>,
## # retail_and_recreation_percent_change_from_baseline <dbl>,
## # grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## # parks_percent_change_from_baseline <dbl>,
## # transit_stations_percent_change_from_baseline <dbl>,
## # workplaces_percent_change_from_baseline <dbl>, …
summary(mobility)
## country_region_code country_region sub_region_1 sub_region_2
## Length:110241 Length:110241 Length:110241 Length:110241
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## metro_area iso_3166_2_code census_fips_code place_id
## Mode:logical Length:110241 Mode:logical Length:110241
## NA's:110241 Class :character NA's:110241 Class :character
## Mode :character Mode :character
##
##
##
##
## date retail_and_recreation_percent_change_from_baseline
## Min. :2020-02-15 Min. :-97.00
## 1st Qu.:2020-05-04 1st Qu.:-30.00
## Median :2020-07-21 Median :-15.00
## Mean :2020-07-24 Mean :-14.86
## 3rd Qu.:2020-10-17 3rd Qu.: -1.00
## Max. :2020-12-31 Max. :199.00
## NA's :47051
## grocery_and_pharmacy_percent_change_from_baseline
## Min. :-96.00
## 1st Qu.: -8.00
## Median : -2.00
## Mean : -1.32
## 3rd Qu.: 5.00
## Max. :222.00
## NA's :39935
## parks_percent_change_from_baseline
## Min. :-88.00
## 1st Qu.:-10.00
## Median : 16.00
## Mean : 31.53
## 3rd Qu.: 56.00
## Max. :728.00
## NA's :87811
## transit_stations_percent_change_from_baseline
## Min. :-93.00
## 1st Qu.:-52.00
## Median :-40.00
## Mean :-36.95
## 3rd Qu.:-26.00
## Max. :222.00
## NA's :47938
## workplaces_percent_change_from_baseline
## Min. :-90.00
## 1st Qu.:-41.00
## Median :-28.00
## Mean :-27.31
## 3rd Qu.:-16.00
## Max. : 66.00
## NA's :6893
## residential_percent_change_from_baseline
## Min. :-4.00
## 1st Qu.: 6.00
## Median : 8.00
## Mean : 8.84
## 3rd Qu.:12.00
## Max. :32.00
## NA's :48247
table(mobility$sub_region_1)
##
## Drenthe Flevoland Friesland Gelderland Groningen
## 3922 2127 5303 15372 4701
## Limburg North Brabant North Holland Overijssel South Holland
## 9134 18029 14189 7573 18115
## Utrecht Zeeland
## 7495 3960
# The dataset shows the mobility changes at the national, regional, and city level in the Netherlands, for multiple categories of places. The records are sorted by date, and are grouped by location (starting with the country-level stats).
# Moreover, three of the columns (`metro_area`, `iso_3166_2_code`, `census_fips_code`) contain only empty data, and two other columns (`sub_region_1`, `sub_region_2`) are blank for a subset of the rows. The same holds for the columns with percentage changes in mobility, which are missing in some cases (especially at the city level).
Exercise 2
Suppose you'd like to extract data for the Netherlands as a whole (i.e., NOT at the regional or city level). How would you have to filter for it? Re-generate a data summary (summary()) on this new table, and inspect missing values.
filtered_df <- mobility %>% filter(is.na(sub_region_1) & is.na(sub_region_2))
summary(filtered_df)
## country_region_code country_region sub_region_1 sub_region_2
## Length:321 Length:321 Length:321 Length:321
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## metro_area iso_3166_2_code census_fips_code place_id
## Mode:logical Length:321 Mode:logical Length:321
## NA's:321 Class :character NA's:321 Class :character
## Mode :character Mode :character
##
##
##
## date retail_and_recreation_percent_change_from_baseline
## Min. :2020-02-15 Min. :-83.0
## 1st Qu.:2020-05-05 1st Qu.:-34.0
## Median :2020-07-24 Median :-19.0
## Mean :2020-07-24 Mean :-19.5
## 3rd Qu.:2020-10-12 3rd Qu.: -3.0
## Max. :2020-12-31 Max. : 21.0
## grocery_and_pharmacy_percent_change_from_baseline
## Min. :-75.000
## 1st Qu.: -7.000
## Median : -4.000
## Mean : -4.153
## 3rd Qu.: 0.000
## Max. : 24.000
## parks_percent_change_from_baseline
## Min. :-59.0
## 1st Qu.: 11.0
## Median : 43.0
## Mean : 66.2
## 3rd Qu.:107.0
## Max. :286.0
## transit_stations_percent_change_from_baseline
## Min. :-75.00
## 1st Qu.:-50.00
## Median :-44.00
## Mean :-41.33
## 3rd Qu.:-36.00
## Max. : 10.00
## workplaces_percent_change_from_baseline
## Min. :-85.00
## 1st Qu.:-39.00
## Median :-29.00
## Mean :-26.04
## 3rd Qu.: -7.00
## Max. : 24.00
## residential_percent_change_from_baseline
## Min. :-1.000
## 1st Qu.: 5.000
## Median : 8.000
## Mean : 8.548
## 3rd Qu.:11.000
## Max. :27.000
The number of missing values on most variables has drastically decreased. That's good news! It seems like the data is most "complete" at the national level.
Exercise 3
Let's now start zooming in on some of the metrics in the data to check
what they really mean.
The data generally shows the percentage changes in visits to categorized places, compared to a normal value for that day of the week (i.e., the baseline). The baseline is the median value for the corresponding day of the week during the 5-week period between January 3 and February 6, 2020. In other words, all reported numerical values are relative to the first few weeks of 2020.
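To make the baseline arithmetic concrete, here is a tiny sketch with made-up visit counts (the numbers are purely illustrative and not taken from the actual report):

```r
# Hypothetical example: suppose the baseline (median) number of park visits
# for a given weekday is 200, and on some pandemic day there are 150 visits.
baseline_visits <- 200
todays_visits <- 150

# The report expresses the change as a percentage relative to the baseline:
percent_change <- (todays_visits - baseline_visits) / baseline_visits * 100
percent_change
# [1] -25   (i.e., 25% fewer visits than on a comparable pre-Covid day)
```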
First, try to understand the "range" of the values. For example, percentages are commonly scaled between 0 and 1 (e.g., such that .50 means +50%), but sometimes also between 0 and 100 (e.g., such that 50 means 50%). What applies here?
Second, think of one or more problems that can arise as a result of how percentage changes were calculated (i.e., the choice for the baseline between January 3 and February 6, 2020). Come up with specific examples that illustrate problems!
### Part 1 ###
# Exploring the range
summary(mobility$retail_and_recreation_percent_change_from_baseline)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -97.00 -30.00 -15.00 -14.86 -1.00 199.00 47051
# --> the range seems to be measured on a scale on which 100 means 100%.
### Part 2 ###
# Problems that could arise from the baseline choice:
# - The data does not account for seasonality (e.g., park visits increase as the weather improves)
# - In the Netherlands, we experienced a relatively mild winter in early 2020 (e.g., the average temperature in January 2020 was the 3rd highest ever recorded). As a result, the baseline values for certain categories (e.g., parks, residential) may not be representative of a typical year.
# - Probably a lot of other reasons! Try to think about more!
Next, let's narrow the data down to a smaller set that we would like to concentrate on. For example, the raw data may contain columns we don't need or want, or specific subsets that we want to get rid of.

To make working with the data frame more manageable, we drop the columns we don't need. More specifically, we overwrite the data frame with a selection of the necessary columns (e.g., df = df[, c("col_1", "col_2", "col_3", ...)]).
Exercise 4
Drop the columns country_region_code, metro_area, iso_3166_2_code, and census_fips_code from the mobility data frame. How many columns do you end up with?
# bad solution - think about WHY? (hint: it relies on column positions, which may change)
mobility[, c(2:4, 8:15)]
# good solution
cols_to_drop <- c('country_region_code', 'metro_area', 'iso_3166_2_code', 'census_fips_code')
mobility <- mobility %>% select(-all_of(cols_to_drop)) # all_of() avoids the tidyselect deprecation warning triggered by selecting with a bare external vector
head(mobility)
## # A tibble: 6 × 11
## country_region sub_region_1 sub_region_2 place_id date
## <chr> <chr> <chr> <chr> <date>
## 1 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-15
## 2 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-16
## 3 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-17
## 4 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-18
## 5 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-19
## 6 Netherlands <NA> <NA> ChIJu-SH28MJxkcRnwq9_851o… 2020-02-20
## # ℹ 6 more variables: retail_and_recreation_percent_change_from_baseline <dbl>,
## # grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## # parks_percent_change_from_baseline <dbl>,
## # transit_stations_percent_change_from_baseline <dbl>,
## # workplaces_percent_change_from_baseline <dbl>,
## # residential_percent_change_from_baseline <dbl>
# how many columns do we retain?
ncol(mobility)
## [1] 11
We continue with the (updated) data frame mobility. As you can see below, some of the remaining column names of mobility are rather long (or can be described more precisely).
# current column names
colnames(mobility)
## [1] "country_region"
## [2] "sub_region_1"
## [3] "sub_region_2"
## [4] "place_id"
## [5] "date"
## [6] "retail_and_recreation_percent_change_from_baseline"
## [7] "grocery_and_pharmacy_percent_change_from_baseline"
## [8] "parks_percent_change_from_baseline"
## [9] "transit_stations_percent_change_from_baseline"
## [10] "workplaces_percent_change_from_baseline"
## [11] "residential_percent_change_from_baseline"
You can rename the columns of a data frame (df) by overwriting the current column names with a vector of new column names, like this:
colnames(df) = c("new_col_name_1", "new_col_name_2", "new_col_name_3", ...)
Exercise 5
Trim the long column names (e.g., retail_and_recreation_percent_change_from_baseline becomes retail_and_recreation), and replace sub_region_1 and sub_region_2 by province and city, respectively.
# This solution works but is suboptimal, as the column order of the input data may change (and hence, the renamed columns would no longer match).
colnames(mobility) = c("country_region",
"province",
"city",
"place_id",
"date",
"retail_and_recreation",
"grocery_and_pharmacy",
"parks",
"transit_stations",
"workplaces",
"residential")
# First use the rename() function to rename sub_region_1 & sub_region_2. After that, use rename_with() to remove the "_percent_change_from_baseline" suffix from the remaining columns. (str_remove() comes from stringr, also part of the tidyverse.)
mobility_updated <- mobility %>% rename(province = sub_region_1,
city = sub_region_2) %>%
rename_with(~str_remove(., '_percent_change_from_baseline'))
Although the dates may look like "regular" dates (e.g., 2020-02-15), R sometimes treats them like character strings (try running class(mobility$date) for yourself!). When working with dplyr/tidyverse, though, dates are mostly imported correctly.

If they are still characters rather than dates, you need to convert the date column into date format. Date conversion can cause you a mild headache, and it gets even more complex with varying date formats across different geographic regions.
This is how to convert a character-encoded date to a "real date":
mobility$date <- as.Date(mobility$date)
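If your dates arrive in a non-ISO format instead (e.g., day-month-year, common in Europe), you can pass a format string to as.Date(). The date strings below are made up for illustration:

```r
# ISO dates ("year-month-day") are parsed without extra arguments:
as.Date("2020-02-15")

# Other formats need an explicit format string
# (%d = day, %m = month as a number, %Y = four-digit year):
as.Date("15-02-2020", format = "%d-%m-%Y")

# Once converted, dates support arithmetic:
as.Date("2020-12-31") - as.Date("2020-02-15")
# Time difference of 320 days
```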
Exercise 6
Now let's start calculating with our dates. What are the first and last dates in our data frame? Tip: you can use min() and max() now that the date column has been converted into date format!

The data runs from February 15, 2020 (min(mobility$date)) to December 31, 2020 (max(mobility$date)).
Observe above that you can use R code not only in code cells, but also in text, by enclosing it in special characters: the date 2020-02-15 above was generated by inline R code.
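In the RMarkdown source, such inline output is written with a backtick, the letter r, the expression, and a closing backtick. A sketch of how the sentence above could be written (assuming the date column has already been converted to dates):

```
The data runs from `r min(mobility$date)` to `r max(mobility$date)`.
```

When the document is knitted, the inline expressions are replaced by their evaluated values.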
Let's start generating some "derived metrics" - i.e., metrics that are based on the raw data, but that were not part of the actual raw data. Think of this as a first attempt to operationalize some variables that you're interested in.
For example, to get a proxy for overall movement trends, we could add avg_mobility to the data, which we define as the (row) mean of retail_and_recreation, grocery_and_pharmacy, parks, transit_stations, workplaces, and residential.
Exercise 7
Add the avg_mobility column to your dataset (keep in mind that the data frame may contain one or more missing values). You may want to look up the documentation of the na.rm parameter.
# solution (most elegant and sustainable)
columns <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks', 'transit_stations', 'workplaces', 'residential')
mobility_updated <- mobility_updated %>% mutate(avg_mobility2 = rowMeans(select(., all_of(columns)), na.rm =TRUE))
# prone to errors: relies on the column positions being correct
avg_mobility_array <- mobility[c(6, 7, 8, 9, 10, 11)]
# this is inefficient, too, because it does not make use of the dplyr functions (and hence, is considerably slower on large datasets)
mobility$avg_mobility <- rowMeans(avg_mobility_array, na.rm = TRUE)
# The answer above, just like the one for exercise 4, relies on counting column positions, which is error-prone if we add or remove a column!
# solution (also not sustainable, as column positions could change!!!)
mobility$avg_mobility <- rowMeans(mobility[, 6:ncol(mobility)], na.rm = TRUE)
# solution (least elegant - and far less efficient!!!) - do you notice the difference in running time?
for(counter in 1:nrow(mobility)){
row = mobility[counter, ]
mobility[counter, 'avg_mobility'] = mean(c(row$retail_and_recreation,
row$grocery_and_pharmacy,
row$parks,
row$transit_stations,
row$workplaces,
row$residential),
na.rm=TRUE)
}
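To actually see the difference in running time, you can wrap both approaches in system.time(). A minimal, self-contained sketch on synthetic data (the data frame below is made up so the snippet runs on its own):

```r
# Synthetic data: 10,000 rows, two numeric columns, with some missing values
set.seed(123)
df <- data.frame(a = runif(1e4), b = runif(1e4))
df$a[sample(1e4, 100)] <- NA

# Vectorized: a single rowMeans() call over all rows at once
t_fast <- system.time(
  means_fast <- unname(rowMeans(df[, c("a", "b")], na.rm = TRUE))
)

# Row-by-row loop: same result, but much slower
t_slow <- system.time({
  means_slow <- numeric(nrow(df))
  for (i in seq_len(nrow(df))) {
    means_slow[i] <- mean(c(df$a[i], df$b[i]), na.rm = TRUE)
  }
})

all.equal(means_fast, means_slow) # identical results; compare t_fast["elapsed"] and t_slow["elapsed"]
```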
At this point, we've realized the data contains information at the country, province, and city level. We'd ideally like to split these data into three separate data sets. For this, we need to select subsets of the data and store them in new data frames.

We know that the values in the province and city columns are empty for the country-level stats. In the same way, the city-level records are blank for the data aggregated at the province level. This is shown in the table below, where each X denotes the availability of data in the respective column.
| Aggregation level | country_region | province | city |
|---|---|---|---|
| Country | X | | |
| Province | X | X | |
| City | X | X | X |
Exercise 8
Create three subsets (country, province, city) of mobility according to the descriptions above. To filter for missing values (NA), you can use the boolean expression is.na(df$column). This essentially checks, row by row, whether the column contains a missing value.
country <- mobility_updated %>% filter(is.na(city) & is.na(province))
province <- mobility_updated %>% filter(is.na(city) & !is.na(province))
city <- mobility_updated %>% filter(!is.na(city))
Exercise 9
Write a function inspect_data() that takes in a data frame (df), produces descriptive stats (summary()), reports the number of observations (nrow()) and columns (ncol()), and shows the range of dates in the data.
Start from the code snippet below:
inspect_data <- function(df) {
cat('Generating some descriptive statistics...\n\n')
cat('Data: ') # using cat, you can write text to the console/screen; "\n" is a new line.
cat(deparse(substitute(df))) # this one prints the NAME of the input data set
cat('\n\n')
print(summary(df))
# more here...
}
# inspect_data
inspect_data <- function(df){
cat('Generating some descriptive statistics...\n\n')
cat('Data: ')
cat(deparse(substitute(df)))
cat('\n\n')
cat('Summary statistics\n')
print(summary(df))
cat('\n\n')
cat('Number of columns: ')
cat(ncol(df))
cat('\n\n')
cat('Number of observations: ')
cat(nrow(df))
cat('\n\n')
cat('Range of the data:\n')
print(summary(df$date)) # print() is needed here: inside a function, the result of summary() is not displayed automatically
cat('\n\n')
}
inspect_data(country)
## Generating some descriptive statistics...
##
## Data: country
##
## Summary statistics
## country_region province city place_id
## Length:321 Length:321 Length:321 Length:321
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## date retail_and_recreation grocery_and_pharmacy
## Min. :2020-02-15 Min. :-83.0 Min. :-75.000
## 1st Qu.:2020-05-05 1st Qu.:-34.0 1st Qu.: -7.000
## Median :2020-07-24 Median :-19.0 Median : -4.000
## Mean :2020-07-24 Mean :-19.5 Mean : -4.153
## 3rd Qu.:2020-10-12 3rd Qu.: -3.0 3rd Qu.: 0.000
## Max. :2020-12-31 Max. : 21.0 Max. : 24.000
## parks transit_stations workplaces residential
## Min. :-59.0 Min. :-75.00 Min. :-85.00 Min. :-1.000
## 1st Qu.: 11.0 1st Qu.:-50.00 1st Qu.:-39.00 1st Qu.: 5.000
## Median : 43.0 Median :-44.00 Median :-29.00 Median : 8.000
## Mean : 66.2 Mean :-41.33 Mean :-26.04 Mean : 8.548
## 3rd Qu.:107.0 3rd Qu.:-36.00 3rd Qu.: -7.00 3rd Qu.:11.000
## Max. :286.0 Max. : 10.00 Max. : 24.00 Max. :27.000
##
##
## Number of columns: 11
##
## Number of observations: 321
##
## Range of the data:
inspect_data(province)
## Generating some descriptive statistics...
##
## Data: province
##
## Summary statistics
## country_region province city place_id
## Length:3852 Length:3852 Length:3852 Length:3852
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## date retail_and_recreation grocery_and_pharmacy
## Min. :2020-02-15 Min. :-86.00 Min. :-86.00
## 1st Qu.:2020-05-05 1st Qu.:-32.00 1st Qu.: -8.00
## Median :2020-07-24 Median :-17.00 Median : -3.00
## Mean :2020-07-24 Mean :-15.74 Mean : -2.51
## 3rd Qu.:2020-10-12 3rd Qu.: -1.00 3rd Qu.: 2.00
## Max. :2020-12-31 Max. :142.00 Max. : 75.00
## NA's :19 NA's :3
## parks transit_stations workplaces residential
## Min. :-67.00 Min. :-82.00 Min. :-88.00 Min. :-3.000
## 1st Qu.: 12.00 1st Qu.:-51.00 1st Qu.:-40.00 1st Qu.: 5.000
## Median : 45.00 Median :-40.00 Median :-27.00 Median : 8.000
## Mean : 77.95 Mean :-36.34 Mean :-25.04 Mean : 8.135
## 3rd Qu.:107.00 3rd Qu.:-26.00 3rd Qu.: -8.00 3rd Qu.:11.000
## Max. :728.00 Max. : 74.00 Max. : 31.00 Max. :28.000
## NA's :287 NA's :40 NA's :9
##
##
## Number of columns: 11
##
## Number of observations: 3852
##
## Range of the data:
inspect_data(city)
## Generating some descriptive statistics...
##
## Data: city
##
## Summary statistics
## country_region province city place_id
## Length:106068 Length:106068 Length:106068 Length:106068
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## date retail_and_recreation grocery_and_pharmacy
## Min. :2020-02-15 Min. :-97.00 Min. :-96.00
## 1st Qu.:2020-05-04 1st Qu.:-30.00 1st Qu.: -8.00
## Median :2020-07-21 Median :-15.00 Median : -2.00
## Mean :2020-07-24 Mean :-14.77 Mean : -1.24
## 3rd Qu.:2020-10-17 3rd Qu.: -1.00 3rd Qu.: 5.00
## Max. :2020-12-31 Max. :199.00 Max. :222.00
## NA's :47032 NA's :39932
## parks transit_stations workplaces residential
## Min. :-88.00 Min. :-93.00 Min. :-90.0 Min. :-4.00
## 1st Qu.:-14.00 1st Qu.:-52.00 1st Qu.:-41.0 1st Qu.: 6.00
## Median : 11.00 Median :-40.00 Median :-28.0 Median : 8.00
## Mean : 22.01 Mean :-36.97 Mean :-27.4 Mean : 8.89
## 3rd Qu.: 46.00 3rd Qu.:-26.00 3rd Qu.:-16.0 3rd Qu.:12.00
## Max. :629.00 Max. :222.00 Max. : 66.0 Max. :32.00
## NA's :87524 NA's :47898 NA's :6884 NA's :48247
##
##
## Number of columns: 11
##
## Number of observations: 106068
##
## Range of the data:
Observe the high occurrence of missing values at lower aggregation levels (e.g., cities; especially for parks). Further inspection of the documentation tells us that these data gaps are intentional: values are omitted when the data doesn't meet Google's quality and privacy thresholds.
Exercise 10
Let's zoom in on these missing observations, and create a table that shows the percentage of missing values, per data set, for each of the columns below.

| subset | retail_and_recreation | grocery_and_pharmacy | parks | transit_stations | workplaces | residential |
|---|---|---|---|---|---|---|
| country | 0 | 0 | 0 | 0 | 0 | 0 |
| province | 0.49 | 0.08 | 7.34 | 1.02 | 0.23 | 0 |
| city | 44.32 | 37.65 | 82.40 | 45.11 | 6.43 | 45.39 |
Specifically, run the code below, and make it "run" on all data sets (country, province, city).
# Code snippet
missings <- function(df, cols) {
  # returns the *share* of missing values per column; multiply by 100 for percentages
  sapply(cols, function(x) length(which(is.na(df[, x]))) / nrow(df))
}
columns <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks', 'transit_stations', 'workplaces', 'residential')
# Another way to define columns is as follows using the 'grep' function:
# columns <- grep('retail|grocery|park|transit|workplace|residential', colnames(mobility),value=T)
missings(country, columns)
## retail_and_recreation grocery_and_pharmacy parks
## 0 0 0
## transit_stations workplaces residential
## 0 0 0
result <- lapply(list(country, province, city), missings, cols = columns)
# we can also nicely "bind" them together
bind_rows(result)
## # A tibble: 3 × 6
## retail_and_recreation grocery_and_pharmacy parks transit_stations workplaces
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0
## 2 0.00493 0.000779 0.0745 0.0104 0.00234
## 3 0.443 0.376 0.825 0.452 0.0649
## # ℹ 1 more variable: residential <dbl>
#... and add data set names
data.frame(dataset = c('country','province','city'), bind_rows(result))
## dataset retail_and_recreation grocery_and_pharmacy parks
## 1 country 0.000000000 0.0000000000 0.00000000
## 2 province 0.004932503 0.0007788162 0.07450675
## 3 city 0.443413659 0.3764754686 0.82516876
## transit_stations workplaces residential
## 1 0.00000000 0.000000000 0.0000000
## 2 0.01038422 0.002336449 0.0000000
## 3 0.45157823 0.064901761 0.4548686
Now that the data is in the right shape, it's finally time to visualize it, e.g., by using the built-in plot() function. If you're really into plotting, also check out the amazing package ggplot2.

The basic plot() function takes the following input parameters:
- x = the horizontal values
- y = the vertical values
- col = the color of the fill (optional)
- type = "l" for a line chart, "p" for a scatter plot (optional)

Here's an example of a time-series plot of the number of visits to parks:
plot(x = country$date, # horizontal
y = country$parks, # vertical
col = "green", # color
type = "l") # line chart
And here's another one that shows how to plot two lines in one graph:
plot(x = country$date, # horizontal
y = country$residential, # vertical
col = "red", # color of chart
type = "l", # line chart
ylim = c(-80, 30), # range of vertical axis
ylab = "% change from baseline", # label of vertical axis
xlab = "", # no label of horizontal axis
main = "Less time at work partially compensated by more time at home") # plot title
lines(country$date, country$workplaces, col="blue") # add a second line chart
legend("bottomleft", c("residential", "workplace"), fill=c("red", "blue")) # legend (location, names, colors)
Exercise 11
Replicate the following plot with R code (consider the labels, axes, colors, etc.):
plot(x = country[country$date > "2020-09-01",]$date,
y = country[country$date > "2020-09-01",]$transit_stations,
col = "black",
type = "l",
ylim = c(-85, 0), # range of vertical axis
ylab = "% change from baseline (January 2020)",
xlab = "",
main = "Transits to stations are down 75%")