In this tutorial, we'll walk you through the steps of exploring a data set and obtaining an understanding about the setting in which data was collected. After completion of the tutorial, you'll be able to produce an RMarkdown document, which you can render as a PDF or HTML document (e.g., to show others, initiate a discussion about data, etc.).

We've chosen a data set used by Sim et al. 2022, in which the authors estimate the effect that COVID-19 had on music consumption on Spotify!

Along the way, you'll practice the skills you have been equipped with in the Datacamp tutorials.


Prerequisites


Learning Objectives


Support Needed?

For technical issues outside of scheduled classes, please check the support section on the course website.


1. Loading and Inspecting the Data

Google’s COVID-19 Community Mobility Reports are used extensively by governments and researchers to get an understanding about the pandemic's effect on social life and the economy. In the data, Google provides information on daily percentage changes in the number of visits by place category, compared with the corresponding day of the week during the 5-week (pre-Covid) period between January 3 and February 6, 2020. The data consists of many observations that cover six categories of places: retail & recreation, grocery & pharmacy, parks, transit stations, workplaces, and residence.

With the help of dplyr's read_csv() function, we can now import the mobility data for the Netherlands into R. Download the Region CSVs archive from Google (you have to unzip it first!), create a new R script in the same directory, and read in 2020_NL_Region_Mobility_Report.csv.

Tip: If R doesn't find your data, set your working directory properly (we typically click on SESSION --> SET WORKING DIRECTORY --> TO SOURCE FILE LOCATION). Hardcoding directories in your code (e.g., with setwd()) is not good practice.

mobility <- read_csv("2020_NL_Region_Mobility_Report.csv")
## Rows: 110241 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): country_region_code, country_region, sub_region_1, sub_region_2, i...
## dbl  (6): retail_and_recreation_percent_change_from_baseline, grocery_and_ph...
## lgl  (2): metro_area, census_fips_code
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Alternatively, you can use the data set stored on the course website.
# mobility <- read_csv('https://media.githubusercontent.com/media/hannesdatta/course-dprep/master/content/docs/modules/week3/tutorial/2020_NL_Region_Mobility_Report.csv', sep = ',')


Exercise 1
Inspect the mobility data frame, and describe the structure of the data set in your own words. - Which column names does the data have? - How are variables/metrics operationalized? - What is the unit of analysis at which the data is provided ("what uniquely identifies a row in this dataset?"). - How is the data sorted (if at all)? - Are the values contained in each column complete? Do these values all make sense? - Are “strings” properly encoded (i.e., do they contain any weird characters, or do they seem appropriately rendered so you can read them)?

Tips:


head(mobility)
## # A tibble: 6 × 15
##   country_region_code country_region sub_region_1 sub_region_2 metro_area
##   <chr>               <chr>          <chr>        <chr>        <lgl>     
## 1 NL                  Netherlands    <NA>         <NA>         NA        
## 2 NL                  Netherlands    <NA>         <NA>         NA        
## 3 NL                  Netherlands    <NA>         <NA>         NA        
## 4 NL                  Netherlands    <NA>         <NA>         NA        
## 5 NL                  Netherlands    <NA>         <NA>         NA        
## 6 NL                  Netherlands    <NA>         <NA>         NA        
## # ℹ 10 more variables: iso_3166_2_code <chr>, census_fips_code <lgl>,
## #   place_id <chr>, date <date>,
## #   retail_and_recreation_percent_change_from_baseline <dbl>,
## #   grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## #   parks_percent_change_from_baseline <dbl>,
## #   transit_stations_percent_change_from_baseline <dbl>,
## #   workplaces_percent_change_from_baseline <dbl>, …
summary(mobility)
##  country_region_code country_region     sub_region_1       sub_region_2      
##  Length:110241       Length:110241      Length:110241      Length:110241     
##  Class :character    Class :character   Class :character   Class :character  
##  Mode  :character    Mode  :character   Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  metro_area     iso_3166_2_code    census_fips_code   place_id        
##  Mode:logical   Length:110241      Mode:logical     Length:110241     
##  NA's:110241    Class :character   NA's:110241      Class :character  
##                 Mode  :character                    Mode  :character  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##       date            retail_and_recreation_percent_change_from_baseline
##  Min.   :2020-02-15   Min.   :-97.00                                    
##  1st Qu.:2020-05-04   1st Qu.:-30.00                                    
##  Median :2020-07-21   Median :-15.00                                    
##  Mean   :2020-07-24   Mean   :-14.86                                    
##  3rd Qu.:2020-10-17   3rd Qu.: -1.00                                    
##  Max.   :2020-12-31   Max.   :199.00                                    
##                       NA's   :47051                                     
##  grocery_and_pharmacy_percent_change_from_baseline
##  Min.   :-96.00                                   
##  1st Qu.: -8.00                                   
##  Median : -2.00                                   
##  Mean   : -1.32                                   
##  3rd Qu.:  5.00                                   
##  Max.   :222.00                                   
##  NA's   :39935                                    
##  parks_percent_change_from_baseline
##  Min.   :-88.00                    
##  1st Qu.:-10.00                    
##  Median : 16.00                    
##  Mean   : 31.53                    
##  3rd Qu.: 56.00                    
##  Max.   :728.00                    
##  NA's   :87811                     
##  transit_stations_percent_change_from_baseline
##  Min.   :-93.00                               
##  1st Qu.:-52.00                               
##  Median :-40.00                               
##  Mean   :-36.95                               
##  3rd Qu.:-26.00                               
##  Max.   :222.00                               
##  NA's   :47938                                
##  workplaces_percent_change_from_baseline
##  Min.   :-90.00                         
##  1st Qu.:-41.00                         
##  Median :-28.00                         
##  Mean   :-27.31                         
##  3rd Qu.:-16.00                         
##  Max.   : 66.00                         
##  NA's   :6893                           
##  residential_percent_change_from_baseline
##  Min.   :-4.00                           
##  1st Qu.: 6.00                           
##  Median : 8.00                           
##  Mean   : 8.84                           
##  3rd Qu.:12.00                           
##  Max.   :32.00                           
##  NA's   :48247
table(mobility$sub_region_1)
## 
##       Drenthe     Flevoland     Friesland    Gelderland     Groningen 
##          3922          2127          5303         15372          4701 
##       Limburg North Brabant North Holland    Overijssel South Holland 
##          9134         18029         14189          7573         18115 
##       Utrecht       Zeeland 
##          7495          3960
# The dataset shows the mobility changes at a national, regional, and city level in the Netherlands, for multiple categories of places. The records are sorted by date, and are grouped by location (starting with country stats).

# Moreover, three of the columns (`metro_area`, `iso_3166_2_code`, `census_flips_code`) contain only empty data, and two other columns (`sub_region_1`, `sub_region_2`) are blank for a subset of the values. This also holds for the columns related to percentage changes in mobility scores which are missing in some cases (especially on a city-level).


Exercise 2
Suppose you'd like to extract data for the Netherlands as a whole (i.e., NOT at the regional or city level), how would you have to filter for it? Re-generate a data summary (summary()) on this new table, and inspect missing values.

filtered_df <- mobility %>% filter(is.na(sub_region_1) & is.na(sub_region_2))
summary(filtered_df)
##  country_region_code country_region     sub_region_1       sub_region_2      
##  Length:321          Length:321         Length:321         Length:321        
##  Class :character    Class :character   Class :character   Class :character  
##  Mode  :character    Mode  :character   Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##  metro_area     iso_3166_2_code    census_fips_code   place_id        
##  Mode:logical   Length:321         Mode:logical     Length:321        
##  NA's:321       Class :character   NA's:321         Class :character  
##                 Mode  :character                    Mode  :character  
##                                                                       
##                                                                       
##                                                                       
##       date            retail_and_recreation_percent_change_from_baseline
##  Min.   :2020-02-15   Min.   :-83.0                                     
##  1st Qu.:2020-05-05   1st Qu.:-34.0                                     
##  Median :2020-07-24   Median :-19.0                                     
##  Mean   :2020-07-24   Mean   :-19.5                                     
##  3rd Qu.:2020-10-12   3rd Qu.: -3.0                                     
##  Max.   :2020-12-31   Max.   : 21.0                                     
##  grocery_and_pharmacy_percent_change_from_baseline
##  Min.   :-75.000                                  
##  1st Qu.: -7.000                                  
##  Median : -4.000                                  
##  Mean   : -4.153                                  
##  3rd Qu.:  0.000                                  
##  Max.   : 24.000                                  
##  parks_percent_change_from_baseline
##  Min.   :-59.0                     
##  1st Qu.: 11.0                     
##  Median : 43.0                     
##  Mean   : 66.2                     
##  3rd Qu.:107.0                     
##  Max.   :286.0                     
##  transit_stations_percent_change_from_baseline
##  Min.   :-75.00                               
##  1st Qu.:-50.00                               
##  Median :-44.00                               
##  Mean   :-41.33                               
##  3rd Qu.:-36.00                               
##  Max.   : 10.00                               
##  workplaces_percent_change_from_baseline
##  Min.   :-85.00                         
##  1st Qu.:-39.00                         
##  Median :-29.00                         
##  Mean   :-26.04                         
##  3rd Qu.: -7.00                         
##  Max.   : 24.00                         
##  residential_percent_change_from_baseline
##  Min.   :-1.000                          
##  1st Qu.: 5.000                          
##  Median : 8.000                          
##  Mean   : 8.548                          
##  3rd Qu.:11.000                          
##  Max.   :27.000

The number of missing values on most variables has drastically decreased. That's good news! It seems like the data is most "complete" at the national level.


Exercise 3
Let's now start zooming in on some of the metrics in the data to check what they really mean.

The data generally shows the percentage changes in visits to categorized places, compared to a normal value for that day of the week (i.e., baseline). The baseline day is the median value between January 3 and February 6, 2020. In other words, all reported numerical values are relative to the first few weeks of 2020.

First, try to understand the "range" of the values. For example, percentages are commonly scaled between 0 and 1 (e.g., such that .50 means +50%), but sometimes also between 0 and 100 (e.g., such that 50 means 50%). What applies here?

Second, think of one or more problems that can arise as a result of how percentage changes were calculated (i.e., the choice for the baseline between January 3 and February 6, 2020). Come up with specific examples that illustrate problems!

### Part 1 ###

# Exploring the range
summary(mobility$retail_and_recreation_percent_change_from_baseline)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -97.00  -30.00  -15.00  -14.86   -1.00  199.00   47051
# --> the range seems to be measured on a scale on which 100 means 100%.


### Part 2 ###
# Problems that could arise from the baseline choice:

# - The data does not account for seasonality (e.g., park visits increase as the weather improves)  

# - In the Netherlands, we experienced a relatively mild winter period throughout 2020 (e.g., the average temperature in January 2020 was the 3rd highest ever). As a result, the time distribution of certain categories (e.g., parks, residence) may not be representative for a typical year. 

# - Probably a lot of other reasons! Try to think about more!



2. Data Cleaning & Transformation

Next, let's narrow down the data to a smaller set of data that we would like to concentrate on. For example, the raw data may contain columns we don't need or want, or specific subsets that we want to get rid of.


Drop Columns

To make working with the data frame more manageable, we drop columns we don't need. More specifically, we overwrite the dataframe with a selection of necessary columns (e.g., df = df[,c("col_1", "col3", "col"4", "col5", ...)])).


Exercise 4
Drop the columns country_region_code, metro_area, iso_3166_2_code and census_fips_code from the mobility dataframe. How many columns do you end up with?

# bad solution - think about WHY?
mobility[, c(2:4, 8:15)]
# good solution
cols_to_drop <- c('country_region_code', 'metro_area', 'iso_3166_2_code', 'census_fips_code')
mobility <- mobility %>% select(-cols_to_drop)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(cols_to_drop)
## 
##   # Now:
##   data %>% select(all_of(cols_to_drop))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
head(mobility)
## # A tibble: 6 × 11
##   country_region sub_region_1 sub_region_2 place_id                   date      
##   <chr>          <chr>        <chr>        <chr>                      <date>    
## 1 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-15
## 2 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-16
## 3 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-17
## 4 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-18
## 5 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-19
## 6 Netherlands    <NA>         <NA>         ChIJu-SH28MJxkcRnwq9_851o… 2020-02-20
## # ℹ 6 more variables: retail_and_recreation_percent_change_from_baseline <dbl>,
## #   grocery_and_pharmacy_percent_change_from_baseline <dbl>,
## #   parks_percent_change_from_baseline <dbl>,
## #   transit_stations_percent_change_from_baseline <dbl>,
## #   workplaces_percent_change_from_baseline <dbl>,
## #   residential_percent_change_from_baseline <dbl>
# how many columns do we retain?
ncol(mobility)
## [1] 11


Rename Columns

We continue with the (updated) data frame mobility.

As you can see below, some of the remaining column names of mobility are rather long (or can be described more precisely).

# current column names
colnames(mobility)
##  [1] "country_region"                                    
##  [2] "sub_region_1"                                      
##  [3] "sub_region_2"                                      
##  [4] "place_id"                                          
##  [5] "date"                                              
##  [6] "retail_and_recreation_percent_change_from_baseline"
##  [7] "grocery_and_pharmacy_percent_change_from_baseline" 
##  [8] "parks_percent_change_from_baseline"                
##  [9] "transit_stations_percent_change_from_baseline"     
## [10] "workplaces_percent_change_from_baseline"           
## [11] "residential_percent_change_from_baseline"

You can rename a column of a data frame (df) by overwriting the current column names with a vector of new column names, like this: colnames(df) = c("new_col_name_1", "new_col_name_2", "new_col_name_3", ...)


Exercise 5
Trim the long column names (e.g., retail_and_recreation_percent_change_from_baseline becomes retail_and_recreation) and replace sub_region_1 and sub_region_2 by province and city, respectively.

# This solution works but is suboptimal, as the *format* of the input data may change (and hence, render the renamed columns wrong).
colnames(mobility) = c("country_region", 
                      "province",
                      "city", 
                      "place_id",
                      "date", 
                      "retail_and_recreation",
                      "grocery_and_pharmacy", 
                      "parks", 
                      "transit_stations", 
                      "workplaces", 
                      "residential")
# First use the rename function to rename sub_region_1 & sub_region_2. After that, use the rename_with function to remove all the suffixes with "_percent_change_from_baseline'
mobility_updated <- mobility %>% rename(province = sub_region_1,
                                city = sub_region_2) %>%
                         rename_with(~str_remove(., '_percent_change_from_baseline'))


Convert into Date

Although the dates may look like "regular" dates (e.g., 2020-02-15), R sometimes treats them like a character string (try running class(mobility$date) for yourself!).

When working with dplyr/tidyverse, though, dates are mostly imported correctly.

If they are still characters rather than dates, you need to convert the date column into date format. Date conversion can cause you a mild headache, and it even gets more complex with varying date formats across different geographic regions.

This is how to convert a character-encoded date to a "real date":

mobility$date <- as.Date(mobility$date)


Exercise 6
Now let's start calculating with our dates.

What's the first and last date in our data frame? Tip: you can use min() and max() now that the date column has been converted into date format!

The data runs from February 15, 2020 (min(mobility$date)) to December 31, 2020 (max(mobility$date)).

Observe above that you can not only use R code in code cells, but also in text by enclosing it in special characters: 2020-02-15.


Adding a New Column

Let's start generating some "derived metrics" - i.e., metrics that are based on the raw data, but that were not part of the actual raw data. Think of this as a first attempt to operationalize some variables that you're interested in.

For example, to get a proxy for overall movement trends, we could add avg_mobility to the data, which we define as the (row) mean of retail_and recreation, grocery_and_pharmacy, parks, transit_stations, workplaces, and residential.

Exercise 7
Add the avg_mobility column to your dataset (keep in mind that the data frame may contain one or more missing values). You may want to look up the documentation of the na.rm parameter.

# solution (most elegant and sustainable)

columns <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks', 'transit_stations', 'workplaces', 'residential')
mobility_updated <- mobility_updated %>% mutate(avg_mobility2 = rowMeans(select(., all_of(columns)), na.rm =TRUE))
# prone to errors of referencing to the correct columns
avg_mobility_array <- mobility[c(6,7,8,9,10,11)]
# this is inefficient, too, because it does not make use of the dplyr functions (and hence, is considerably slower on large datasets)
mobility$avg_mobility <- rowMeans(avg_mobility_array, na.rm = TRUE)

# The answer above, same as for exercise 4 makes use of counting which is prone to error if we add or remove a column!
# solution (also not sustainable as column names could change!!!)
mobility$avg_mobility <- rowMeans(mobility[,6:ncol(mobility)], na.rm = TRUE)

# solution (least elegant - and far less efficient!!!) - do you notice the difference in running time?  

for(counter in 1:nrow(mobility)){
  row = mobility[counter, ]
  mobility[counter, 'avg_mobility'] = mean(c(row$retail_and_recreation,
                                             row$grocery_and_pharmacy, 
                                             row$parks, 
                                             row$transit_stations, 
                                             row$workplaces, 
                                             row$residential), 
                                           na.rm=TRUE)
}


Selecting a Subset of Data

At this point, we've realized the data contains information at the country, province and city level. We'd ideally like to split these data in three separate data sets.

For this, we need to select a subset of the data, and store it in new data frames.

We know that the values in the province and city columns are empty for the country level stats. In the same way, the city level records are blank for the data aggregated on the province level. This is shown in the table below where each X denotes the availability of data in a respective column.

Aggregation Level country_region province city
Country X
Province X X
City X X X


Exercise 8
Create three subsets (country, province, city) of mobility according to the descriptions above. To filter for NA strings (NA) you can use the boolean expression is.na(df$column). This essentially checks whether the column contains missing values.

country <- mobility_updated %>% filter(is.na(city) & is.na(province))
province <- mobility_updated %>% filter(is.na(city) & !is.na(province))
city <- mobility_updated %>% filter(!is.na(city))


Exercise 9
Write a function inspect_data() that takes in a dataframe (df), and produces descriptive stats (summary), reports on the number of observations (nrow()) and columns (ncol()), and the range of dates in the data.

Start from the code snippet below:

inspect_data <- function(df) {
  cat('Generating some descriptive statistics...\n\n')
  cat('Data: ') # using cat, you can write text to the console/screen; "\n" is a new line.
  cat(deparse(substitute(df))) # this one prints the NAME of the input data set
  cat('\n\n')
  
  print(summary(df))
  # more here...
}
# inspect_data
inspect_data <- function(df){
  cat('Generating some descriptive statistics...\n\n')
  cat('Data: ')
  cat(deparse(substitute(df)))
  cat('\n\n')
  
  cat('Summary statistics\n')
  print(summary(df))
  cat('\n\n')
  
  cat('Number of columns: ')
  cat(ncol(df))
  cat('\n\n')
  
  cat('Number of observations: ')
  cat(nrow(df))
  cat('\n\n')
  
  cat('Range of the data:\n')
  summary(df$date)
  cat('\n\n')

}

inspect_data(country)
## Generating some descriptive statistics...
## 
## Data: country
## 
## Summary statistics
##  country_region       province             city             place_id        
##  Length:321         Length:321         Length:321         Length:321        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       date            retail_and_recreation grocery_and_pharmacy
##  Min.   :2020-02-15   Min.   :-83.0         Min.   :-75.000     
##  1st Qu.:2020-05-05   1st Qu.:-34.0         1st Qu.: -7.000     
##  Median :2020-07-24   Median :-19.0         Median : -4.000     
##  Mean   :2020-07-24   Mean   :-19.5         Mean   : -4.153     
##  3rd Qu.:2020-10-12   3rd Qu.: -3.0         3rd Qu.:  0.000     
##  Max.   :2020-12-31   Max.   : 21.0         Max.   : 24.000     
##      parks       transit_stations   workplaces      residential    
##  Min.   :-59.0   Min.   :-75.00   Min.   :-85.00   Min.   :-1.000  
##  1st Qu.: 11.0   1st Qu.:-50.00   1st Qu.:-39.00   1st Qu.: 5.000  
##  Median : 43.0   Median :-44.00   Median :-29.00   Median : 8.000  
##  Mean   : 66.2   Mean   :-41.33   Mean   :-26.04   Mean   : 8.548  
##  3rd Qu.:107.0   3rd Qu.:-36.00   3rd Qu.: -7.00   3rd Qu.:11.000  
##  Max.   :286.0   Max.   : 10.00   Max.   : 24.00   Max.   :27.000  
## 
## 
## Number of columns: 11
## 
## Number of observations: 321
## 
## Range of the data:
inspect_data(province)
## Generating some descriptive statistics...
## 
## Data: province
## 
## Summary statistics
##  country_region       province             city             place_id        
##  Length:3852        Length:3852        Length:3852        Length:3852       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       date            retail_and_recreation grocery_and_pharmacy
##  Min.   :2020-02-15   Min.   :-86.00        Min.   :-86.00      
##  1st Qu.:2020-05-05   1st Qu.:-32.00        1st Qu.: -8.00      
##  Median :2020-07-24   Median :-17.00        Median : -3.00      
##  Mean   :2020-07-24   Mean   :-15.74        Mean   : -2.51      
##  3rd Qu.:2020-10-12   3rd Qu.: -1.00        3rd Qu.:  2.00      
##  Max.   :2020-12-31   Max.   :142.00        Max.   : 75.00      
##                       NA's   :19            NA's   :3           
##      parks        transit_stations   workplaces      residential    
##  Min.   :-67.00   Min.   :-82.00   Min.   :-88.00   Min.   :-3.000  
##  1st Qu.: 12.00   1st Qu.:-51.00   1st Qu.:-40.00   1st Qu.: 5.000  
##  Median : 45.00   Median :-40.00   Median :-27.00   Median : 8.000  
##  Mean   : 77.95   Mean   :-36.34   Mean   :-25.04   Mean   : 8.135  
##  3rd Qu.:107.00   3rd Qu.:-26.00   3rd Qu.: -8.00   3rd Qu.:11.000  
##  Max.   :728.00   Max.   : 74.00   Max.   : 31.00   Max.   :28.000  
##  NA's   :287      NA's   :40       NA's   :9                        
## 
## 
## Number of columns: 11
## 
## Number of observations: 3852
## 
## Range of the data:
inspect_data(city)
## Generating some descriptive statistics...
## 
## Data: city
## 
## Summary statistics
##  country_region       province             city             place_id        
##  Length:106068      Length:106068      Length:106068      Length:106068     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##       date            retail_and_recreation grocery_and_pharmacy
##  Min.   :2020-02-15   Min.   :-97.00        Min.   :-96.00      
##  1st Qu.:2020-05-04   1st Qu.:-30.00        1st Qu.: -8.00      
##  Median :2020-07-21   Median :-15.00        Median : -2.00      
##  Mean   :2020-07-24   Mean   :-14.77        Mean   : -1.24      
##  3rd Qu.:2020-10-17   3rd Qu.: -1.00        3rd Qu.:  5.00      
##  Max.   :2020-12-31   Max.   :199.00        Max.   :222.00      
##                       NA's   :47032         NA's   :39932       
##      parks        transit_stations   workplaces     residential   
##  Min.   :-88.00   Min.   :-93.00   Min.   :-90.0   Min.   :-4.00  
##  1st Qu.:-14.00   1st Qu.:-52.00   1st Qu.:-41.0   1st Qu.: 6.00  
##  Median : 11.00   Median :-40.00   Median :-28.0   Median : 8.00  
##  Mean   : 22.01   Mean   :-36.97   Mean   :-27.4   Mean   : 8.89  
##  3rd Qu.: 46.00   3rd Qu.:-26.00   3rd Qu.:-16.0   3rd Qu.:12.00  
##  Max.   :629.00   Max.   :222.00   Max.   : 66.0   Max.   :32.00  
##  NA's   :87524    NA's   :47898    NA's   :6884    NA's   :48247  
## 
## 
## Number of columns: 11
## 
## Number of observations: 106068
## 
## Range of the data:

Observe the high occurrence of missing levels at lower aggregation level (e.g., cities; especially parks). Further inspection of the documentation tells us that these data gaps are intentional because the data doesn't meet the quality and privacy threshold.

Exercise 10

Let's zoom in on these missing observations, and create a table that shows the percentage missings, per data set and each of the columns below.

subset retail_and_recreation grocery_and_pharmacy parks transit_stations workplaces residential
country 0 0 0 0 0 0
province 0.49 0.08 7.34 1.02 0.23 0
city 44.32 37.65 82.40 45.11 6.43 45.39

Specifically, run the code below, and make it "run" on all data sets (country, province, city).

# Code snippet
missings <- function(df, cols) {
  sapply(cols, function(x) length(which(is.na(df[, x])))/nrow(df))
}

columns <- c('retail_and_recreation', 'grocery_and_pharmacy', 'parks', 'transit_stations', 'workplaces', 'residential')

# Another way to define columns is as follows using the 'grep' function:
# columns <- grep('retail|grocery|park|transit|workplace|residential', colnames(mobility),value=T)

missings(country, columns)
## retail_and_recreation  grocery_and_pharmacy                 parks 
##                     0                     0                     0 
##      transit_stations            workplaces           residential 
##                     0                     0                     0
result <- lapply(list(country, province, city), missings, cols = columns)

# we can also nicely "bind" them together
bind_rows(result)
## # A tibble: 3 × 6
##   retail_and_recreation grocery_and_pharmacy  parks transit_stations workplaces
##                   <dbl>                <dbl>  <dbl>            <dbl>      <dbl>
## 1               0                   0        0                0         0      
## 2               0.00493             0.000779 0.0745           0.0104    0.00234
## 3               0.443               0.376    0.825            0.452     0.0649 
## # ℹ 1 more variable: residential <dbl>
#... and add data set names
data.frame(dataset = c('country','province','city'), bind_rows(result))
##    dataset retail_and_recreation grocery_and_pharmacy      parks
## 1  country           0.000000000         0.0000000000 0.00000000
## 2 province           0.004932503         0.0007788162 0.07450675
## 3     city           0.443413659         0.3764754686 0.82516876
##   transit_stations  workplaces residential
## 1       0.00000000 0.000000000   0.0000000
## 2       0.01038422 0.002336449   0.0000000
## 3       0.45157823 0.064901761   0.4548686


3. Data Exploration


Now that the data is in the right shape, it's finally time to visualize it, e.g., by using the built-in plot() function. If you're really into plotting, also check out the amazing package ggplot2.

The basic plot() function takes the following input parameters:

Here's an example of a time-series plot of the number of visits to parks:

plot(x = country$date, # horizontal
     y = country$parks, # vertical
     col = "green", # color
     type = "l") # line chart 

And here's another one that shows how to plot two lines in one graph:

plot(x = country$date, # horizontal
     y = country$residential, # vertical
     col = "red", # color of chart
     type = "l",  # line chart
     ylim = c(-80, 30), # range of vertical axis
     ylab = "% change from baseline", # label of vertical axis
     xlab = "", # no label of horizontal axis
     main = "Less time at work partially compensated by more time at home") # plot title

lines(country$date, country$workplaces, col="blue") # add a second line chart

legend("bottomleft", c("residential", "workplace"), fill=c("red", "blue")) # legend (location, names, colors)

Exercise 11

Replicate the the following plot with R code (consider the labels, axes, colors, etc.):


plot(x = country[country$date > "2020-09-01",]$date, 
     y = country[country$date > "2020-09-01",]$transit_stations, 
     col = "black", 
     type = "l",  
     ylim = c(-85, 0), # range of vertical axis
     ylab = "% change from baseline (January 2020)", 
     xlab = "", 
     main = "Transits to stations are down 75%")