Hannes Datta
Data preparation theory (“the bigger picture”)
In-class part of this week's tutorial: we will use RMarkdown to start exploring new data
After class
1) Initializing the script/setup
In today's session, we create our own RMarkdown document.
We will continue when you're done.
Chunk options such as `echo` (should the code itself be shown?) and `include` (should code and output appear at all?) control how a chunk shows up in the knitted document.
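As an illustration (the chunk label and contents below are made up, not from the course files), this is how those options look in an RMarkdown chunk header:

````markdown
```{r setup, include=FALSE}
# this chunk runs when knitting, but neither its code nor its
# output appears in the final document
library(tidyverse)
```

```{r, echo=FALSE}
# the plot appears in the document, but the code that made it is hidden
plot(1:10)
```
````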
Let's create some data!
x <- c(10, 20, NA, 5, 3, 100)
Write code that displays…
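Since the original prompt is truncated, here is one hedged guess at what is worth checking: how the `NA` in `x` affects summary statistics.

```r
x <- c(10, 20, NA, 5, 3, 100)

mean(x)                # NA: missing values propagate by default
mean(x, na.rm = TRUE)  # 27.6: drop the NA before averaging
sum(is.na(x))          # 1: count the missing values
```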
download.file("https://raw.githubusercontent.com/hannesdatta/course-dprep/master/content/docs/modules/week4/regional-global-daily-latest.csv", "streams.csv")
library(tidyverse)
streams <- read_csv('streams.csv', skip=1, n_max = Inf)
DO: Adjust the snippet so that only a limited number of rows is loaded if `prototype` is `TRUE`. If `prototype` is `FALSE`, all rows need to be loaded.

download.file("https://raw.githubusercontent.com/hannesdatta/course-dprep/master/content/docs/modules/week4/regional-global-daily-latest.csv", "streams.csv")
library(tidyverse)
prototype = TRUE
n_max = Inf
if (prototype==TRUE) n_max = 100
streams <- read_csv('streams.csv', skip=1, n_max = n_max)
Beware: `read.csv` is NOT `read_csv` (the latter is more efficient!). Such details (`read_csv`, not `read.csv`) are critically important!

Remember our research workflow? Before we can start with exactly the same data, we all need to download this data set first.
Question: Why not do it manually? What could go wrong?
urls = c('http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2022-09-07/visualisations/listings.csv', 'http://data.insideairbnb.com/belgium/vlg/antwerp/2022-06-22/visualisations/listings.csv', 'http://data.insideairbnb.com/united-states/nc/asheville/2023-06-18/visualisations/listings.csv')
for (url in urls) {
filename = paste0(gsub('[^a-zA-Z]', '', url), '.csv') # keep only letters, then add the .csv extension
filename = gsub('httpdatainsideairbnbcom', '', filename) # wipe httpdatainsideairbnbcom from filename
download.file(url, destfile = filename) # download file
}
Do: Use the code snippet from the previous slide to download all historical listing datasets for the city of Barcelona. Check whether files have been saved properly!
# your code here
# urls = # Assemble list of URLs here
# then, copy paste code snippet for downloading and renaming data
Functions from the `apply` family, though, DO return information:

- `lapply` (loops over a `vector` or `list`, returns a `list`)
- `sapply` (loops over a `vector`, returns a `vector`)
- `apply` (loops over rows or columns of a matrix, returns a `vector`)
- More members of the `apply` family exist, but I rarely use them.

urls = c('http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2023-09-03/visualisations/listings.csv', 'http://data.insideairbnb.com/belgium/vlg/antwerp/2023-06-28/visualisations/listings.csv', 'http://data.insideairbnb.com/united-states/nc/asheville/2023-06-18/visualisations/listings.csv')
datasets <- lapply(urls, read_csv, n_max = 200, col_types=list(neighbourhood='character'))
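A minimal, self-contained sketch of how the three family members differ (toy data, not the Airbnb listings):

```r
nums <- list(a = 1:3, b = 4:6)

lapply(nums, mean)  # returns a list: $a is 2, $b is 5
sapply(nums, mean)  # simplifies to a named vector: a = 2, b = 5

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)    # loop over rows: 9 12
apply(m, 2, sum)    # loop over columns: 3 7 11
```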
DO:
- Inspect `datasets[[1]]`, `datasets[[2]]`, etc. - does it correspond to what you would expect?
- Explore each dataset, e.g., using `lapply` and the `summary` function.
- Combine the datasets using `bind_rows()`. Can you still correctly identify the name of the original city?

What's the key difference between the “apply family” and “for loops”?
`lapply` “returns” stuff; loops just execute stuff without returning anything.

Each and every source code file follows this procedure. We called this setup-ITO earlier.
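To make the contrast concrete, here is a toy sketch (the variable names are made up): a `for` loop has to grow its own result object, while `lapply` hands the results back directly.

```r
# with a loop, we have to collect results manually
squares_loop <- c()
for (i in 1:3) {
  squares_loop <- c(squares_loop, i^2)
}
squares_loop  # 1 4 9

# lapply returns a list of results without any bookkeeping
squares_lapply <- lapply(1:3, function(i) i^2)
unlist(squares_lapply)  # 1 4 9
```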
To do all of this, we require knowledge of new programming concepts. So let's continue.
my_dataset = datasets[[1]]
my_dataset %>% group_by(neighbourhood) %>% summarize(hosts = n_distinct(host_id))
DO: Wrap the code snippet above in a function:

prep_data <- function(dataset) {
  # your code here
}

Then apply your function to `datasets[[1]]` and `datasets[[2]]`.
prep_data <- function(dataset) {
my_dataset = dataset %>% group_by(neighbourhood) %>% summarize(hosts = n_distinct(host_id))
return(my_dataset)
}
prep_data(datasets[[1]])
Now let's continue with some more exercises.
- Apply `prep_data()` to each element of `datasets`, using `lapply`. Save the result in a new variable, called `edited_datasets`.
- Can you use the `bind_rows()` function from the `dplyr` package to bind them together? Call the result `final_dataset`.
.prep_data <- function(dataset) {
result = dataset %>% group_by(neighbourhood) %>% summarize(hosts = n_distinct(host_id))
result = result %>% mutate(neighbourhood = as.character(neighbourhood))
return(result)
}
edited_datasets <- lapply(datasets, prep_data)
final_dataset <- bind_rows(edited_datasets)
- `grepl` can filter for information, without typing the “exact” search query
- `gsub` helps you to replace information

final_dataset %>% filter(grepl('Centrum', neighbourhood))
# A tibble: 5 × 2
neighbourhood hosts
<chr> <int>
1 Bijlmer-Centrum 1
2 Centrum-Oost 36
3 Centrum-West 30
4 Historisch Centrum 13
5 Hoboken - Centrum 1
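A quick toy illustration of both functions (names loosely based on the output above, plus one made-up non-match):

```r
hoods <- c('Bijlmer-Centrum', 'Centrum-Oost', 'Zuid')

grepl('Centrum', hoods)           # TRUE TRUE FALSE: which entries match?
gsub('Centrum', 'Center', hoods)  # replaces the matching part of each string
```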
DO:
- Filter `final_dataset` for “de pijp”, using `filter` and `grepl`.
- Filter for “zuidas”, using `filter` and `grepl`. Can you make your search case-insensitive?

We have now covered the `apply` family (`lapply`).

DO: Open the data using `read_csv()`, and explore it using `head()`, `tail()`, and `View()`.
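A hint for the case-insensitivity question: `grepl` has an `ignore.case` argument (the data below is made up for illustration):

```r
hoods <- c('De Pijp', 'Zuidas', 'Centrum-West')

grepl('de pijp', hoods)                      # all FALSE: case matters by default
grepl('de pijp', hoods, ignore.case = TRUE)  # TRUE FALSE FALSE
```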
Unit of analysis: Fundamental concept in data analysis. Refers to the specific entity or level of observation at which you collect, process, and analyze data.
Why is it important? It determines how you aggregate, summarize, and interpret your data. Without knowing your unit of analysis – at each part of your project – you will likely get lost.
Example: The listing
data is at the “city-listing” level. However, we may want to analyze the AirBnB data at the city-listing-date level. Aggregation may be necessary.
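For example, moving from the listing level to the city level changes the unit of analysis. A sketch with toy data (the column names below are made up, not the real Inside Airbnb schema):

```r
library(dplyr)

# listing-level data: one row per city-listing
listings <- data.frame(
  city  = c('Amsterdam', 'Amsterdam', 'Antwerp'),
  id    = c(1, 2, 3),
  price = c(100, 150, 80)
)

# aggregate to the city level: one row per city
listings %>%
  group_by(city) %>%
  summarize(n_listings = n(), avg_price = mean(price))
```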
DO: Explore the data using the `summary()` command. Take a look at the `retail_and_recreation_percent_change_from_baseline` variable: what does it mean?
DO:
- Drop the following columns from the data:

cols_to_drop <- c('country_region_code', 'metro_area', 'iso_3166_2_code', 'census_fips_code')

- Rename `sub_region_1` to `province`, and `sub_region_2` to `city`.
We can also use more advanced functions, e.g., to wipe the percent_change...
from the column names.
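For instance, `gsub` applied to `colnames()` can strip that suffix in one go. A sketch, assuming the columns share the `_percent_change_from_baseline` suffix (as the variable name shown earlier suggests):

```r
cols <- c('retail_and_recreation_percent_change_from_baseline',
          'grocery_and_pharmacy_percent_change_from_baseline')

gsub('_percent_change_from_baseline', '', cols)
# "retail_and_recreation" "grocery_and_pharmacy"

# on the actual data frame, this would look like:
# colnames(mobility) <- gsub('_percent_change_from_baseline', '', colnames(mobility))
```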
class(mobility$date)
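If the class turns out to be `character` rather than `Date` (this depends on how the file was read in), `as.Date()` converts it, which unlocks date arithmetic and filtering. The example dates below are made up:

```r
dates_chr <- c('2020-02-15', '2020-03-01')

dates <- as.Date(dates_chr)
class(dates)  # "Date"
dates + 1     # dates support arithmetic: "2020-02-16" "2020-03-02"
```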
DO: Can you subset the data, using `%>% filter(...)`?

Try running your code from the command line (e.g., using `Rscript <filename>`).

Remember the workflow from the start of today's session?
We've just covered step 1. Concepts and programming knowledge extend to the subsequent parts.
Welcome to today's coaching session.
Example application: contributing to open source projects using forks and pull requests (can demonstrate)
How's your teamwork going?