Hannes Datta
dplyr - chapters 1 & 2
Get your cheat sheets ready!
- streams.csv: artist streams, daily
- weather.csv: weather data for four countries
- socials.csv: social media stats
- playlists.csv: playlist statistics
Note: All data has been simulated; it's a new tutorial, so let's hope everything works out.
library(tidyverse)
download.file('https://github.com/hannesdatta/course-dprep/raw/refs/heads/master/content/docs/modules/week4/tutorial/data/week4_data.zip', 'week4_data.zip')
unzip('week4_data.zip', overwrite=TRUE)
DO: Create a file download.R and insert the snippet above. Then create a file explore.R, and load all data sets using read_csv(). Use the head(), View(), summary(), and dim() commands and try to describe each data set.
library(tidyverse)
streams <- read_csv('streams.csv')
weather <- read_csv('weather.csv')
socials <- read_csv('socials.csv')
playlists <- read_csv('playlists.csv')
- Reading & inspecting data: read_csv(), write_csv(), head(), View(), summary(), table(), dim()
- Wrangling with dplyr: group_by(), summarize(), mutate(), filter(), select(), arrange(), count()
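These verbs are designed to be chained together with the pipe. A minimal sketch of such a chain, using a tiny hypothetical stand-in for the streams data (so the artist names and values are made up):

```r
library(tidyverse)

# tiny hypothetical stand-in for streams.csv
streams <- tibble(
  artist  = c("A", "A", "B", "B"),
  country = c("NL", "NL", "NL", "BE"),
  streams = c(100, 200, 50, 80)
)

streams %>%
  filter(country == "NL") %>%          # keep Dutch rows only
  group_by(artist) %>%                 # one group per artist
  summarize(total = sum(streams)) %>%  # total streams per artist
  arrange(desc(total))                 # biggest artist first
```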
What's to come… (all part of the tidyverse)
streams %>%
  group_by(artist) %>%
  summarize(mean_streams = mean(streams, na.rm = TRUE))
DO: Please extend the code snippet above to include sd(), min(), and max() in the summary table.
library(dplyr)
# Compute basic statistics per artist
streams %>%
group_by(artist) %>%
summarize(
min_streams = min(streams, na.rm = TRUE),
mean_streams = mean(streams, na.rm = TRUE),
max_streams = max(streams, na.rm = TRUE),
sd_streams = sd(streams, na.rm = TRUE)
)
Ranking streams per date and country:
- Group by date and country
- Use the rank() function to find the "rank" of streams (inverse!), in combination with mutate()
- Remove the grouping again (ungroup())
- Arrange by date and ranks - lower ranks are "up"
- Keep only the top 10 (filter())
streams %>%
group_by(date, country) %>%
mutate(rank = rank(-streams)) %>%
ungroup() %>%
arrange(date, country, rank) %>%
filter(rank<=10)
- unique(): removes duplicates from a vector
- n_distinct(): counts the number of unique values in a vector
- distinct(): removes duplicate rows
streams %>% summarize(number_of_artists = n_distinct(artist))
# or: length(unique(streams$artist))
distinct(streams)
unique(streams$artist)[1:10]
[1] "10 Beats to Phoenix" "10 Notes to Wolves" "30 Minutes to Mirage"
[4] "30 Minutes to Rebels" "30 Minutes to Saints" "30 Ways to Storm"
[7] "40 Days to Rebels" "40 Miles to Storm" "60 Notes to Echo"
[10] "77 Beats to Wolves"
Functions you need to know
- grepl(x, y) → detects whether y contains x (evaluates to TRUE or FALSE)
- grep(x, y, value=TRUE) → detects whether y contains x (& returns the search results)
- gsub(x, z, y) → searches for x in y and replaces x by z
How to search? Examples:
- grepl("\\d", "Song123") → TRUE
- grepl("[a-z]", "K3") → FALSE (case-sensitive)
- grepl("^The", "The Beatles") → TRUE, but…
- grepl("^the", "The Beatles") → FALSE (know why?)
- grepl("band$", "Jazz Band", ignore.case=T) → TRUE
- grepl("jazz$", "Jazz Band", ignore.case=T) → FALSE (know why?)
DO:
- Find all artist names that contain a number
- Find all artist names that end in "band" (ignoring case)
- Replace Storm by Wind in the artist names
Tip: first extract the unique artist names.
artists <- unique(streams$artist)
grep('\\d', artists, value=T)
grep('band$', artists, value=T, ignore.case=T)
gsub('Storm','Wind',artists)
Issues
1) converting “characters” to date/time
2) conversion from one format to the other (easy: minutes → seconds; harder: dates → weeks)
3) accounting for the “time properties” of data
For dealing with dates and times, we make use of three extra packages: lubridate, ISOweek, and zoo.
DO: Install the lubridate package (install.packages('lubridate')) and load it (library(lubridate)).
bad_dates = c('20250104')
good_date = as.Date(bad_dates, format='%Y%m%d') # using base R
good_date
[1] "2025-01-04"
good_date2 = lubridate::ymd(bad_dates) # alternative using lubridate
good_date2
[1] "2025-01-04"
Note: :: explicitly tells R to take the ymd function (after the ::) from the lubridate package (before the ::).
A demonstration of lubridate's functionality
library(lubridate) # load the package
my_date = as.Date('2025-02-19') # set a date for demonstration
isoweek(my_date) # convert date to week number
[1] 8
m<-month(my_date) # convert date to month number
m
[1] 2
mon<-month(my_date, label=T) # retrieve name of month
mon
[1] Feb
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
A demonstration of lubridate's functionality (continued)
y<-year(my_date) # extract year from date
y
[1] 2025
# assemble back to a string
paste0(y,'-',sprintf('%02d',m),'-01') # note: sprintf turns 2 into '02'
[1] "2025-02-01"
A demonstration of ISOweek's functionality
ISOweek is a super fast way to get year-week labels, compared to assembling them with lubridate
ISOweek::ISOweek(my_date)
[1] "2025-W08"
DO: Please create the following new variables inside the streams data set:
- week: week number (lubridate's isoweek())
- week2: year-week (ISOweek's ISOweek())
- month: month number (lubridate's month())
- year: year number (lubridate's year())
Make use of tidyverse/dplyr's mutate().
library(lubridate)
streams <- streams %>%
mutate(
week = isoweek(date),
week2 = ISOweek::ISOweek(date),
month = month(date),
year = year(date)
)
streams %>% select(date, week, week2, month, year) %>% head(10)
# A tibble: 10 × 5
date week week2 month year
<date> <dbl> <chr> <dbl> <dbl>
1 2024-01-01 1 2024-W01 1 2024
2 2024-01-02 1 2024-W01 1 2024
3 2024-01-03 1 2024-W01 1 2024
4 2024-01-04 1 2024-W01 1 2024
5 2024-01-05 1 2024-W01 1 2024
6 2024-01-06 1 2024-W01 1 2024
7 2024-01-07 1 2024-W01 1 2024
8 2024-01-08 2 2024-W02 1 2024
9 2024-01-09 2 2024-W02 1 2024
10 2024-01-10 2 2024-W02 1 2024
Some values are NA, but we can make plausible assumptions as to what their values could have been.
- zoo::na.locf(): last observation carried forward
- zoo::na.approx(): approximation of missing values, such as linear interpolation
sim_data %>%
mutate(price_imp = zoo::na.locf(price,na.rm=F), # locf
playlists_imp = zoo::na.approx(playlists,na.rm=F), # linear interp.
playlists_imp2 = zoo::na.locf(playlists, na.rm=F)) # locf
# A tibble: 4 × 6
date price playlists price_imp playlists_imp playlists_imp2
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2025-01-01 1.2 30 1.2 30 30
2 2025-01-02 1.1 20 1.1 20 20
3 2025-01-03 NA NA 1.1 15 20
4 2025-01-04 1.1 10 1.1 10 10
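The sim_data object used above is not defined in the snippet; based on the printed output, it could be constructed as follows (values copied from the table):

```r
library(tidyverse)

# reconstruct sim_data from the printed output above
sim_data <- tibble(
  date      = as.Date(c('2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04')),
  price     = c(1.2, 1.1, NA, 1.1),
  playlists = c(30, 20, NA, 10)
)

sim_data %>%
  mutate(price_imp      = zoo::na.locf(price, na.rm = FALSE),       # carry last price forward
         playlists_imp  = zoo::na.approx(playlists, na.rm = FALSE), # linear interpolation: (20+10)/2 = 15
         playlists_imp2 = zoo::na.locf(playlists, na.rm = FALSE))   # carry forward instead: 20
```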
DO: Please use the zoo::na.locf() function to impute the number of playlists in the playlists data set.
Important:
- Group by artist (and think about why you would have to do that!)
- Arrange by dates (and also think about the why)
playlists = playlists %>%
group_by(artist) %>%
arrange(artist, iso_week) %>%
mutate(playlists_imputed = zoo::na.locf(playlists, na.rm=F))
playlists
# A tibble: 4,494 × 4
# Groups: artist [107]
iso_week artist playlists playlists_imputed
<chr> <chr> <dbl> <dbl>
1 2024-W01 10 Beats to Phoenix 24 24
2 2024-W02 10 Beats to Phoenix 28 28
3 2024-W03 10 Beats to Phoenix 28 28
4 2024-W04 10 Beats to Phoenix 27 27
5 2024-W06 10 Beats to Phoenix NA 27
6 2024-W07 10 Beats to Phoenix 25 25
7 2024-W08 10 Beats to Phoenix 15 15
8 2024-W11 10 Beats to Phoenix 61 61
9 2024-W12 10 Beats to Phoenix 22 22
10 2024-W14 10 Beats to Phoenix 23 23
# ℹ 4,484 more rows
…but… wait a minute - where's week 5? We will return to this later!
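One way to bring those missing weeks back is tidyr's complete(), which adds rows for combinations that are absent from the data. A sketch on a tiny hypothetical subset of the playlists data:

```r
library(tidyverse)

# hypothetical mini version of the playlists data, with week 2024-W05 missing
playlists_small <- tibble(
  iso_week  = c("2024-W04", "2024-W06"),
  artist    = "10 Beats to Phoenix",
  playlists = c(27, NA)
)

# add rows for every artist x week combination in the full week range
playlists_small %>%
  complete(artist,
           iso_week = paste0("2024-W", sprintf("%02d", 4:6))) %>%
  arrange(artist, iso_week)
# 2024-W05 now appears as a row with playlists = NA, ready to be imputed with na.locf()
```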
Example: creating lagged and leading values with dplyr's lag() and lead()
playlists = playlists %>%
  group_by(artist) %>%
  arrange(artist, iso_week) %>%
  mutate(playlists_lagged = lag(playlists_imputed),
         playlists_lead = lead(playlists_imputed))
head(playlists,2)
# A tibble: 2 × 6
# Groups: artist [1]
iso_week artist playlists playlists_imputed playlists_lagged playlists_lead
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2024-W01 10 Beats… 24 24 NA 28
2 2024-W02 10 Beats… 28 28 24 28
DO: Can you also add a variable, reflecting the change from one week to the next?
playlists = playlists %>%
  group_by(artist) %>%
  arrange(artist, iso_week) %>%
  mutate(playlists_lagged = lag(playlists_imputed),
         playlists_lead = lead(playlists_imputed),
         change = playlists_imputed - playlists_lagged)
playlists
# A tibble: 4,494 × 7
# Groups: artist [107]
iso_week artist playlists playlists_imputed playlists_lagged playlists_lead
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2024-W01 10 Beat… 24 24 NA 28
2 2024-W02 10 Beat… 28 28 24 28
3 2024-W03 10 Beat… 28 28 28 27
4 2024-W04 10 Beat… 27 27 28 27
5 2024-W06 10 Beat… NA 27 27 25
6 2024-W07 10 Beat… 25 25 27 15
7 2024-W08 10 Beat… 15 15 25 61
8 2024-W11 10 Beat… 61 61 15 22
9 2024-W12 10 Beat… 22 22 61 23
10 2024-W14 10 Beat… 23 23 22 24
# ℹ 4,484 more rows
# ℹ 1 more variable: change <dbl>
Long data (streams.csv) vs. wide data (weather.csv)
DO: Let's take a look at streams.csv
head(streams,3)
# A tibble: 3 × 8
date artist country streams week week2 month year
<date> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 2024-01-01 10 Beats to Phoenix BE 6983 1 2024-W01 1 2024
2 2024-01-02 10 Beats to Phoenix BE 8451 1 2024-W01 1 2024
3 2024-01-03 10 Beats to Phoenix BE 0 1 2024-W01 1 2024
From long…
# A tibble: 2 × 8
date artist country streams week week2 month year
<date> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 2024-01-01 10 Beats to Phoenix BE 6983 1 2024-W01 1 2024
2 2024-01-02 10 Beats to Phoenix BE 8451 1 2024-W01 1 2024
…to wide:
streams %>%
pivot_wider(id_cols=c('date','artist'),
names_from=c('country'),
values_from=c('streams')) %>% head(6)
# A tibble: 6 × 6
date artist BE DE NL US
<date> <chr> <dbl> <dbl> <dbl> <dbl>
1 2024-01-01 10 Beats to Phoenix 6983 9523 6258 2856
2 2024-01-02 10 Beats to Phoenix 8451 8601 0 6642
3 2024-01-03 10 Beats to Phoenix 0 0 9712 9648
4 2024-01-04 10 Beats to Phoenix 0 292 0 972
5 2024-01-05 10 Beats to Phoenix 1710 0 12293 7029
6 2024-01-06 10 Beats to Phoenix 3069 2899 2724 1556
DO: Let's take a look at weather.csv
head(weather,3)
# A tibble: 3 × 9
date DE_temp BE_temp NL_temp US_temp DE_sun BE_sun NL_sun US_sun
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2024-01-01 14.3 10.5 10.6 17.1 5.46 5.16 4.31 6.93
2 2024-01-02 8.72 17.1 10.4 21.8 4.88 5.63 3.75 12
3 2024-01-03 11.7 7.41 11.7 17.4 6.50 6.09 5.06 7.95
From wide…
# A tibble: 2 × 9
date DE_temp BE_temp NL_temp US_temp DE_sun BE_sun NL_sun US_sun
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2024-01-01 14.3 10.5 10.6 17.1 5.46 5.16 4.31 6.93
2 2024-01-02 8.72 17.1 10.4 21.8 4.88 5.63 3.75 12
…to long:
weather %>%
pivot_longer(cols = matches("_"),
names_to = 'country_measure',
values_to = 'value') %>% head(4)
# A tibble: 4 × 3
date country_measure value
<date> <chr> <dbl>
1 2024-01-01 DE_temp 14.3
2 2024-01-01 BE_temp 10.5
3 2024-01-01 NL_temp 10.6
4 2024-01-01 US_temp 17.1
DO: Convert the social media metrics (socials.csv) from long to wide.
Steps:
- Keep only rows with metric == 'followers' (using filter())
- pivot_wider requires id_cols, names_from, and values_from
wide = socials %>% filter(metric=='followers') %>%
pivot_wider(id_cols = c('date','artist'),
names_from=c('platform'),
values_from=c('value'))
head(wide,4)
# A tibble: 4 × 6
date artist TikTok X Snapchat Instagram
<date> <chr> <dbl> <dbl> <dbl> <dbl>
1 2024-01-01 Saint Echo 458301 NA NA 198006
2 2024-01-02 Saint Echo 458594 368312 226699 NA
3 2024-01-03 Saint Echo 459599 NA 226957 198549
4 2024-01-04 Saint Echo 460286 369857 227480 198947
DO: Convert the wide data from before back to long.
Hints:
- pivot_longer requires cols; optional: names_to and values_to
wide %>% pivot_longer(cols = all_of(c('TikTok', 'X', 'Snapchat', 'Instagram')))
# A tibble: 142,096 × 4
date artist name value
<date> <chr> <chr> <dbl>
1 2024-01-01 Saint Echo TikTok 458301
2 2024-01-01 Saint Echo X NA
3 2024-01-01 Saint Echo Snapchat NA
4 2024-01-01 Saint Echo Instagram 198006
5 2024-01-02 Saint Echo TikTok 458594
6 2024-01-02 Saint Echo X 368312
7 2024-01-02 Saint Echo Snapchat 226699
8 2024-01-02 Saint Echo Instagram NA
9 2024-01-03 Saint Echo TikTok 459599
10 2024-01-03 Saint Echo X NA
# ℹ 142,086 more rows
- left_join(): keeps everything in the left table, and adds the right table where it can
- inner_join(): keeps only observations that occur in both tables (beware: you may lose observations!)
Example: first create the week variable (ISOweek), then merge using left_join()
streams <- streams %>% mutate(week=ISOweek::ISOweek(date))
merged_data <- streams %>%
left_join(playlists, by = c("artist" = "artist",
"week"="iso_week"))
head(merged_data)
# A tibble: 6 × 13
date artist country streams week week2 month year playlists
<date> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 2024-01-01 10 Beats to Phoe… BE 6983 2024… 2024… 1 2024 24
2 2024-01-02 10 Beats to Phoe… BE 8451 2024… 2024… 1 2024 24
3 2024-01-03 10 Beats to Phoe… BE 0 2024… 2024… 1 2024 24
4 2024-01-04 10 Beats to Phoe… BE 0 2024… 2024… 1 2024 24
5 2024-01-05 10 Beats to Phoe… BE 1710 2024… 2024… 1 2024 24
6 2024-01-06 10 Beats to Phoe… BE 3069 2024… 2024… 1 2024 24
# ℹ 4 more variables: playlists_imputed <dbl>, playlists_lagged <dbl>,
# playlists_lead <dbl>, change <dbl>
Task: Please follow the steps below to assemble a data set for analysis, using streams.csv, weather.csv, socials.csv, and playlists.csv.
Tips:
- Work in a .R file that saves the final data set as final.csv (write_csv())
1) Please aggregate streams.csv to the weekly level (ISO); how will you deal with the daily metric of streams? Mean? Sum?
2) In your newly created data, please create a trend variable, running from 1 (for the first week) up to the last week.
3) Create an empty data set (i.e., only key columns): for all artists, and all countries, and all weeks.
4) Merge streams to this newly created data set.
5) Load weather data, pivot to long by country & merge to your data (by week & country)
6) Load social media data, pivot to wide and merge with data (by week and artist)
7) Impute any missing data using na.locf or na.approx, depending on what suits best.
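One possible sketch for steps 1-4, on a tiny hypothetical stand-in for the streams data (column names as on the earlier slides; summing is chosen for the aggregation, but the mean is defensible too):

```r
library(tidyverse)

# tiny hypothetical stand-in for the daily streams data
streams <- tibble(
  date    = as.Date('2024-01-01') + 0:13,
  artist  = "A",
  country = "NL",
  streams = 1:14
)

# 1) aggregate daily streams to ISO weeks
weekly <- streams %>%
  mutate(week = ISOweek::ISOweek(date)) %>%
  group_by(artist, country, week) %>%
  summarize(streams = sum(streams, na.rm = TRUE), .groups = "drop")

# 2) trend variable: 1 for the first week, counting upwards
weekly <- weekly %>%
  mutate(trend = as.integer(factor(week, levels = sort(unique(week)))))

# 3) skeleton with every artist x country x week combination
skeleton <- expand_grid(artist  = unique(weekly$artist),
                        country = unique(weekly$country),
                        week    = unique(weekly$week))

# 4) merge the aggregated streams onto the skeleton
final <- skeleton %>% left_join(weekly, by = c("artist", "country", "week"))
```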
8) Scaling up I: use a for loop to create 10 line plots of an artist's streams, and save them using ggsave()
artists = unique(streams$artist)
for (one_artist in artists[1:10]) {
filtered_data = streams %>% filter(artist==one_artist)
# ggplot code here
}
9) Scaling up II: use lapply to estimate individual regression models (lm) for 10 artists; return results and summarize.
# starting code
artists = unique(streams$artist)
results = lapply(artists[1:10], function(one_artist) {
filtered_data = streams %>% filter(artist==one_artist)
# estimation code here
m <- lm(rnorm(10)~1) # just an example
  coefs <- summary(m)$coefficients
  results <- tibble(variable = rownames(coefs),
                    estimate = coefs[, 1],
                    se = coefs[, 2],
                    t = coefs[, 3],
                    pval = coefs[, 4])
results = results %>% mutate(artist=one_artist)
return(results)
})
all_results = bind_rows(results)
Cheat sheets: the dplyr one is useful (and very visual!)
Complete the tutorial (part 1) and the exercises (part 2)
After class, you will continue with the coaching session.
- Use .R files for regular scripts
- Prefer tidyverse functions (e.g., read_csv(), not read.csv())