Engineering data sets (in-class tutorial)

Hannes Datta

Before we get started

  • Exam dates have been set
    • dPrep: 3 April 2025 (resit 27 June)
    • oDCM: 1 April 2025 (resit 26 June)
  • GitHub: can push? → recording of last week, part 2, minute 40 (see also Canvas announcement)
  • Mac users: Please familiarize yourselves with R/RStudio on Windows during the oDCM lab session this Friday
  • Completely overhauled tutorial - hope everything works :)

Structure of this week

Get your cheat sheets ready!!!

Why this tutorial

  • What we've covered earlier in this course
    • Basic wrangling/transformations (can you name a few examples?)
    • Structure scripts in four blocks – required to automate workflow later (setup, input, transformation, output)
  • Now: zoom in on more advanced ways to transform data

Today's data

  • Massive, “big” data on music streaming
    • streams.csv: artist streams, daily
    • weather.csv: weather data for four countries
    • socials.csv: social media stats
    • playlists.csv: playlist statistics

Note: All data has been simulated; it's a new tutorial so let's hope everything works out.

DO: Let's download today's data

library(tidyverse)
download.file('https://github.com/hannesdatta/course-dprep/raw/refs/heads/master/content/docs/modules/week4/tutorial/data/week4_data.zip', 'week4_data.zip')
unzip('week4_data.zip', overwrite=TRUE)
  • Create a new project in R
  • Create a file called download.R, and insert the snippet above.
  • Create a new file, called explore.R, and load all data sets using read_csv()
  • Then, use the head(), View(), summary(), dim() commands and try to describe the data set.

Solution

library(tidyverse)

streams <- read_csv('streams.csv')
weather <- read_csv('weather.csv')
socials <- read_csv('socials.csv')
playlists <- read_csv('playlists.csv')

Part 1: Expanding our knowledge of R

  • Knowledge so far is restricted to…
    • opening/saving data (read_csv, write_csv),
    • exploring data (head(), View(), summary(), table(), dim()),
    • wrangling with tidyverse/dplyr: group_by(), summarize(), mutate(), filter(), select(), arrange(), count()

Part 1: Expanding our knowledge of R

  • But… what about more complex operations?
    • custom summary statistics? ranking? → create charts
    • removing duplicates? → for data auditing
    • complex text searches? → find artists
    • dealing with time series data? → replace missings
    • reshaping data? → bring data into the right shape for analysis
    • merging multiple data sets? → build “final” data set

Part 1: Expanding our knowledge of R

What's to come…

  • 1A) Applying base R functions in tidyverse
  • 1B) Advanced cleaning: Using regular expressions
  • 1C) Dealing with time
  • 1D) Reshaping data
  • 1E) Merging data

1A) Applying functions in `tidyverse` (1/3)

  • summarize() can take (almost) any function as an argument
  • Best to have the “Base R” cheat sheet available
streams %>% group_by(artist) %>%
  summarize(
    mean_streams = mean(streams, na.rm = TRUE))

DO: Please extend code snippet above to include

  • standard deviation (sd),
  • minimum (min) and
  • maximum (max) in the summary table.

Solution (1/3)

library(dplyr)

# Compute basic statistics per artist
streams %>%
  group_by(artist) %>%
  summarize(
    min_streams = min(streams, na.rm = TRUE),
    mean_streams = mean(streams, na.rm = TRUE),
    max_streams = max(streams, na.rm = TRUE),
    sd_streams = sd(streams, na.rm = TRUE)
  )

1A) DO: Applying functions in `tidyverse` (2/3)

  • Now, let's extract the daily “top 10” - i.e., the top 10 artists on a given day.
  • Tips
    • First, group by date and country
    • Use the rank() function to find the “rank” of streams (inverse!) - in combination w/ mutate()
    • Ungroup your data (ungroup())
    • Reorder your data set by date and ranks - lower ranks are “up”
    • Show the first 10 observations for each day (filter)

Solution (2/3)

streams %>% 
  group_by(date, country) %>% 
  mutate(rank = rank(-streams)) %>% 
  ungroup() %>% 
  arrange(date, country, rank) %>% 
  filter(rank<=10)

1A) Applying functions in `tidyverse` (3/3)

  • Meet unique(): removes duplicates from a vector
  • n_distinct(): counts the number of unique values in a vector
  • distinct(): removes duplicate rows
  • Example: Let's count the number of unique artists
streams %>% summarize(number_of_artists = n_distinct(artist))
# or: length(unique(streams$artist))
distinct(streams)

1B) Advanced Cleaning: Using Regular Expressions

  • Often, we're facing character columns
unique(streams$artist)[1:10]
 [1] "10 Beats to Phoenix"  "10 Notes to Wolves"   "30 Minutes to Mirage"
 [4] "30 Minutes to Rebels" "30 Minutes to Saints" "30 Ways to Storm"    
 [7] "40 Days to Rebels"    "40 Miles to Storm"    "60 Notes to Echo"    
[10] "77 Beats to Wolves"  
  • Hard-coded “searches” in such data are difficult (you will probably never get them exactly right)
  • Regular expressions are ways to help us search using “search patterns”

1B) Advanced Cleaning: Using Regular Expressions

Functions you need to know

  • grepl(x, y) → detect whether y contains x (evaluates to TRUE or FALSE)
  • grep(x, y, value=TRUE) → detect whether y contains x (& returns the search results)
  • gsub(x, z, y) → searches x in y and replaces x by z
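A minimal, runnable illustration of the three functions (the artist names here are made up):

```r
artists <- c("Echo Band", "10 Beats to Phoenix", "Storm Riders")

grepl("\\d", artists)                 # contains a digit? FALSE TRUE FALSE
grep("Band$", artists, value = TRUE)  # returns the matching values: "Echo Band"
gsub("Storm", "Wind", artists)        # replaces "Storm" by "Wind" everywhere
```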

1B) Advanced Cleaning: Using Regular Expressions

How to search? Examples:

  • grepl("\\d", "Song123") → TRUE
  • grepl("[a-z]", "K3") → FALSE (case-sensitive)
  • grepl("^The", "The Beatles") → TRUE, but…
  • grepl("^the", "The Beatles") → FALSE (know why?)
  • grepl("band$", "Jazz Band", ignore.case=T) → TRUE
  • grepl("jazz$", "Jazz Band", ignore.case=T) → FALSE (know why?)

1B) Advanced Cleaning: Using Regular Expressions

DO:

  • How many artists have a number in their name?
  • Which bands end on “Band”?
  • Replace Storm by Wind

Tip: first extract the unique artist names.

artists <- unique(streams$artist)

Solution

artists <- unique(streams$artist)
grep('\\d', artists, value=T)
grep('band$', artists, value=T, ignore.case=T)
gsub('Storm','Wind',artists)

1C) Dealing with time

  • Dealing with “time” in the largest sense - dates, years, months, weeks, hours, minutes, seconds - can be a complicated matter

Issues

1) converting “characters” to date/time

2) conversion from one format to another (easy: minutes → seconds; harder: dates → weeks)

3) accounting for the “time properties” of data

1C) Dealing with time

For dealing with dates and times, we make use of three extra packages:

  • lubridate,
  • ISOweek, and
  • zoo

DO:

  • Please install these three packages (e.g., install.packages('lubridate'))
  • Then, load the libraries (e.g., library(lubridate))

1C) Dealing with time (1) - Conversion

  • Often, data is not properly coded
  • Need to “recode” to suit the date-time format R recognizes
  • Example: YYYYMMDD instead of YYYY-MM-DD
bad_dates = c('20250104')

good_date = as.Date(bad_dates, format='%Y%m%d') # using base R
good_date
[1] "2025-01-04"
good_date2 = lubridate::ymd(bad_dates) # alternative using lubridate
good_date2
[1] "2025-01-04"

Note: :: explicitly tells R to take the ymd function (after ::) from the lubridate package (before the ::)

1C) Dealing with time (2) - Conversion

A demonstration of lubridate's functionality

library(lubridate) # load the package
my_date = as.Date('2025-02-19') # set a date for demonstration

isoweek(my_date) # convert date to week number
[1] 8
m<-month(my_date) # convert date to month number
m
[1] 2
mon<-month(my_date, label=T) # retrieve name of month
mon
[1] Feb
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

1C) Dealing with time (2) - Conversion

A demonstration of lubridate's functionality (continued)

y<-year(my_date) # extract year from date
y
[1] 2025
# assemble back to a string
paste0(y,'-',sprintf('%02d',m),'-01') # note: sprintf turns 2 into '02'
[1] "2025-02-01"

1C) Dealing with time (2) - Conversion

A demonstration of ISOweek's functionality

  • Often, week numbers are only meaningful when connected to a year (e.g., to distinguish between 2024-W08 and 2025-W08)
  • ISOweek is a super fast way to get us there, compared to lubridate
ISOweek::ISOweek(my_date)
[1] "2025-W08"

1C) Dealing with time (2) - Conversion

DO: Please create the following new variables inside the streams data set:

  • week: week number (lubridate's isoweek)
  • week2: year-week (ISOweek's ISOweek)
  • month: month number (lubridate's month)
  • year: year number (lubridate's year)

Make use of tidyverse/dplyr's mutate.

Solution

library(lubridate)

streams <- streams %>%
  mutate(
    week = isoweek(date),
    week2 = ISOweek::ISOweek(date),
    month = month(date),
    year = year(date)
  )

streams %>% select(date, week, week2, month, year) %>% head(10)
# A tibble: 10 × 5
   date        week week2    month  year
   <date>     <dbl> <chr>    <dbl> <dbl>
 1 2024-01-01     1 2024-W01     1  2024
 2 2024-01-02     1 2024-W01     1  2024
 3 2024-01-03     1 2024-W01     1  2024
 4 2024-01-04     1 2024-W01     1  2024
 5 2024-01-05     1 2024-W01     1  2024
 6 2024-01-06     1 2024-W01     1  2024
 7 2024-01-07     1 2024-W01     1  2024
 8 2024-01-08     2 2024-W02     1  2024
 9 2024-01-09     2 2024-W02     1  2024
10 2024-01-10     2 2024-W02     1  2024

1C) Dealing with time (3) - "time properties" of data

  • Some values in time series are NA, but we can often make plausible assumptions as to what they could have been.
  • Example 1: prices (a good assumption: last observed value!)
  • Example 2: playlist evolution (a good assumption: linear interpolation → 15)

1C) Dealing with time (3) - "time properties" of data

  • zoo::na.locf: last observation carried forward
  • zoo::na.approx: approximation of missing values, such as linear interpolation
sim_data %>% 
  mutate(price_imp = zoo::na.locf(price,na.rm=F), # locf
         playlists_imp = zoo::na.approx(playlists,na.rm=F), # linear interp.
         playlists_imp2 = zoo::na.locf(playlists, na.rm=F)) # locf
# A tibble: 4 × 6
  date       price playlists price_imp playlists_imp playlists_imp2
  <date>     <dbl>     <dbl>     <dbl>         <dbl>          <dbl>
1 2025-01-01   1.2        30       1.2            30             30
2 2025-01-02   1.1        20       1.1            20             20
3 2025-01-03  NA          NA       1.1            15             20
4 2025-01-04   1.1        10       1.1            10             10

1C) Dealing with time (3) - "time properties" of data

DO: Please use the zoo::na.locf() function to impute the number of playlists in the playlists data set

Important:

  • first group your data by artist (and think about why you would have to do that!)
  • also order your data by dates (and also think about the why)

Solution

playlists = playlists %>% 
  group_by(artist) %>% 
  arrange(artist, iso_week) %>% 
  mutate(playlists_imputed = zoo::na.locf(playlists, na.rm=F))
playlists
# A tibble: 4,494 × 4
# Groups:   artist [107]
   iso_week artist              playlists playlists_imputed
   <chr>    <chr>                   <dbl>             <dbl>
 1 2024-W01 10 Beats to Phoenix        24                24
 2 2024-W02 10 Beats to Phoenix        28                28
 3 2024-W03 10 Beats to Phoenix        28                28
 4 2024-W04 10 Beats to Phoenix        27                27
 5 2024-W06 10 Beats to Phoenix        NA                27
 6 2024-W07 10 Beats to Phoenix        25                25
 7 2024-W08 10 Beats to Phoenix        15                15
 8 2024-W11 10 Beats to Phoenix        61                61
 9 2024-W12 10 Beats to Phoenix        22                22
10 2024-W14 10 Beats to Phoenix        23                23
# ℹ 4,484 more rows

…but… wait a minute - where's week 5? We will return to this later!

1C) Dealing with time

  • Other typical “time series” functions are to generate lagged values (or lead values)

Example:

playlists = playlists %>%
  group_by(artist) %>%
  arrange(artist, iso_week) %>%
  mutate(playlists_lagged = lag(playlists_imputed),
         playlists_lead = lead(playlists_imputed))
head(playlists,2)
# A tibble: 2 × 6
# Groups:   artist [1]
  iso_week artist    playlists playlists_imputed playlists_lagged playlists_lead
  <chr>    <chr>         <dbl>             <dbl>            <dbl>          <dbl>
1 2024-W01 10 Beats…        24                24               NA             28
2 2024-W02 10 Beats…        28                28               24             28

DO: Can you also add a variable, reflecting the change from one week to the next?

Solution

playlists = playlists %>%
  group_by(artist) %>%
  arrange(artist, iso_week) %>%
  mutate(playlists_lagged = lag(playlists_imputed),
         playlists_lead = lead(playlists_imputed),
         change = playlists_imputed - playlists_lagged)
playlists
# A tibble: 4,494 × 7
# Groups:   artist [107]
   iso_week artist   playlists playlists_imputed playlists_lagged playlists_lead
   <chr>    <chr>        <dbl>             <dbl>            <dbl>          <dbl>
 1 2024-W01 10 Beat…        24                24               NA             28
 2 2024-W02 10 Beat…        28                28               24             28
 3 2024-W03 10 Beat…        28                28               28             27
 4 2024-W04 10 Beat…        27                27               28             27
 5 2024-W06 10 Beat…        NA                27               27             25
 6 2024-W07 10 Beat…        25                25               27             15
 7 2024-W08 10 Beat…        15                15               25             61
 8 2024-W11 10 Beat…        61                61               15             22
 9 2024-W12 10 Beat…        22                22               61             23
10 2024-W14 10 Beat…        23                23               22             24
# ℹ 4,484 more rows
# ℹ 1 more variable: change <dbl>

1D) Reshaping/Pivoting

  • Mostly two data set “types”
    • “Long”: each observation has its own row, with a single column for values (see streams.csv)
    • “Wide”: each row is a single entity, with multiple columns for different variables (see weather.csv)
  • When we convert from long to wide or the other way around, we call this 'pivoting' or 'reshaping'

1D) Reshaping/Pivoting: Meet long data

DO: Let's take a look at streams.csv

head(streams,3)
# A tibble: 3 × 8
  date       artist              country streams  week week2    month  year
  <date>     <chr>               <chr>     <dbl> <dbl> <chr>    <dbl> <dbl>
1 2024-01-01 10 Beats to Phoenix BE         6983     1 2024-W01     1  2024
2 2024-01-02 10 Beats to Phoenix BE         8451     1 2024-W01     1  2024
3 2024-01-03 10 Beats to Phoenix BE            0     1 2024-W01     1  2024
  • What uniquely identifies a row? → the “id columns”
  • What would be a way in which you could reshape this to “wide”?

1D) Reshaping/Pivoting: meet long data

From long…

# A tibble: 2 × 8
  date       artist              country streams  week week2    month  year
  <date>     <chr>               <chr>     <dbl> <dbl> <chr>    <dbl> <dbl>
1 2024-01-01 10 Beats to Phoenix BE         6983     1 2024-W01     1  2024
2 2024-01-02 10 Beats to Phoenix BE         8451     1 2024-W01     1  2024

…to wide:

streams %>% 
  pivot_wider(id_cols=c('date','artist'),
              names_from=c('country'),         
              values_from=c('streams')) %>% head(6)
# A tibble: 6 × 6
  date       artist                 BE    DE    NL    US
  <date>     <chr>               <dbl> <dbl> <dbl> <dbl>
1 2024-01-01 10 Beats to Phoenix  6983  9523  6258  2856
2 2024-01-02 10 Beats to Phoenix  8451  8601     0  6642
3 2024-01-03 10 Beats to Phoenix     0     0  9712  9648
4 2024-01-04 10 Beats to Phoenix     0   292     0   972
5 2024-01-05 10 Beats to Phoenix  1710     0 12293  7029
6 2024-01-06 10 Beats to Phoenix  3069  2899  2724  1556

1D) Reshaping/Pivoting: meet wide data

DO: Let's take a look at weather.csv

head(weather,3)
# A tibble: 3 × 9
  date       DE_temp BE_temp NL_temp US_temp DE_sun BE_sun NL_sun US_sun
  <date>       <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 2024-01-01   14.3    10.5     10.6    17.1   5.46   5.16   4.31   6.93
2 2024-01-02    8.72   17.1     10.4    21.8   4.88   5.63   3.75  12   
3 2024-01-03   11.7     7.41    11.7    17.4   6.50   6.09   5.06   7.95
  • Which column(s) uniquely identify a record in this data?
  • Which columns hold relevant data?
  • What would be a way in which you could reshape this to “long”?

1D) Reshaping/Pivoting: meet wide data

From wide…

# A tibble: 2 × 9
  date       DE_temp BE_temp NL_temp US_temp DE_sun BE_sun NL_sun US_sun
  <date>       <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 2024-01-01   14.3     10.5    10.6    17.1   5.46   5.16   4.31   6.93
2 2024-01-02    8.72    17.1    10.4    21.8   4.88   5.63   3.75  12   

…to long:

weather %>% 
  pivot_longer(cols = matches("_"), 
               names_to = 'country_measure',
               values_to = 'value') %>% head(4)
# A tibble: 4 × 3
  date       country_measure value
  <date>     <chr>           <dbl>
1 2024-01-01 DE_temp          14.3
2 2024-01-01 BE_temp          10.5
3 2024-01-01 NL_temp          10.6
4 2024-01-01 US_temp          17.1

1D) Pivoting: long to wide (exercise)

DO: Convert the social media metrics (socials.csv) from long to wide

Steps:

  • Only select metric == 'followers' (using filter())
  • Desired output: per date and artist, columns for TikTok, Instagram, etc. holding the value of followers
  • Hint: pivot_wider requires id_cols, names_from, and values_from

Solution

wide = socials %>% filter(metric=='followers') %>%
  pivot_wider(id_cols = c('date','artist'),
              names_from=c('platform'),
              values_from=c('value')) 
head(wide,4)
# A tibble: 4 × 6
  date       artist     TikTok      X Snapchat Instagram
  <date>     <chr>       <dbl>  <dbl>    <dbl>     <dbl>
1 2024-01-01 Saint Echo 458301     NA       NA    198006
2 2024-01-02 Saint Echo 458594 368312   226699        NA
3 2024-01-03 Saint Echo 459599     NA   226957    198549
4 2024-01-04 Saint Echo 460286 369857   227480    198947

1D) Pivoting: wide to long (exercise)

DO: Convert the wide data from before back to long

Hints:

  • pivot_longer requires cols; names_to and values_to are optional.

Solution

wide %>% pivot_longer(cols=all_of(c('TikTok','X','Snapchat','Instagram')))
# A tibble: 142,096 × 4
   date       artist     name       value
   <date>     <chr>      <chr>      <dbl>
 1 2024-01-01 Saint Echo TikTok    458301
 2 2024-01-01 Saint Echo X             NA
 3 2024-01-01 Saint Echo Snapchat      NA
 4 2024-01-01 Saint Echo Instagram 198006
 5 2024-01-02 Saint Echo TikTok    458594
 6 2024-01-02 Saint Echo X         368312
 7 2024-01-02 Saint Echo Snapchat  226699
 8 2024-01-02 Saint Echo Instagram     NA
 9 2024-01-03 Saint Echo TikTok    459599
10 2024-01-03 Saint Echo X             NA
# ℹ 142,086 more rows

1E) Merging/joining

  • How can we assemble different data sets together?
  • We can join them!
    • Why join / why not join? Mostly a memory issue – think about it wisely.
    • Keys: required to join (e.g., a single key, or composite keys)
    • Different types of joins
      • left_join(): keeps everything in the left table, and adds the right table where possible
      • inner_join(): keeps only observations that occur in both tables (risk of losing observations!)
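A toy illustration (made-up data) of how the two join types differ:

```r
library(dplyr)

streams_demo   <- tibble(artist = c("A", "B", "C"), streams = c(10, 20, 30))
playlists_demo <- tibble(artist = c("A", "B"), playlists = c(5, 7))

left_join(streams_demo, playlists_demo, by = "artist")   # 3 rows; artist C gets playlists = NA
inner_join(streams_demo, playlists_demo, by = "artist")  # 2 rows; artist C is dropped
```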

Do: Let's join!

  1. Create an ISO week column for streams (from date, using ISOweek)
  2. Join streams (left) with playlists (right)
    • using a left_join()
    • by artist and week (composite key).

Solution

streams <- streams %>% mutate(week=ISOweek::ISOweek(date))

merged_data <- streams %>%
  left_join(playlists, by = c("artist" = "artist", 
                              "week"="iso_week"))
head(merged_data)
# A tibble: 6 × 13
  date       artist            country streams week  week2 month  year playlists
  <date>     <chr>             <chr>     <dbl> <chr> <chr> <dbl> <dbl>     <dbl>
1 2024-01-01 10 Beats to Phoe… BE         6983 2024… 2024…     1  2024        24
2 2024-01-02 10 Beats to Phoe… BE         8451 2024… 2024…     1  2024        24
3 2024-01-03 10 Beats to Phoe… BE            0 2024… 2024…     1  2024        24
4 2024-01-04 10 Beats to Phoe… BE            0 2024… 2024…     1  2024        24
5 2024-01-05 10 Beats to Phoe… BE         1710 2024… 2024…     1  2024        24
6 2024-01-06 10 Beats to Phoe… BE         3069 2024… 2024…     1  2024        24
# ℹ 4 more variables: playlists_imputed <dbl>, playlists_lagged <dbl>,
#   playlists_lead <dbl>, change <dbl>

Part 2: Case studies and exercises (I)

Task: Please follow the steps below to assemble a data set for analysis, using streams.csv, weather.csv, socials.csv, and playlists.csv.

Tips:

  • When working on this task, adhere to the setup-ITO structure of writing code.
  • Work in an .R file that saves the final data set as final.csv (using write_csv())

Part 2: Case studies and exercises (II)

1) Please aggregate streams.csv to the weekly level (ISO); how will you deal with the daily metric of streams? Mean? Sum?

2) In your newly created data, please create a trend variable, running from 1 (for the first week) up to the total number of weeks (for the last week).
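One possible sketch for steps 1 and 2 – here summing daily streams per ISO week (whether to sum or average is exactly the judgment call posed above); the name weekly is just illustrative, and the ISOweek package from earlier is assumed to be installed:

```r
library(tidyverse)

weekly <- streams %>%
  mutate(week = ISOweek::ISOweek(date)) %>%
  group_by(artist, country, week) %>%
  summarize(streams = sum(streams, na.rm = TRUE), .groups = "drop") %>%
  # trend: 1 for the earliest week, 2 for the next, and so on
  # (ISO week strings like "2024-W01" sort correctly as text)
  mutate(trend = as.integer(factor(week, levels = sort(unique(week)))))
```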

Part 2: Case studies and exercises (III)

3) Create an empty data set (i.e., only key columns): for all artists, and all countries, and all weeks.

4) Merge streams to this newly created data set.
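Steps 3 and 4 could be sketched as follows; weekly is an assumed name for the weekly aggregate from step 1, and the left join keeps every row of the grid:

```r
library(tidyverse)

# step 3: all combinations of artist, country and week observed anywhere
grid <- expand_grid(artist  = unique(weekly$artist),
                    country = unique(weekly$country),
                    week    = unique(weekly$week))

# step 4: merge observed streams onto the complete grid;
# combinations without any observation end up with streams = NA
panel <- grid %>% left_join(weekly, by = c("artist", "country", "week"))
```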

Part 2: Case studies and exercises (III)

5) Load weather data, pivot to long by country & merge to your data (by week & country)

6) Load social media data, pivot to wide and merge with data (by week and artist)

7) Impute any missing data using na.locf or na.approx, depending on what suits best.
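A hedged sketch of steps 5 and 7 (the temp/sun column split relies on the DE_temp-style names shown earlier; panel is an assumed name for the merged data from step 4):

```r
library(tidyverse)

# step 5: pivot weather from wide to long, splitting e.g. "DE_temp"
# into country ("DE") and measure ("temp"), then aggregate to weeks
weather_weekly <- weather %>%
  pivot_longer(cols = -date,
               names_to = c("country", "measure"), names_sep = "_",
               values_to = "value") %>%
  pivot_wider(names_from = measure, values_from = value) %>%  # temp & sun columns
  mutate(week = ISOweek::ISOweek(date)) %>%
  group_by(country, week) %>%
  summarize(temp = mean(temp), sun = mean(sun), .groups = "drop")

panel <- panel %>% left_join(weather_weekly, by = c("week", "country"))

# step 7: impute remaining gaps, e.g., last observation carried forward per series
panel <- panel %>%
  group_by(artist, country) %>%
  arrange(artist, country, week) %>%
  mutate(streams = zoo::na.locf(streams, na.rm = FALSE)) %>%
  ungroup()
```

Step 6 (socials) follows the same pivot_wider() pattern as the exercise in section 1D.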

Part 2: Case studies and exercises (IV)

8) Scaling up I: use a for loop to create 10 line plots of artists' streams, and save them using ggsave()

artists = unique(streams$artist)
for (one_artist in artists[1:10]) {
  filtered_data = streams %>% filter(artist==one_artist)
  # ggplot code here
}
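One hedged way to fill in the plotting step (the plot styling and file names are illustrative, not prescribed):

```r
library(tidyverse)

artists = unique(streams$artist)
for (one_artist in artists[1:10]) {
  filtered_data = streams %>% filter(artist == one_artist)
  p <- ggplot(filtered_data, aes(x = date, y = streams)) +
    geom_line() +
    labs(title = one_artist, x = "Date", y = "Daily streams")
  # turn the artist name into a safe file name
  fn <- paste0("streams_", gsub("[^A-Za-z0-9]", "_", one_artist), ".png")
  ggsave(fn, plot = p, width = 8, height = 4)
}
```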

Part 2: Case studies and exercises (IV)

9) Scaling up II: use lapply to estimate individual regression models (lm) for 10 artists; return the results and summarize them.

# starting code

artists = unique(streams$artist)
results = lapply(artists[1:10], function(one_artist) {
  filtered_data = streams %>% filter(artist==one_artist)
  # estimation code here
  m <- lm(rnorm(10)~1) # just an example
  coefs <- summary(m)$coefficients  # matrix: estimate, se, t, p-value
  results = tibble(variable = rownames(coefs),
                   estimate = coefs[, 1],
                   se       = coefs[, 2],
                   t        = coefs[, 3],
                   pval     = coefs[, 4])
  results = results %>% mutate(artist=one_artist)
  return(results)
})

all_results = bind_rows(results)

Summary of today's tutorial

  • Wrangling is necessary to get data into shape and generate meaning/understanding
  • You've just been introduced to a series of advanced coding concepts (base R functions, regular expressions, dealing with time, pivoting/reshaping and merging)
  • Remember: work in “ITO building blocks” (+ loading libraries, of course!)
  • Have your cheat sheets available always – especially the dplyr one is useful (and very visual!)

Next steps

  • Complete tutorial (part 1) and exercises (part 2)

    • solutions will still be added to the tutorial
  • After class, you will continue with the coaching session

Coaching notes

  • Absolutely important: work in local repositories, with issue-based branches and pull requests.
  • Source code files need to be structured as setup-ITO – required for automation!
  • Choose appropriate file format: Rmarkdowns or Quarto for reports/PDFs/HTML, .R for regular scripts
  • Adhere to tidyverse functions (e.g., read_csv(), not read.csv())
  • Incorporate programming concepts where needed: looping, using functions