Introduction

In this tutorial, we'll tackle common data challenges using a simulated music streaming dataset. Starting with multiple raw data sources, we'll clean, transform, and merge them into a single analysis-ready data frame.

What You'll Learn in this Tutorial

  • Expand your knowledge of R functions (e.g., base functions such as rank() or unique())
  • Handle missing and duplicate values
  • Handle and preprocess string data (remove characters, split using regular expressions)
  • Operationalize variables and handle pivoting (wide to long, long to wide)
  • Merge datasets on one or more indices and understand the different types of joins (inner, left, right)
  • Scale workflows (e.g., batch plotting, estimation of linear models with lm())
  • Apply these techniques to exemplary case studies often encountered in marketing research

Prerequisites

Ready to start? Here's what we recommend you do before going through this tutorial.

Create a new R project

  1. Open RStudio and create a new project by clicking on File -> New Project -> New Directory -> New Project.
  2. Name your project data-preparation and save it in a location of your choice.
  3. Add a data folder:
  • Option 1 (code): Run `dir.create("data")`
  • Option 2 (Manual): Click New Folder (📁) in the Files pane

Download the data

Click here to manually download the dataset. Alternatively, run the following R code to download and unzip the file automatically:

# downloading and unzipping the data
if (!file.exists('week4_data.zip')) {
  download.file('https://github.com/hannesdatta/course-dprep/raw/refs/heads/master/content/docs/modules/week4/tutorial/data/week4_data.zip', 'week4_data.zip')
  unzip('week4_data.zip', overwrite = TRUE)
}

Now, load the data into R using read_csv:

library(tidyverse)

streams <- read_csv('streams.csv')
weather <- read_csv('weather.csv')
socials <- read_csv('socials.csv')
playlists <- read_csv('playlists.csv')

Troubleshooting:

  • If R doesn’t find your file: Click Session → Set working directory → To source file location. Avoid using setwd() — hardcoding directories can cause problems!
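
A quick sanity check is to look at what R currently sees in its working directory. A minimal sketch using base R only; the file names assume the zip was unzipped as described above:

# check where R is looking and whether the csv files are visible there
getwd()       # prints the current working directory
list.files()  # should list streams.csv, weather.csv, socials.csv, playlists.csv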

Understanding the Data

Let's explore the data. It consists of four files that differ in structure, unit of observation, and so on. This is intentional, so you can practice all the concepts covered in this tutorial.

  • streams.csv: number of streams (streams) for an artist in a given country on a given day (date)
  • weather.csv: for each day (date) and country, the average temperature and sun hours, stored in wide format with one column per country and measure (e.g., DE_temp, DE_sun)
  • socials.csv: for every artist and day (date), this data lists social media metrics (e.g., on platform TikTok, a metric followers is observed and has the value 458352)
  • playlists.csv: for each week (iso_week) and artist, the number of Spotify playlists the artist was listed on

Now it's your turn: Explore the data yourself!

We can now explore the data using the strategies introduced in earlier tutorials.

  1. What is the unit of observation? (The unit of analysis tells us what each row in the dataset represents.)
  2. Are there any missing values in the columns? Do values make sense?
  3. What are common columns across the datasets that could be used for merging?

Tips:

  • Take a look at the data using the function head(streams), or View(streams). If you'd like to view more rows with head, use head(streams, 100) (for, e.g., the first 100 rows). (tail() gives you the last rows of the data).
  • The command summary(playlists) generates descriptive statistics for all variables in the data, while glimpse(playlists) shows the structure (column types and a preview of the values). You can also run summary() on individual columns (e.g., summary(playlists$playlists)).
  • Character or factor columns are best inspected using the table() command, which creates frequency tables (see the sketch after the code block below).
# View the first rows of the datasets to get an overview
head(streams)
head(socials)
head(weather)
head(playlists)

glimpse(streams)
glimpse(socials)
glimpse(weather)
glimpse(playlists)
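
As mentioned in the tips, table() is a quick way to inspect character columns. A small sketch using the columns described above:

# frequency tables for character columns
table(streams$country)                   # number of rows per country
table(socials$platform, socials$metric)  # cross-tab of platform by metric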

Part 1: Expanding our knowledge of R

In this section, we will deepen our understanding of key R functions that are essential for data engineering. We will focus on numerical functions, advanced string cleaning, and working with time-based data. Each concept will be demonstrated with examples from our datasets.

1A) Applying base R functions in tidyverse

Before diving into more advanced transformations, let’s start by applying base R functions inside dplyr operations. In mutate() and summarize(), you can use any function from base R—not just mean(), min(), or max(). This is useful when operationalizing new variables, performing custom transformations, or computing statistics on groups.

For now, we'll use min, mean, and max to get started. But remember, the same logic applies to any function, like sd(), range(), var(), or even custom functions you define.

Compute summary statistics for streaming data

library(dplyr)

# Compute basic statistics per artist
streams %>%
  group_by(artist) %>%
  summarize(
    min_streams = min(streams, na.rm = TRUE),
    mean_streams = mean(streams, na.rm = TRUE),
    max_streams = max(streams, na.rm = TRUE),
    sd_streams = sd(streams, na.rm = TRUE)  # Standard deviation
    
  )
## # A tibble: 107 × 5
##    artist               min_streams mean_streams max_streams sd_streams
##    <chr>                      <dbl>        <dbl>       <dbl>      <dbl>
##  1 10 Beats to Phoenix            0        5519.       24868      5121.
##  2 10 Notes to Wolves             0        5795.       21932      4697.
##  3 30 Minutes to Mirage           0        5021.       22207      4468.
##  4 30 Minutes to Rebels           0        5034.       24260      4818.
##  5 30 Minutes to Saints           0        5098.       20977      4627.
##  6 30 Ways to Storm               0        5865.       24271      4948.
##  7 40 Days to Rebels              0        5082.       21415      4667.
##  8 40 Miles to Storm              0        5386.       24915      4823.
##  9 60 Notes to Echo               0        4558.       22690      4519.
## 10 77 Beats to Wolves             0        5661.       24751      4827.
## # ℹ 97 more rows

Want to go further? Any function that works on a vector can be used here, for example (see the sketch after the tip below):

  • var(streams, na.rm = TRUE) to calculate variance.
  • range(streams, na.rm = TRUE) to get the min and max in one step.
  • quantile(streams, probs = c(0.25, 0.5, 0.75), na.rm = TRUE) to get quartiles.

💡 Tip: You can check out the Base R Cheatsheet to explore more functions you can integrate into your tidyverse workflows.
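
To illustrate, here is a minimal sketch that plugs quantile() and a self-defined helper into summarize(). The helper share_zero() is purely illustrative and not part of the original tutorial code:

library(dplyr)

# custom helper: share of days with zero streams
share_zero <- function(x) mean(x == 0, na.rm = TRUE)

streams %>%
  group_by(artist) %>%
  summarize(
    q25 = quantile(streams, probs = 0.25, na.rm = TRUE),
    median_streams = quantile(streams, probs = 0.50, na.rm = TRUE),
    q75 = quantile(streams, probs = 0.75, na.rm = TRUE),
    zero_share = share_zero(streams)
  )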

Ranking: Identifying the Most-Listened-To Artists

Ranking is super useful in our application because we may be interested in identifying the most-listened-to artist or tracking changes in ranking over time. Instead of just looking at absolute numbers, ranking allows us to order values while handling ties, which is often critical in reporting and analysis.

The rank() function assigns ranks based on the values in a column, with ties receiving the same rank. By default, it ranks in ascending order, but using -x flips the order to descending. This is particularly useful when working with streams data, where we want to rank artists by their daily streaming count.

Example: Rank Artists by Streams Per Day

library(dplyr)

# Rank artists within each day
streams %>%
  group_by(date) %>%
  mutate(
    rank_streams = rank(-streams)  # Higher streams = lower rank (1st place)
   ) %>%
  arrange(date, rank_streams) %>%
  select(date, artist, streams, rank_streams) %>%
  head(10)
## # A tibble: 10 × 4
## # Groups:   date [1]
##    date       artist            streams rank_streams
##    <date>     <chr>               <dbl>        <dbl>
##  1 2024-01-01 Rockets Band        18348            1
##  2 2024-01-01 Saint Echo          17092            2
##  3 2024-01-01 Midnight Mirage     16080            3
##  4 2024-01-01 Storm Fusion        15635            4
##  5 2024-01-01 Midnight Rockets    15428            5
##  6 2024-01-01 Wolves Band         15031            6
##  7 2024-01-01 Saint Storm         14941            7
##  8 2024-01-01 Wolves Experience   14424            8
##  9 2024-01-01 Sophia Miller       14248            9
## 10 2024-01-01 Logan Brown         14221           10

💡 Other useful ranking functions:

  • row_number(x): Assigns unique sequential numbers without ties.

Using these functions, we can create trend variables (e.g., running from 1 to N, for all observations of an artist). Using this variable in a regression model, for example, could help us understand streaming trends by country.

Example: Create a trend variable & estimate a linear regression

streams <- streams %>%
  group_by(artist) %>%
  arrange(artist, date) %>%
  mutate(trend = row_number()) %>%
  ungroup()

head(streams, 15)
## # A tibble: 15 × 5
##    date       artist              country streams trend
##    <date>     <chr>               <chr>     <dbl> <int>
##  1 2024-01-01 10 Beats to Phoenix BE         6983     1
##  2 2024-01-01 10 Beats to Phoenix DE         9523     2
##  3 2024-01-01 10 Beats to Phoenix NL         6258     3
##  4 2024-01-01 10 Beats to Phoenix US         2856     4
##  5 2024-01-02 10 Beats to Phoenix BE         8451     5
##  6 2024-01-02 10 Beats to Phoenix DE         8601     6
##  7 2024-01-02 10 Beats to Phoenix NL            0     7
##  8 2024-01-02 10 Beats to Phoenix US         6642     8
##  9 2024-01-03 10 Beats to Phoenix BE            0     9
## 10 2024-01-03 10 Beats to Phoenix DE            0    10
## 11 2024-01-03 10 Beats to Phoenix NL         9712    11
## 12 2024-01-03 10 Beats to Phoenix US         9648    12
## 13 2024-01-04 10 Beats to Phoenix BE            0    13
## 14 2024-01-04 10 Beats to Phoenix DE          292    14
## 15 2024-01-04 10 Beats to Phoenix NL            0    15
# overall trend
m <- lm(streams ~ 1 + trend, data = streams)
summary(m)
## 
## Call:
## lm(formula = streams ~ 1 + trend, data = streams)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6074.1 -4275.5  -818.8  3085.3 24313.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.096e+03  2.865e+01  142.98   <2e-16 ***
## trend       1.531e+00  3.992e-02   38.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4721 on 104529 degrees of freedom
##   (31293 observations deleted due to missingness)
## Multiple R-squared:  0.01387,    Adjusted R-squared:  0.01386 
## F-statistic:  1470 on 1 and 104529 DF,  p-value: < 2.2e-16
# by country
m2 <- lm(streams ~ 1 + trend*country, data = streams)
summary(m2)
## 
## Call:
## lm(formula = streams ~ 1 + trend * country, data = streams)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9051.6 -3569.1  -760.7  2909.6 23727.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.055e+03  5.367e+01  56.913  < 2e-16 ***
## trend           1.097e+00  7.496e-02  14.631  < 2e-16 ***
## countryDE       6.859e+02  7.608e+01   9.016  < 2e-16 ***
## countryNL       1.988e+01  7.616e+01   0.261   0.7940    
## countryUS       3.471e+03  7.609e+01  45.619  < 2e-16 ***
## trend:countryDE 5.308e-01  1.062e-01   5.000 5.73e-07 ***
## trend:countryNL 3.236e-01  1.061e-01   3.049   0.0023 ** 
## trend:countryUS 8.583e-01  1.061e-01   8.090 6.04e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4440 on 104523 degrees of freedom
##   (31293 observations deleted due to missingness)
## Multiple R-squared:  0.1278, Adjusted R-squared:  0.1277 
## F-statistic:  2187 on 7 and 104523 DF,  p-value: < 2.2e-16

Dealing with Unique Values: Removing Duplicates and Counting Distinct Entries

When working with data, we often need to remove duplicate values or count unique occurrences. Whether we're filtering artist names, cleaning up datasets, or summarizing trends, uniqueness matters.

In tidyverse, we have a few tools to help:

  • unique() removes duplicate values from a vector (or duplicate rows from a data frame).
  • distinct() (from dplyr) removes duplicate rows from a data frame.
  • n_distinct() counts the number of unique values in a column.

Example: Get a List of Unique Artists

Let’s start simple—getting a list of all the unique artists in our dataset.

unique_artists <- unique(streams$artist)  # Extract unique artist names
head(unique_artists, 10)  # Show a preview
##  [1] "10 Beats to Phoenix"  "10 Notes to Wolves"   "30 Minutes to Mirage"
##  [4] "30 Minutes to Rebels" "30 Minutes to Saints" "30 Ways to Storm"    
##  [7] "40 Days to Rebels"    "40 Miles to Storm"    "60 Notes to Echo"    
## [10] "77 Beats to Wolves"

You can also use unique() inside a mutate() or summarize() call, for example to count distinct values per group, as sketched below.
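
Here's a minimal sketch: counting distinct countries per artist with length(unique(x)) inside mutate(), which gives the same result as n_distinct(x):

streams %>%
  group_by(artist) %>%
  mutate(n_countries = length(unique(country))) %>%  # same as n_distinct(country)
  ungroup() %>%
  select(artist, n_countries) %>%
  distinct()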

Example: Count the Number of Unique Countries

Now, let's count how many different countries appear in our dataset:

streams %>%
  summarize(num_unique_countries = n_distinct(country))
## # A tibble: 1 × 1
##   num_unique_countries
##                  <int>
## 1                    4

1B) Advanced Cleaning: Using Regular Expressions

Regular expressions (regex) allow us to search for patterns in text, making it easier to clean and filter data. In R, we use:

  • grepl(x, y) → detect whether y contains x (evaluates to TRUE or FALSE)
  • grep(x, y, value = TRUE) → returns the elements of y that match x (without value = TRUE, it returns their positions instead)
  • gsub(x, z, y) → replace matched patterns (x) with new values (z)

Common Regex Search Patterns

Pattern     Description                      Example
\\d         Find numbers (0-9)               grepl("\\d", "Song123") → TRUE
[a-zA-Z]    Find letters (case-sensitive)    grepl("[a-zA-Z]", "123!") → FALSE
^word       Match beginning of a string      grepl("^The", "The Beatles") → TRUE
word$       Match end of a string            grepl("Band$", "Jazz Band") → TRUE

grepl(): Finding Patterns in Text

Use grepl() to detect whether a pattern exists in a column. It returns TRUE/FALSE, making it useful for filtering data.

Example: Identify artists with "Beat" in their name

streams %>%
  filter(grepl("Beat", artist)) %>%
  select(artist) %>%
  distinct()
## # A tibble: 2 × 1
##   artist             
##   <chr>              
## 1 10 Beats to Phoenix
## 2 77 Beats to Wolves

Example: Find artists whose names contain numbers

streams %>%
  filter(grepl("[0-9]", artist)) %>%
  select(artist) %>%
  distinct()
## # A tibble: 15 × 1
##    artist              
##    <chr>               
##  1 10 Beats to Phoenix 
##  2 10 Notes to Wolves  
##  3 30 Minutes to Mirage
##  4 30 Minutes to Rebels
##  5 30 Minutes to Saints
##  6 30 Ways to Storm    
##  7 40 Days to Rebels   
##  8 40 Miles to Storm   
##  9 60 Notes to Echo    
## 10 77 Beats to Wolves  
## 11 77 Miles to Storm   
## 12 77 Steps to Phoenix 
## 13 77 Ways to Riders   
## 14 88 Steps to Storm   
## 15 88 Ways to Rebels

Example: Find artists starting with "The" (case-insensitive)

streams %>%
  filter(grepl("^The", artist, ignore.case = TRUE)) %>%
  select(artist) %>%
  distinct()
## # A tibble: 3 × 1
##   artist    
##   <chr>     
## 1 The Mirage
## 2 The Rebels
## 3 The Wolves

Example: Find artists ending with "Band"

streams %>%
  filter(grepl("Band$", artist)) %>%
  select(artist) %>%
  distinct()
## # A tibble: 3 × 1
##   artist      
##   <chr>       
## 1 Riders Band 
## 2 Rockets Band
## 3 Wolves Band
# alternative, using grep()
streams %>% pull(artist) %>% grep("Band$", ., value=T) %>% unique()
## [1] "Riders Band"  "Rockets Band" "Wolves Band"

gsub(): Replacing Patterns in Text

Use gsub() to modify text by replacing matched patterns.

Example: Clean up spaces in artist names (replace with nothing)

streams %>%
  mutate(artist_clean = gsub(" ", "", artist))
## # A tibble: 135,824 × 6
##    date       artist              country streams trend artist_clean    
##    <date>     <chr>               <chr>     <dbl> <int> <chr>           
##  1 2024-01-01 10 Beats to Phoenix BE         6983     1 10BeatstoPhoenix
##  2 2024-01-01 10 Beats to Phoenix DE         9523     2 10BeatstoPhoenix
##  3 2024-01-01 10 Beats to Phoenix NL         6258     3 10BeatstoPhoenix
##  4 2024-01-01 10 Beats to Phoenix US         2856     4 10BeatstoPhoenix
##  5 2024-01-02 10 Beats to Phoenix BE         8451     5 10BeatstoPhoenix
##  6 2024-01-02 10 Beats to Phoenix DE         8601     6 10BeatstoPhoenix
##  7 2024-01-02 10 Beats to Phoenix NL            0     7 10BeatstoPhoenix
##  8 2024-01-02 10 Beats to Phoenix US         6642     8 10BeatstoPhoenix
##  9 2024-01-03 10 Beats to Phoenix BE            0     9 10BeatstoPhoenix
## 10 2024-01-03 10 Beats to Phoenix DE            0    10 10BeatstoPhoenix
## # ℹ 135,814 more rows

Example: Remove numbers from artist names

streams %>% 
  mutate(artist_clean = gsub("\\d", "", artist))
## # A tibble: 135,824 × 6
##    date       artist              country streams trend artist_clean       
##    <date>     <chr>               <chr>     <dbl> <int> <chr>              
##  1 2024-01-01 10 Beats to Phoenix BE         6983     1 " Beats to Phoenix"
##  2 2024-01-01 10 Beats to Phoenix DE         9523     2 " Beats to Phoenix"
##  3 2024-01-01 10 Beats to Phoenix NL         6258     3 " Beats to Phoenix"
##  4 2024-01-01 10 Beats to Phoenix US         2856     4 " Beats to Phoenix"
##  5 2024-01-02 10 Beats to Phoenix BE         8451     5 " Beats to Phoenix"
##  6 2024-01-02 10 Beats to Phoenix DE         8601     6 " Beats to Phoenix"
##  7 2024-01-02 10 Beats to Phoenix NL            0     7 " Beats to Phoenix"
##  8 2024-01-02 10 Beats to Phoenix US         6642     8 " Beats to Phoenix"
##  9 2024-01-03 10 Beats to Phoenix BE            0     9 " Beats to Phoenix"
## 10 2024-01-03 10 Beats to Phoenix DE            0    10 " Beats to Phoenix"
## # ℹ 135,814 more rows

1C) Dealing with Time

Time-related data comes in many forms—years, months, days, hours, minutes, and seconds—each with its own quirks. In R, handling time efficiently requires careful attention to conversion, formatting, and interpretation. Whether you're working with timestamps, durations, or calendar-based data, understanding how to manipulate and analyze time correctly is crucial for avoiding errors and ensuring meaningful insights.

In this subsection, we’ll cover:
1. Converting character data to date/time formats
2. Switching between different time units (e.g., minutes to hours, days to weeks)
3. Accounting for the time properties of your data (e.g., time zones, missing values, and irregular intervals)

With dplyr and supporting packages like lubridate, ISOweek and zoo, R provides powerful tools for working with time in a structured and intuitive way. Let’s dive in!

Tip: make sure you have all of these packages installed.

install.packages(c('lubridate','ISOweek','zoo'))

Conversion between data formats

Working with dates and times in R can be tricky, especially when data isn't stored in a standard format. Raw datasets often contain date values as character strings, which need to be converted into proper date-time objects for analysis. For example, you might encounter dates stored as "20250104" (YYYYMMDD) instead of the more commonly recognized "2025-01-04" (YYYY-MM-DD).

To work effectively with dates, we need to recode and standardize them into a format R can recognize. This allows for easier filtering, grouping, and aggregation—essential when analyzing trends over time, such as weekly sales or monthly user activity.

Example: Convert a string to a date

library(lubridate)
bad_dates = c('20250104')

good_date = as.Date(bad_dates, format='%Y%m%d') # using base R
good_date
## [1] "2025-01-04"
good_date2 = lubridate::ymd(bad_dates) # alternative using lubridate
good_date2
## [1] "2025-01-04"

Switching between date formats

Example: Convert date to week and extract the year

week(date) from lubridate gives the week number (1-53) based on the Gregorian calendar, starting on January 1st. In contrast, ISOweek(date) follows the ISO 8601 standard, where weeks start on Monday, and Week 1 is the first week with at least four days in the new year. ISOweek also includes the year (YYYY-Www) and may be more appropriate when you need full weeks with exactly seven days.

library(lubridate)

streams <- streams %>%
  mutate(
    week = week(date),
    week2 = ISOweek::ISOweek(date),
    month = month(date),
    year = year(date)
  )

streams %>% select(date, week, week2, month, year) %>% head(10)
## # A tibble: 10 × 5
##    date        week week2    month  year
##    <date>     <dbl> <chr>    <dbl> <dbl>
##  1 2024-01-01     1 2024-W01     1  2024
##  2 2024-01-01     1 2024-W01     1  2024
##  3 2024-01-01     1 2024-W01     1  2024
##  4 2024-01-01     1 2024-W01     1  2024
##  5 2024-01-02     1 2024-W01     1  2024
##  6 2024-01-02     1 2024-W01     1  2024
##  7 2024-01-02     1 2024-W01     1  2024
##  8 2024-01-02     1 2024-W01     1  2024
##  9 2024-01-03     1 2024-W01     1  2024
## 10 2024-01-03     1 2024-W01     1  2024
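
The difference between the two week definitions shows up around the turn of the year. A small sketch with a hand-picked date:

d <- as.Date("2024-12-30")  # a Monday that ISO 8601 assigns to week 1 of 2025
week(d)                     # 53: week() simply counts 7-day blocks from January 1
ISOweek::ISOweek(d)         # "2025-W01"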

Example: Aggregate streams by month (taking the mean, i.e., the average number of streams per day within a month)

Aggregating data over different time frames helps us see trends, patterns, and long-term changes more clearly. It’s also useful when combining datasets with different time levels—for example, if one dataset has daily data and another has weekly data, it makes sense to convert both to the same time unit. Aggregation also helps when there are missing values; instead of struggling with incomplete daily data, taking the average per week gives a more reliable view.

Using higher time units can also smooth out random ups and downs, making trends easier to spot, reduce the amount of data to process, and match real-world reporting, like weekly sales reports or monthly budgets. The best level of aggregation depends on the question you’re trying to answer and how complete your data is.

monthly_streams <- streams %>%
  group_by(year, month, artist) %>%
  summarize(total_streams = mean(streams, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
monthly_streams %>% head(10)
## # A tibble: 10 × 4
## # Groups:   year, month [1]
##     year month artist               total_streams
##    <dbl> <dbl> <chr>                        <dbl>
##  1  2024     1 10 Beats to Phoenix          3532.
##  2  2024     1 10 Notes to Wolves           5431.
##  3  2024     1 30 Minutes to Mirage         4751.
##  4  2024     1 30 Minutes to Rebels         4605.
##  5  2024     1 30 Minutes to Saints         4679.
##  6  2024     1 30 Ways to Storm             4562.
##  7  2024     1 40 Days to Rebels            5290.
##  8  2024     1 40 Miles to Storm            4739.
##  9  2024     1 60 Notes to Echo             4672.
## 10  2024     1 77 Beats to Wolves           5077.

Dealing with time properties of data

In time series data, some missing values (NAs) can be reasonably estimated rather than left blank. Instead of treating them as unknown, we can make plausible assumptions based on how the data typically behaves. Looking at the simulated dataset below (sim_data), we see missing values in both the price and playlists columns.

For prices, a good assumption is that they remain the same until updated. This approach, called last observation carried forward (LOCF), means we can fill missing values with the last observed value. For example, on 2025-01-03, the price is missing, but since it was 1.1 on 2025-01-02, we assume it stayed the same.

tmp_dates=seq(from=as.Date('2025-01-01'), length.out=10, by='1 day')

sim_data= tibble(date=tmp_dates, price = c(1.20,1.10,NA,1.10,1.5,1.5,NA, NA, 1.4, 1.3),
                 playlists = c(30,20,NA,10,NA, NA, 30, 40, NA, NA))
sim_data
## # A tibble: 10 × 3
##    date       price playlists
##    <date>     <dbl>     <dbl>
##  1 2025-01-01   1.2        30
##  2 2025-01-02   1.1        20
##  3 2025-01-03  NA          NA
##  4 2025-01-04   1.1        10
##  5 2025-01-05   1.5        NA
##  6 2025-01-06   1.5        NA
##  7 2025-01-07  NA          30
##  8 2025-01-08  NA          40
##  9 2025-01-09   1.4        NA
## 10 2025-01-10   1.3        NA

Example: Fill missing prices using Last Value Carried Forward (na.locf)

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
sim_data = sim_data %>% mutate(prices_imputed = zoo::na.locf(price, na.rm=F))
sim_data
## # A tibble: 10 × 4
##    date       price playlists prices_imputed
##    <date>     <dbl>     <dbl>          <dbl>
##  1 2025-01-01   1.2        30            1.2
##  2 2025-01-02   1.1        20            1.1
##  3 2025-01-03  NA          NA            1.1
##  4 2025-01-04   1.1        10            1.1
##  5 2025-01-05   1.5        NA            1.5
##  6 2025-01-06   1.5        NA            1.5
##  7 2025-01-07  NA          30            1.5
##  8 2025-01-08  NA          40            1.5
##  9 2025-01-09   1.4        NA            1.4
## 10 2025-01-10   1.3        NA            1.3

Example: Fill missing playlist stats using Linear Interpolation (na.approx)

For playlists, a better approach is linear interpolation, which estimates values by assuming a smooth change over time. Instead of leaving 2025-01-03 blank, we can calculate its value as the midpoint between 2025-01-02 (20) and 2025-01-04 (10), which gives 15.

library(zoo)
sim_data = sim_data %>% mutate(playlists_imputed = zoo::na.approx(playlists, na.rm=F))
sim_data
## # A tibble: 10 × 5
##    date       price playlists prices_imputed playlists_imputed
##    <date>     <dbl>     <dbl>          <dbl>             <dbl>
##  1 2025-01-01   1.2        30            1.2              30  
##  2 2025-01-02   1.1        20            1.1              20  
##  3 2025-01-03  NA          NA            1.1              15  
##  4 2025-01-04   1.1        10            1.1              10  
##  5 2025-01-05   1.5        NA            1.5              16.7
##  6 2025-01-06   1.5        NA            1.5              23.3
##  7 2025-01-07  NA          30            1.5              30  
##  8 2025-01-08  NA          40            1.5              40  
##  9 2025-01-09   1.4        NA            1.4              NA  
## 10 2025-01-10   1.3        NA            1.3              NA

Example: Application to our data (using na.approx)

When imputing missing values, always group by the right variables before applying functions like na.locf(). If you forget to group_by(artist, country), missing stream counts might get filled across different artists or countries, leading to completely nonsensical data. At the same time, don’t forget to ungroup() afterward—otherwise, later operations might still be applied within each group, causing unexpected results. Grouping is powerful, but if used carelessly, it can seriously mess up your data!

library(zoo)

# This snippet fills missing stream counts by linear interpolation (na.approx), within each artist-country group
streams <- streams %>%
  group_by(artist, country) %>%
  mutate(streams_filled = na.approx(streams, na.rm = FALSE))

Lags, leads, and changes

Other useful variables in time series analysis are lags and leads. Lags shift past values forward (lag()), helping us compare a current value to a previous one, while leads shift future values backward (lead()), allowing us to see what comes next. For example, if a song had 5,000 streams on Monday and 5,500 on Tuesday, a lag would bring Monday’s value into Tuesday’s row, making comparisons easier. In the next example, we’ll use lags to track how streams change over time.

Example: Create lags and compute daily change in streams

streams <- streams %>%
  group_by(artist, country) %>%
  arrange(date) %>%
  mutate(
    lag_streams = lag(streams_filled),
    daily_change = streams_filled - lag_streams
  )

streams %>% arrange(artist,country,date) %>% select(artist, country, date, streams, lag_streams, daily_change) %>% head(10)
## # A tibble: 10 × 6
## # Groups:   artist, country [1]
##    artist              country date       streams lag_streams daily_change
##    <chr>               <chr>   <date>       <dbl>       <dbl>        <dbl>
##  1 10 Beats to Phoenix BE      2024-01-01    6983         NA           NA 
##  2 10 Beats to Phoenix BE      2024-01-02    8451       6983         1468 
##  3 10 Beats to Phoenix BE      2024-01-03       0       8451        -8451 
##  4 10 Beats to Phoenix BE      2024-01-04       0          0            0 
##  5 10 Beats to Phoenix BE      2024-01-05    1710          0         1710 
##  6 10 Beats to Phoenix BE      2024-01-06    3069       1710         1359 
##  7 10 Beats to Phoenix BE      2024-01-07      NA       3069        -1202.
##  8 10 Beats to Phoenix BE      2024-01-08     666       1868.       -1202.
##  9 10 Beats to Phoenix BE      2024-01-09       0        666         -666 
## 10 10 Beats to Phoenix BE      2024-01-10       0          0            0
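
lead() works the same way but looks forward in time. A brief sketch building on the columns created above:

streams %>%
  group_by(artist, country) %>%
  arrange(date) %>%
  mutate(next_day_streams = lead(streams_filled)) %>%  # tomorrow's value in today's row
  ungroup() %>%
  select(artist, country, date, streams_filled, next_day_streams) %>%
  head(5)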

1D) Pivoting Data: Reshaping from Long to Wide and Wide to Long

Data is often stored in different formats, depending on the analysis needs. Sometimes, we need to widen a dataset to make it easier to read, and other times, we need to lengthen it to perform certain operations. The tidyr package in tidyverse provides pivoting functions to reshape data efficiently.

  • Wide format: Each row represents a single entity, with multiple columns for different variables.
  • Long format: Each observation is in its own row, with a single column for values and an additional column indicating the variable type.

Format   Example structure                                                           Example dataset
Wide     One row per date, multiple columns for the weather indicators by country    weather.csv
Long     One row per artist-date-country combination, a single column for streams    streams.csv

From Wide to Long: pivot_longer()

When converting wide data to long format, we gather multiple columns into key-value pairs. This is useful when variables are spread across columns but should be stored as rows.

Key arguments in pivot_longer():

  1. cols – Selects which columns to pivot.
    • Can specify by name (c(BE_temp, DE_temp)) or range (BE_temp:US_temp).
    • Use -date or !date to pivot everything except certain columns.
  2. names_to – Defines where column names will be stored.
    • Example: "country" if columns like DE_temp and BE_temp should be labeled by country.
  3. values_to – Defines where the actual data values go.
    • Example: "temperature" if numeric values from DE_temp and BE_temp represent temperature readings.

This transformation makes it easier to filter, group, and analyze structured data.

Example: Make weather data long by country

The weather dataset stores temperature by country in wide format (one column per country, e.g., BE_temp, DE_temp). We want to convert it into long format, with one row per date-country combination and a single temperature column.

long_weather <- weather %>%
  pivot_longer(cols = ends_with('temp'), names_to = "name", values_to = "temperature") %>%
  separate(name, into = c('country', 'variable'), sep = '_') %>%
  select(date, country, temperature)

head(long_weather)
## # A tibble: 6 × 3
##   date       country temperature
##   <date>     <chr>         <dbl>
## 1 2024-01-01 DE            14.3 
## 2 2024-01-01 BE            10.5 
## 3 2024-01-01 NL            10.6 
## 4 2024-01-01 US            17.1 
## 5 2024-01-02 DE             8.72
## 6 2024-01-02 BE            17.1

Before pivoting (wide format):

date         BE_temp   DE_temp
2024-01-01   10.5      14.3
2024-01-02   17.1      8.7

After pivoting (long format):

date         country   temperature
2024-01-01   BE        10.5
2024-01-01   DE        14.3
2024-01-02   BE        17.1
2024-01-02   DE        8.7

From Long to Wide: pivot_wider()

When converting long data to wide format, we spread values across multiple columns, making the dataset easier to read and analyze.

Key arguments in pivot_wider():

  1. id_cols – Specifies the column(s) that should remain unchanged.
    • Example: "date" if we want to keep dates as row identifiers while spreading other values.
    • If not specified, pivot_wider() assumes all non-selected columns should be spread.
  2. names_from – Defines which column contains the new column names.
    • Example: "country" if values from "country" (DE, BE, NL) should become separate columns.
  3. values_from – Specifies which column contains the data to be spread across new columns.
    • Example: "temperature" if we want each country's temperature as its own column.

This transformation is useful when preparing data for reports, modeling, or visualizations.

Example: Make long_weather wide again

wide_weather <- long_weather %>% 
  pivot_wider(names_from = 'country', values_from = 'temperature')

head(wide_weather)
## # A tibble: 6 × 5
##   date          DE    BE    NL    US
##   <date>     <dbl> <dbl> <dbl> <dbl>
## 1 2024-01-01 14.3  10.5  10.6   17.1
## 2 2024-01-02  8.72 17.1  10.4   21.8
## 3 2024-01-03 11.7   7.41 11.7   17.4
## 4 2024-01-04 12.7  10.5   8.78  15.3
## 5 2024-01-05 12.2  13.0   9.82  29.3
## 6 2024-01-06 10.9  17.0  13.4   13.3

Before pivoting (long format):

date         country   temperature
2024-01-01   BE        10.5
2024-01-01   DE        14.3
2024-01-02   BE        17.1
2024-01-02   DE        8.7

After pivoting (wide format):

date         DE     BE
2024-01-01   14.3   10.5
2024-01-02   8.7    17.1

For more details, check out the tidyverse cheatsheet on pivoting:
tidyr Pivoting Cheatsheet

1E) Merging Data: Combining Multiple Data Sources

In real-world analysis, data is often stored in separate tables. To get meaningful insights, we need to merge datasets based on shared keys. The dplyr package provides efficient functions for this, making it easy to combine and manipulate data.

Joins allow us to merge two datasets by matching values in a key column. Here’s an overview of different join types:

Join types

(Figure: Common Join Types)

Here, we will focus on the most important ones:

Join Type                    Keeps All Rows From                   Drops Rows?
Left Join (left_join())      Left table                            No
Inner Join (inner_join())    Only matching rows                    Yes
Right Join (right_join())    Right table                           No
Full Join (full_join())      Both tables (fills missing values)    No

The left join is the default choice in most cases because it keeps all rows from the left dataset and only adds matching values from the right. Inner joins should be used with caution because they remove rows that do not have a match.

What Are Keys in a Join?

To merge data correctly, we need keys—columns that uniquely identify records in each dataset.

  • Single key: Merging by a single column like "date".
  • Composite key: Merging by multiple columns like "date" and "artist".

Without defining a proper key, the merge may not work as expected, or it may introduce duplicate rows.
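
Before merging, it can help to verify that your candidate key really is unique in the table you are joining in. A minimal sketch using count():

# zero rows returned means the key combination uniquely identifies each record
socials %>%
  count(date, artist, platform, metric) %>%
  filter(n > 1)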

Examples of Joins in Action

Example: Inner Join (Only Matching Rows)

Keeps only rows where a match exists in both streams and socials.

tiktok_followers = socials %>% filter(platform=='TikTok'  & metric == 'followers')

merged_data <- streams %>%
  inner_join(tiktok_followers, by = c("date", "artist"))  # Match by both columns

# rename value column
merged_data = merged_data %>% mutate(tiktok_followers = value) %>% select(-platform, -metric, -value)

head(merged_data)
## # A tibble: 6 × 13
## # Groups:   artist, country [6]
##   date       artist country streams trend week  week2 month  year streams_filled
##   <date>     <chr>  <chr>     <dbl> <int> <chr> <chr> <dbl> <dbl>          <dbl>
## 1 2024-01-01 10 Be… BE         6983     1 2024… 2024…     1  2024           6983
## 2 2024-01-01 10 Be… DE         9523     2 2024… 2024…     1  2024           9523
## 3 2024-01-01 10 Be… NL         6258     3 2024… 2024…     1  2024           6258
## 4 2024-01-01 10 Be… US         2856     4 2024… 2024…     1  2024           2856
## 5 2024-01-01 10 No… BE         7186     1 2024… 2024…     1  2024           7186
## 6 2024-01-01 10 No… DE         4173     2 2024… 2024…     1  2024           4173
## # ℹ 3 more variables: lag_streams <dbl>, daily_change <dbl>,
## #   tiktok_followers <dbl>

Things to keep in mind:

  • Rows are removed if an artist has streams but no corresponding social media data on that date.
  • This join is only appropriate when your analysis requires that both datasets contain information for every row you keep (for the alternative, see the left join sketch below).
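
If you instead want to keep every streaming record, even when no TikTok data is available for that artist and date, a left join is the safer default. A sketch reusing tiktok_followers from above (assuming it contains at most one row per artist-date combination):

merged_left <- streams %>%
  left_join(tiktok_followers, by = c("date", "artist"))

# all stream rows are retained; unmatched rows get NA for the TikTok columns
nrow(merged_left) == nrow(streams)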

Note: Before dplyr, R users relied on the merge() function from base R. It works similarly but is less readable and much slower. In other words: please use tidyverse/dplyr.
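
For reference, a rough base R equivalent of the inner join shown earlier (merge() performs an inner join on the by columns by default):

merged_base <- merge(streams, tiktok_followers, by = c("date", "artist"))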

For more details, check out the official dplyr cheatsheet on joins:
Joins Cheatsheet

You've Made It! 🎉

Congratulations! You've tackled the fundamentals of preparing data in R, from handling messy date formats and reshaping datasets to computing lags and filling in missing values. Now, it's time to put your skills to the test with a real-world challenge, one that reflects the types of problems you'd face in data analysis.

Part 2: Case study and Exercises

Before any meaningful analysis can happen, data needs to be cleaned, structured, and prepared—a process that often takes up more time than the actual modeling. In real-world marketing analytics, raw datasets are rarely ready for immediate use. Instead, they come in different formats, with missing values, misaligned timeframes, and inconsistencies that need to be resolved before drawing any conclusions.

In this case study, you'll work with four datasets (streams.csv, weather.csv, socials.csv, and playlists.csv) to assemble a dataset ready for analysis. This involves aggregating time-based data, handling missing values, and merging different data sources in a way that ensures consistency.

Throughout this process, you’ll need to make critical decisions: Should streams be summed or averaged at the weekly level? How do you ensure all artists and countries are accounted for? What’s the best way to impute missing data? These choices impact the quality and reliability of your final dataset.

Once the data is structured, you’ll take it further—visualizing trends and running artist-specific regressions. This is an opportunity to apply what you’ve learned and gain experience with the kinds of data challenges analysts deal with every day.

💡 A few things to keep in mind:
- Follow the setup-ITO structure to keep your workflow organized.
- Work in an .R file and save your final dataset as final.csv.
- Think carefully about your approach—there’s no single correct answer, but some methods are more appropriate than others.

Exercise 1: Assemble a Dataset for Analysis

Load the datasets.

💡 Tips:

  • Follow the setup-ITO structure to keep your code organized.
  • Ensure all datasets are properly read into R and inspected for missing values.
library(tidyverse)

# Load data
streams <- read_csv("streams.csv")
## Rows: 135824 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): artist, country
## dbl  (1): streams
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
weather <- read_csv("weather.csv")
## Rows: 365 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (8): DE_temp, BE_temp, NL_temp, US_temp, DE_sun, BE_sun, NL_sun, US_sun
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
socials <- read_csv("socials.csv")
## Rows: 267072 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): artist, platform, metric
## dbl  (1): value
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
playlists <- read_csv("playlists.csv")
## Rows: 4494 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): iso_week, artist
## dbl (1): playlists
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# No transformation or output generated so far... (i.e., we only have setup and input)

Exercise 2: Aggregate Streams to the Weekly Level (ISO Weeks)

Convert daily streams into weekly aggregates using ISO weeks. Take the mean() so that you obtain a weekly average, which remains valid even if individual days within a week are missing.

💡 Tip:

  • Summing would work too, but only if you have complete data. In case you want to still measure "total" streams, you could also take the mean()*7.
library(lubridate)

streams_aggregated <- streams %>%
  mutate(week = ISOweek::ISOweek(date)) %>%
  group_by(artist, country, week) %>%
  summarize(
    weekly_streams = 7*mean(streams, na.rm = TRUE) # or use mean()
  )
## `summarise()` has grouped output by 'artist', 'country'. You can override using
## the `.groups` argument.

Exercise 3: Create a Trend Variable

Add a trend variable that increases from 1 (for the first week) to the last. We can later use this variable in a regression analysis.

streams_aggregated <- streams_aggregated %>%
  group_by(artist,country) %>% arrange(artist, country, week) %>%
  mutate(trend = row_number()) %>% ungroup()

# Important: group by artist and country; we can then generate row numbers for every week.
# Arranging by week first is a safeguard: if the data were not ordered for some reason, the trend variable would be off.

Exercise 4: Create a Complete Data Grid

Ensure that every artist-country-week combination is represented, even if no streams were recorded.

💡 Tip:

  • Use expand_grid() to create all possible combinations of artists, countries and weeks. Check R's help to see how the command works (we haven't covered it in the tutorial so far): just type ?expand_grid in your console.
all_weeks <- unique(streams_aggregated$week)
all_artists <- unique(streams_aggregated$artist)
all_countries <- unique(streams_aggregated$country)

complete_data <- expand_grid(artist = all_artists, country = all_countries, week = all_weeks)
dim(complete_data) # more than 20k rows!
## [1] 22684     3

Exercise 5: Merge Streams Data into the Complete Grid (complete_data)

Join the aggregated streams data with the complete dataset from Exercise 4.

complete_data <- complete_data %>%
  left_join(streams_aggregated, by = c("artist", "country", "week"))

Exercise 6: Merge Weather and Social Media Data

  • Prepare and merge weather data
    • First, create an ISO week column.
    • Then, aggregate to the weekly level.
    • Next, turn it back into wide (with temp and sun as columns)
    • Finally, you can merge it with complete_data.
  • Prepare and merge socials data
    • Add ISO week to socials data
    • Aggregate to weekly level (week, artist, platform and metric)
    • Reshape long to wide
    • Merge!
# Reshape weather data (long format)
weather_long <- weather %>%
  pivot_longer(cols = -date, names_to = "country", values_to = "value") %>%
  mutate(week = ISOweek::ISOweek(date))

# Aggregate to weekly level
weather_long_aggregated <- weather_long %>% group_by(week, country) %>% summarize(value=mean(value,na.rm=T))
## `summarise()` has grouped output by 'week'. You can override using the
## `.groups` argument.
# Separate country and measures into columns
weather_long_aggregated = weather_long_aggregated %>% separate(country, into = c('country','measure'), sep='_', convert=T)

# Turn into wide again for temp & sun hours
weather_long_wide = weather_long_aggregated %>% pivot_wider(id_cols=c('week','country'), names_from='measure', values_from='value')

# Merge weather data
complete_data <- complete_data %>%
  left_join(weather_long_wide, by = c("country", "week"))

# Add ISO week to socials data
socials <- socials %>% mutate(week=ISOweek::ISOweek(date))

# Aggregate to weekly level
socials_aggregated <- socials %>% group_by(week, artist, platform, metric) %>% summarize(value=mean(value,na.rm=T))
## `summarise()` has grouped output by 'week', 'artist', 'platform'. You can
## override using the `.groups` argument.
# Reshape social media data (wide format)
socials_wide <- socials_aggregated %>%
  pivot_wider(names_from = c('platform', 'metric'), values_from = 'value')

# Merge social media
complete_data <- complete_data %>%
  left_join(socials_wide, by = c("artist", "week"))

Exercise 7: Handle Missing Data

Impute missing values for the following variables:

  • sun: na.locf()
  • temperature: na.locf()
  • Instagram_comments: na.approx()
  • Instagram_followers: na.approx()
  • Snapchat_comments: na.approx()
  • Snapchat_followers: na.approx()
  • TikTok_comments: na.approx()
  • TikTok_followers: na.approx()
  • X_comments: na.approx()
  • X_followers: na.approx()
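# Note: the snippet below only prints the result. Assign it back
# (complete_data <- complete_data %>% ...) if you want to keep the imputed
# values for the later exercises.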
complete_data %>% group_by(artist, country) %>%
  arrange(artist, country, week) %>%
  mutate(sun = na.locf(sun, na.rm=F),
         temp = na.locf(temp, na.rm=F),
         Instagram_comments = na.approx(Instagram_comments, na.rm=F),
         Instagram_followers = na.approx(Instagram_followers, na.rm=F),
         Snapchat_comments = na.approx(Snapchat_comments, na.rm=F),
         Snapchat_followers = na.approx(Snapchat_followers, na.rm=F),
         TikTok_comments = na.approx(TikTok_comments, na.rm=F),
         TikTok_followers = na.approx(TikTok_followers, na.rm=F),
         X_comments = na.approx(X_comments, na.rm=F),
         X_followers = na.approx(X_followers, na.rm=F)) %>%
  ungroup()
## # A tibble: 22,684 × 15
##    artist      country week  weekly_streams trend   sun  temp Instagram_comments
##    <chr>       <chr>   <chr>          <dbl> <int> <dbl> <dbl>              <dbl>
##  1 10 Beats t… BE      2024…         23582.     1  5.87  12.6            268948.
##  2 10 Beats t… BE      2024…         18361      2  5.82  12.7            272486.
##  3 10 Beats t… BE      2024…         12335.     3  6.72  16.3            276559.
##  4 10 Beats t… BE      2024…         18779.     4  6.43  16.8            280275.
##  5 10 Beats t… BE      2024…         11993.     5  7.29  17.5            284042.
##  6 10 Beats t… BE      2024…         12023.     6  6.95  17.4            287862 
##  7 10 Beats t… BE      2024…         17969      7  6.77  17.5            291993 
##  8 10 Beats t… BE      2024…         53596.     8  7.26  18.9            296186.
##  9 10 Beats t… BE      2024…         20325.     9  7.26  20.0            300070 
## 10 10 Beats t… BE      2024…         11753     10  7.34  20.8            304335.
## # ℹ 22,674 more rows
## # ℹ 7 more variables: Instagram_followers <dbl>, Snapchat_comments <dbl>,
## #   Snapchat_followers <dbl>, TikTok_comments <dbl>, TikTok_followers <dbl>,
## #   X_comments <dbl>, X_followers <dbl>

Exercise 8: Scaling Up I: Generate Stream Plots for Multiple Artists

Generate line plots for the first 10 artists and save them using ggsave(). Use facet plots (facet_wrap(), see week 1) to also plot streams by different countries.

# starting code
library(ggplot2)
dir.create('plots')

complete_data <- complete_data %>% mutate(week_start = ISOweek::ISOweek2date(paste0(week, '-1')))

artists = unique(streams$artist)
for (one_artist in artists[1:10]) {
  plot_data <- complete_data %>% filter(artist == one_artist)
  
  # ggplot code here
}
library(ggplot2)
dir.create('plots')

artists <- unique(complete_data$artist)

complete_data <- complete_data %>% mutate(week_start = ISOweek::ISOweek2date(paste0(week, '-1')))

for (one_artist in artists[1:10]) {
  plot_data <- complete_data %>% filter(artist == one_artist)
  
  p <- ggplot(plot_data, aes(x = week_start, y = weekly_streams)) +
    geom_line() +
    facet_wrap(~country) +
    labs(title = paste("Streams for", one_artist))

  ggsave(paste0("plots/", one_artist, ".png"), plot = p, width = 6, height = 4)
}

Exercise 9: Scaling Up II: Estimate Individual Regression Models

Estimate separate regression models for the first 10 artists and summarize results. Use a loop, as in the code snippet below (later on, you could estimate these models for all 107 artists).

In your regression model, use

  • DV: log(weekly_streams+1)
  • Independent variables
    • country (dummy variable, as.factor(country))
    • sun and temp (not log-transformed)
    • and the log of all social media metrics (log(x+1))
# starting code

artists = unique(complete_data$artist)

results = lapply(artists[1:10], function(one_artist) {
  filtered_data = complete_data %>% filter(artist==one_artist)
  # estimation code here
  
  m <- lm(rnorm(10)~1, data = filtered_data) # just an example
  
  results = data.frame(variable = names(m$coefficients),
                     summary(m)$coefficients)
  colnames(results) <- c('variable','estimate','se','t','pval')
  results = tibble(results) %>% mutate(artist=one_artist)
  
})

all_results = bind_rows(results)
artists = unique(complete_data$artist)

results = lapply(artists[1:10], function(one_artist) {
  filtered_data = complete_data %>% filter(artist==one_artist)
  
  m <- lm(log(weekly_streams+1) ~ 1 + 
                as.factor(country) +
                sun + temp +
                log(Instagram_comments+1) + log(Instagram_followers+1) +
                log(Snapchat_comments+1) + log(Snapchat_followers+1) +
                log(TikTok_comments+1) + log(TikTok_followers+1) +
                log(X_comments+1) + log(X_followers+1), data = filtered_data)
  results = data.frame(variable = names(m$coefficients),
                     summary(m)$coefficients)
  colnames(results) <- c('variable','estimate','se','t','pval')
  results = tibble(results) %>% mutate(artist=one_artist)
  return(results)
})

all_results = bind_rows(results)

# all results now contains the results of all individual regressions;
# you could now "filter" through - e.g., to create a histogram of the effects of sun!

all_results %>% filter(grepl('sun', variable)) %>%
  ggplot() + geom_histogram(aes(x = estimate)) + ggtitle('Histogram of the effect of sun hours on streams')

Summary

Through these exercises, you’ll gain hands-on experience with:
- Aggregating and transforming time-based data
- Handling missing values appropriately
- Merging datasets with different time granularities
- Automating plots and regression analysis for multiple artists

This workflow mirrors real-world data science challenges, where cleaning and structuring data is just as important as modeling.

Good job!