In this tutorial, we'll tackle common data challenges using a simulated music streaming dataset. Starting with multiple raw data sources, we'll clean, transform, and merge them into a single analysis-ready data frame. Along the way, you'll practice functions such as rank(), unique(), and lm().

Ready to start? Here's what we recommend you do before going through this tutorial.
Open RStudio and create a new project via File -> New Project -> New Directory -> New Project. Name it data-preparation and save it in a location of your choice. Then download the data into your data folder: Click here to manually download the dataset. Alternatively, run the following R code to download and unzip the file automatically:
# downloading and unzipping the data
if (!file.exists('week4_data.zip')) {
download.file('https://github.com/hannesdatta/course-dprep/raw/refs/heads/master/content/docs/modules/week4/tutorial/data/week4_data.zip', 'week4_data.zip')
unzip('week4_data.zip', overwrite=TRUE)
}
Now, load the data into R using read_csv
:
library(tidyverse)
streams <- read_csv('streams.csv')
weather <- read_csv('weather.csv')
socials <- read_csv('socials.csv')
playlists <- read_csv('playlists.csv')
Troubleshooting: if R can't find the files, check your working directory with getwd() and make sure the downloaded files are stored there. Avoid setwd() — hardcoding directories can cause problems!

Let's explore the data. It consists of four files, which vary in structure, unit of observation, etc. This is intentional: it lets you practice all the concepts covered in this tutorial.
- streams.csv: the number of streams (streams) for an artist in a given country on a given day (date)
- weather.csv: for each day and country, the average temperature (_temp) and sun hours (_sun)
- socials.csv: for every artist and day (date), social media metrics (e.g., on the platform TikTok, the metric followers is observed with the value 458352)
- playlists.csv: for each week (iso_week) and artist, the number of playlists an artist was listed on on Spotify.

We can now explore the data using the strategies introduced in earlier tutorials.
Tips:
- Inspect the first rows with head(streams), or open the full table with View(streams). To view more rows with head, use head(streams, 100) (for, e.g., the first 100 rows); tail() gives you the last rows of the data.
- summary(playlists) generates descriptive statistics for all variables in the data, and glimpse(playlists) shows a compact overview of its structure. You can also use these commands on individual columns (e.g., summary(playlists$playlists)).
- Create frequency tables with the table() command.

In this section, we will deepen our understanding of key R functions that are essential for data engineering. We will focus on numerical functions, advanced string cleaning, and working with time-based data. Each concept will be demonstrated with examples from our datasets.
Using base R functions in the tidyverse
Before diving into more advanced transformations, let’s start by
applying base R functions inside dplyr
operations. In mutate()
and summarize()
, you
can use any function from base R—not just
mean()
, min()
, or max()
. This is
useful when operationalizing new variables, performing custom
transformations, or computing statistics on groups.
For now, we'll use min, mean, and max to get
started. But remember, the same logic applies to any
function, like sd()
, range()
,
var()
, or even custom functions you define.
library(dplyr)
# Compute basic statistics per artist
streams %>%
group_by(artist) %>%
summarize(
min_streams = min(streams, na.rm = TRUE),
mean_streams = mean(streams, na.rm = TRUE),
max_streams = max(streams, na.rm = TRUE),
sd_streams = sd(streams, na.rm = TRUE) # Standard deviation
)
## # A tibble: 107 × 5
## artist min_streams mean_streams max_streams sd_streams
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 10 Beats to Phoenix 0 5519. 24868 5121.
## 2 10 Notes to Wolves 0 5795. 21932 4697.
## 3 30 Minutes to Mirage 0 5021. 22207 4468.
## 4 30 Minutes to Rebels 0 5034. 24260 4818.
## 5 30 Minutes to Saints 0 5098. 20977 4627.
## 6 30 Ways to Storm 0 5865. 24271 4948.
## 7 40 Days to Rebels 0 5082. 21415 4667.
## 8 40 Miles to Storm 0 5386. 24915 4823.
## 9 60 Notes to Echo 0 4558. 22690 4519.
## 10 77 Beats to Wolves 0 5661. 24751 4827.
## # ℹ 97 more rows
Want to go further? Any function that works on a vector can be used here—for example:
- var(streams, na.rm = TRUE) to calculate variance.
- range(streams, na.rm = TRUE) to get the min and max in one step.
- quantile(streams, probs = c(0.25, 0.5, 0.75), na.rm = TRUE) to get quartiles.

💡 Tip: You can check out the Base R Cheatsheet to explore more functions you can integrate into your tidyverse workflows.
Ranking is super useful in our application because we may be interested in identifying the most-listened-to artist or tracking changes in ranking over time. Instead of just looking at absolute numbers, ranking allows us to order values while handling ties, which is often critical in reporting and analysis.
The rank()
function assigns ranks based on the values in
a column, with ties receiving the same rank. By
default, it ranks in ascending order, but using -x
flips
the order to descending. This is particularly useful when working with
streams data, where we want to rank artists by
their daily streaming count.
library(dplyr)
# Rank artists within each day
streams %>%
group_by(date) %>%
mutate(
rank_streams = rank(-streams) # Higher streams = lower rank (1st place)
) %>%
arrange(date, rank_streams) %>%
select(date, artist, streams, rank_streams) %>%
head(10)
## # A tibble: 10 × 4
## # Groups: date [1]
## date artist streams rank_streams
## <date> <chr> <dbl> <dbl>
## 1 2024-01-01 Rockets Band 18348 1
## 2 2024-01-01 Saint Echo 17092 2
## 3 2024-01-01 Midnight Mirage 16080 3
## 4 2024-01-01 Storm Fusion 15635 4
## 5 2024-01-01 Midnight Rockets 15428 5
## 6 2024-01-01 Wolves Band 15031 6
## 7 2024-01-01 Saint Storm 14941 7
## 8 2024-01-01 Wolves Experience 14424 8
## 9 2024-01-01 Sophia Miller 14248 9
## 10 2024-01-01 Logan Brown 14221 10
💡 Other useful ranking functions:
- row_number(x): assigns unique sequential numbers without ties.

Using these functions, we can create trend variables (e.g., running from 1 to N for all observations of an artist). Using this variable in a regression model, for example, could help us understand streaming trends by country.
streams = streams %>%
  group_by(artist) %>%
  arrange(artist, date) %>%
  mutate(trend = row_number()) %>%
  ungroup()
head(streams,15)
## # A tibble: 15 × 5
## date artist country streams trend
## <date> <chr> <chr> <dbl> <int>
## 1 2024-01-01 10 Beats to Phoenix BE 6983 1
## 2 2024-01-01 10 Beats to Phoenix DE 9523 2
## 3 2024-01-01 10 Beats to Phoenix NL 6258 3
## 4 2024-01-01 10 Beats to Phoenix US 2856 4
## 5 2024-01-02 10 Beats to Phoenix BE 8451 5
## 6 2024-01-02 10 Beats to Phoenix DE 8601 6
## 7 2024-01-02 10 Beats to Phoenix NL 0 7
## 8 2024-01-02 10 Beats to Phoenix US 6642 8
## 9 2024-01-03 10 Beats to Phoenix BE 0 9
## 10 2024-01-03 10 Beats to Phoenix DE 0 10
## 11 2024-01-03 10 Beats to Phoenix NL 9712 11
## 12 2024-01-03 10 Beats to Phoenix US 9648 12
## 13 2024-01-04 10 Beats to Phoenix BE 0 13
## 14 2024-01-04 10 Beats to Phoenix DE 292 14
## 15 2024-01-04 10 Beats to Phoenix NL 0 15
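The code that produced the two regression summaries below isn't included in the extracted page; judging from the Call: lines in the output, it was presumably something like this:

# presumed calls, matching the Call: lines in the output below
summary(lm(streams ~ 1 + trend, data = streams))
summary(lm(streams ~ 1 + trend * country, data = streams))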
##
## Call:
## lm(formula = streams ~ 1 + trend, data = streams)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6074.1 -4275.5 -818.8 3085.3 24313.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.096e+03 2.865e+01 142.98 <2e-16 ***
## trend 1.531e+00 3.992e-02 38.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4721 on 104529 degrees of freedom
## (31293 observations deleted due to missingness)
## Multiple R-squared: 0.01387, Adjusted R-squared: 0.01386
## F-statistic: 1470 on 1 and 104529 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = streams ~ 1 + trend * country, data = streams)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9051.6 -3569.1 -760.7 2909.6 23727.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.055e+03 5.367e+01 56.913 < 2e-16 ***
## trend 1.097e+00 7.496e-02 14.631 < 2e-16 ***
## countryDE 6.859e+02 7.608e+01 9.016 < 2e-16 ***
## countryNL 1.988e+01 7.616e+01 0.261 0.7940
## countryUS 3.471e+03 7.609e+01 45.619 < 2e-16 ***
## trend:countryDE 5.308e-01 1.062e-01 5.000 5.73e-07 ***
## trend:countryNL 3.236e-01 1.061e-01 3.049 0.0023 **
## trend:countryUS 8.583e-01 1.061e-01 8.090 6.04e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4440 on 104523 degrees of freedom
## (31293 observations deleted due to missingness)
## Multiple R-squared: 0.1278, Adjusted R-squared: 0.1277
## F-statistic: 2187 on 7 and 104523 DF, p-value: < 2.2e-16
When working with data, we often need to remove duplicate values or count unique occurrences. Whether we're filtering artist names, cleaning up datasets, or summarizing trends, uniqueness matters.
In tidyverse, we have a few tools to help:
- unique() kicks out duplicate values from a vector or data frame.
- distinct() (from dplyr) removes duplicate rows from a data frame.
- n_distinct() counts the number of unique values in a column.

Let's start simple—getting a list of all the unique artists in our dataset.
unique_artists <- unique(streams$artist) # Extract unique artist names
head(unique_artists, 10) # Show a preview
## [1] "10 Beats to Phoenix" "10 Notes to Wolves" "30 Minutes to Mirage"
## [4] "30 Minutes to Rebels" "30 Minutes to Saints" "30 Ways to Storm"
## [7] "40 Days to Rebels" "40 Miles to Storm" "60 Notes to Echo"
## [10] "77 Beats to Wolves"
You can also use unique()
inside a mutate()
statement.
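For instance, here is a small sketch that counts the number of unique artists per country, either with length(unique()) inside mutate() or, more idiomatically, with n_distinct() inside summarize():

# add the count of unique artists to every row of a country
streams %>%
  group_by(country) %>%
  mutate(n_artists = length(unique(artist))) %>%
  ungroup()

# or: one row per country
streams %>%
  group_by(country) %>%
  summarize(n_artists = n_distinct(artist))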
Regular expressions (regex) allow us to search for patterns in text, making it easier to clean and filter data. In R, we use:
- grepl(x, y) → detects whether y contains the pattern x (evaluates to TRUE or FALSE)
- grep(x, y, value = T) → detects whether y contains the pattern x (and returns the matching values themselves)
- gsub(x, z, y) → replaces matched patterns (x) with new values (z)

Pattern | Description | Example |
---|---|---|
\\d | Find numbers (0-9) | grepl("\\d", "Song123") → TRUE |
[a-zA-Z] | Find letters (case-sensitive) | grepl("[a-zA-Z]", "123!") → FALSE |
^word | Match beginning of a string | grepl("^The", "The Beatles") → TRUE |
word$ | Match end of a string | grepl("Band$", "Jazz Band") → TRUE |
grepl(): Finding Patterns in Text

Use grepl() to detect whether a pattern exists in a column. It returns TRUE/FALSE, making it useful for filtering data.
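The code chunks behind the next few outputs aren't shown in the extracted page; filters along these lines would produce them (a sketch):

# artists whose name contains "Beats"
streams %>% distinct(artist) %>% filter(grepl("Beats", artist))

# artists whose name starts with a digit
streams %>% distinct(artist) %>% filter(grepl("^\\d", artist))

# artists whose name starts with "The"
streams %>% distinct(artist) %>% filter(grepl("^The", artist))

# artists whose name ends with "Band"
streams %>% distinct(artist) %>% filter(grepl("Band$", artist))

# grep() with value = TRUE returns the matching values directly
grep("Band$", unique(streams$artist), value = TRUE)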
## # A tibble: 2 × 1
## artist
## <chr>
## 1 10 Beats to Phoenix
## 2 77 Beats to Wolves
## # A tibble: 15 × 1
## artist
## <chr>
## 1 10 Beats to Phoenix
## 2 10 Notes to Wolves
## 3 30 Minutes to Mirage
## 4 30 Minutes to Rebels
## 5 30 Minutes to Saints
## 6 30 Ways to Storm
## 7 40 Days to Rebels
## 8 40 Miles to Storm
## 9 60 Notes to Echo
## 10 77 Beats to Wolves
## 11 77 Miles to Storm
## 12 77 Steps to Phoenix
## 13 77 Ways to Riders
## 14 88 Steps to Storm
## 15 88 Ways to Rebels
## # A tibble: 3 × 1
## artist
## <chr>
## 1 The Mirage
## 2 The Rebels
## 3 The Wolves
## # A tibble: 3 × 1
## artist
## <chr>
## 1 Riders Band
## 2 Rockets Band
## 3 Wolves Band
## [1] "Riders Band" "Rockets Band" "Wolves Band"
gsub(): Replacing Patterns in Text

Use gsub() to modify text by replacing matched patterns.
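Again, the original chunks aren't shown; mutate() calls like these would generate the outputs below (a sketch):

# remove all spaces from artist names
streams %>% mutate(artist_clean = gsub(" ", "", artist))

# remove all digits (note the leading space that remains, e.g., " Beats to Phoenix")
streams %>% mutate(artist_clean = gsub("\\d", "", artist))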
## # A tibble: 135,824 × 6
## date artist country streams trend artist_clean
## <date> <chr> <chr> <dbl> <int> <chr>
## 1 2024-01-01 10 Beats to Phoenix BE 6983 1 10BeatstoPhoenix
## 2 2024-01-01 10 Beats to Phoenix DE 9523 2 10BeatstoPhoenix
## 3 2024-01-01 10 Beats to Phoenix NL 6258 3 10BeatstoPhoenix
## 4 2024-01-01 10 Beats to Phoenix US 2856 4 10BeatstoPhoenix
## 5 2024-01-02 10 Beats to Phoenix BE 8451 5 10BeatstoPhoenix
## 6 2024-01-02 10 Beats to Phoenix DE 8601 6 10BeatstoPhoenix
## 7 2024-01-02 10 Beats to Phoenix NL 0 7 10BeatstoPhoenix
## 8 2024-01-02 10 Beats to Phoenix US 6642 8 10BeatstoPhoenix
## 9 2024-01-03 10 Beats to Phoenix BE 0 9 10BeatstoPhoenix
## 10 2024-01-03 10 Beats to Phoenix DE 0 10 10BeatstoPhoenix
## # ℹ 135,814 more rows
## # A tibble: 135,824 × 6
## date artist country streams trend artist_clean
## <date> <chr> <chr> <dbl> <int> <chr>
## 1 2024-01-01 10 Beats to Phoenix BE 6983 1 " Beats to Phoenix"
## 2 2024-01-01 10 Beats to Phoenix DE 9523 2 " Beats to Phoenix"
## 3 2024-01-01 10 Beats to Phoenix NL 6258 3 " Beats to Phoenix"
## 4 2024-01-01 10 Beats to Phoenix US 2856 4 " Beats to Phoenix"
## 5 2024-01-02 10 Beats to Phoenix BE 8451 5 " Beats to Phoenix"
## 6 2024-01-02 10 Beats to Phoenix DE 8601 6 " Beats to Phoenix"
## 7 2024-01-02 10 Beats to Phoenix NL 0 7 " Beats to Phoenix"
## 8 2024-01-02 10 Beats to Phoenix US 6642 8 " Beats to Phoenix"
## 9 2024-01-03 10 Beats to Phoenix BE 0 9 " Beats to Phoenix"
## 10 2024-01-03 10 Beats to Phoenix DE 0 10 " Beats to Phoenix"
## # ℹ 135,814 more rows
Time-related data comes in many forms—years, months, days, hours, minutes, and seconds—each with its own quirks. In R, handling time efficiently requires careful attention to conversion, formatting, and interpretation. Whether you're working with timestamps, durations, or calendar-based data, understanding how to manipulate and analyze time correctly is crucial for avoiding errors and ensuring meaningful insights.
In this subsection, we’ll cover:
1. Converting character data to date/time formats
2. Switching between different time units (e.g., minutes to hours, days to weeks)
3. Accounting for the time properties of your data (e.g., time zones, missing values, and irregular intervals)
With dplyr
and supporting packages like
lubridate
, ISOweek
and zoo
, R
provides powerful tools for working with time in a structured and
intuitive way. Let’s dive in!
Tip: make sure you have all of these packages installed.
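If any of them are missing, you can install them once, for example:

# one-time installation (only needed if the packages aren't installed yet)
install.packages(c("lubridate", "ISOweek", "zoo"))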
Working with dates and times in R can be tricky, especially when data
isn't stored in a standard format. Raw datasets often contain date
values as character strings, which need to be converted
into proper date-time objects for analysis. For example, you might
encounter dates stored as "20250104"
(YYYYMMDD) instead of
the more commonly recognized "2025-01-04"
(YYYY-MM-DD).
To work effectively with dates, we need to recode and standardize them into a format R can recognize. This allows for easier filtering, grouping, and aggregation—essential when analyzing trends over time, such as weekly sales or monthly user activity.
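For example, a minimal sketch using a made-up string (not a value from our data):

# "20250104" is a character string; convert it to a proper Date object
as.Date("20250104", format = "%Y%m%d")   # base R
lubridate::ymd("20250104")               # lubridate equivalent, returns 2025-01-04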
week(date)
from lubridate gives the
week number (1-53) based on the Gregorian calendar,
starting on January 1st. In contrast, ISOweek(date)
follows
the ISO 8601 standard, where weeks start on
Monday, and Week 1 is the first week
with at least four days in the new year. ISOweek
also
includes the year (YYYY-Www
) and may be more appropriate
when you need full weeks with exactly seven days.
library(lubridate)
streams <- streams %>%
mutate(
week = week(date),
week2 = ISOweek::ISOweek(date),
month = month(date),
year = year(date)
)
streams %>% select(date, week, week2, month, year) %>% head(10)
## # A tibble: 10 × 5
## date week week2 month year
## <date> <dbl> <chr> <dbl> <dbl>
## 1 2024-01-01 1 2024-W01 1 2024
## 2 2024-01-01 1 2024-W01 1 2024
## 3 2024-01-01 1 2024-W01 1 2024
## 4 2024-01-01 1 2024-W01 1 2024
## 5 2024-01-02 1 2024-W01 1 2024
## 6 2024-01-02 1 2024-W01 1 2024
## 7 2024-01-02 1 2024-W01 1 2024
## 8 2024-01-02 1 2024-W01 1 2024
## 9 2024-01-03 1 2024-W01 1 2024
## 10 2024-01-03 1 2024-W01 1 2024
Aggregating data over different time frames helps us see trends, patterns, and long-term changes more clearly. It’s also useful when combining datasets with different time levels—for example, if one dataset has daily data and another has weekly data, it makes sense to convert both to the same time unit. Aggregation also helps when there are missing values; instead of struggling with incomplete daily data, taking the average per week gives a more reliable view.
Using higher time units can also smooth out random ups and downs, making trends easier to spot, reducing the amount of data to process, and matching real-world reporting, like weekly sales reports or monthly budgets. The best level of aggregation depends on the question you’re trying to answer and how complete your data is.
monthly_streams <- streams %>%
  group_by(year, month, artist) %>%
  summarize(total_streams = mean(streams, na.rm = TRUE)) # note: despite the name, this is the average daily streams per month; use sum() for a true total
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups: year, month [1]
## year month artist total_streams
## <dbl> <dbl> <chr> <dbl>
## 1 2024 1 10 Beats to Phoenix 3532.
## 2 2024 1 10 Notes to Wolves 5431.
## 3 2024 1 30 Minutes to Mirage 4751.
## 4 2024 1 30 Minutes to Rebels 4605.
## 5 2024 1 30 Minutes to Saints 4679.
## 6 2024 1 30 Ways to Storm 4562.
## 7 2024 1 40 Days to Rebels 5290.
## 8 2024 1 40 Miles to Storm 4739.
## 9 2024 1 60 Notes to Echo 4672.
## 10 2024 1 77 Beats to Wolves 5077.
In time series data, some missing values (NAs) can be reasonably estimated rather than left blank. Instead of treating them as unknown, we can make plausible assumptions based on how the data typically behaves. Looking at the simulated dataset below, we see missing values in both the price and playlists columns.
For prices, a good assumption is that they remain the same until updated. This approach, called last observation carried forward (LOCF), means we can fill missing values with the last observed value. For example, on 2025-01-03, the price is missing, but since it was 1.1 on 2025-01-02, we assume it stayed the same.
tmp_dates=seq(from=as.Date('2025-01-01'), length.out=10, by='1 day')
sim_data= tibble(date=tmp_dates, price = c(1.20,1.10,NA,1.10,1.5,1.5,NA, NA, 1.4, 1.3),
playlists = c(30,20,NA,10,NA, NA, 30, 40, NA, NA))
sim_data
## # A tibble: 10 × 3
## date price playlists
## <date> <dbl> <dbl>
## 1 2025-01-01 1.2 30
## 2 2025-01-02 1.1 20
## 3 2025-01-03 NA NA
## 4 2025-01-04 1.1 10
## 5 2025-01-05 1.5 NA
## 6 2025-01-06 1.5 NA
## 7 2025-01-07 NA 30
## 8 2025-01-08 NA 40
## 9 2025-01-09 1.4 NA
## 10 2025-01-10 1.3 NA
Last observation carried forward (na.locf)
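The chunk producing the prices_imputed column below isn't shown; presumably it looked roughly like this (na.rm = FALSE keeps leading NAs instead of dropping them; loading zoo triggers the attach message below):

library(zoo)  # provides na.locf() for last-observation-carried-forward imputation

sim_data <- sim_data %>%
  mutate(prices_imputed = na.locf(price, na.rm = FALSE))
sim_data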
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## # A tibble: 10 × 4
## date price playlists prices_imputed
## <date> <dbl> <dbl> <dbl>
## 1 2025-01-01 1.2 30 1.2
## 2 2025-01-02 1.1 20 1.1
## 3 2025-01-03 NA NA 1.1
## 4 2025-01-04 1.1 10 1.1
## 5 2025-01-05 1.5 NA 1.5
## 6 2025-01-06 1.5 NA 1.5
## 7 2025-01-07 NA 30 1.5
## 8 2025-01-08 NA 40 1.5
## 9 2025-01-09 1.4 NA 1.4
## 10 2025-01-10 1.3 NA 1.3
Linear interpolation (na.approx)

For playlists, a better approach is linear interpolation, which estimates values by assuming a smooth change over time. Instead of leaving 2025-01-03 blank, we can calculate its value as the midpoint between 2025-01-02 (20) and 2025-01-04 (10), which gives 15.
library(zoo)
sim_data = sim_data %>% mutate(playlists_imputed = zoo::na.approx(playlists, na.rm=F))
sim_data
## # A tibble: 10 × 5
## date price playlists prices_imputed playlists_imputed
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2025-01-01 1.2 30 1.2 30
## 2 2025-01-02 1.1 20 1.1 20
## 3 2025-01-03 NA NA 1.1 15
## 4 2025-01-04 1.1 10 1.1 10
## 5 2025-01-05 1.5 NA 1.5 16.7
## 6 2025-01-06 1.5 NA 1.5 23.3
## 7 2025-01-07 NA 30 1.5 30
## 8 2025-01-08 NA 40 1.5 40
## 9 2025-01-09 1.4 NA 1.4 NA
## 10 2025-01-10 1.3 NA 1.3 NA
A word of caution (na.locf, na.approx): when imputing missing values, always group by the right variables before applying functions like na.locf(). If you forget to group_by(artist, country), missing stream counts might get filled across different artists or countries, leading to completely nonsensical data. At the same time, don’t forget to ungroup() afterward—otherwise, later operations might still be applied within each group, causing unexpected results. Grouping is powerful, but if used carelessly, it can seriously mess up your data!
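Note that the lag example below uses a streams_filled column that isn't created anywhere in the code shown above. Judging from the output (on 2024-01-07 the missing value becomes 1868, the midpoint of 3069 and 666), it was presumably created with grouped linear interpolation, roughly like this (a sketch):

library(zoo)

streams <- streams %>%
  group_by(artist, country) %>%
  arrange(artist, country, date) %>%
  mutate(streams_filled = na.approx(streams, na.rm = FALSE)) %>%  # interpolate within artist-country
  ungroup()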
Other useful variables in time series analysis are
lags and leads. Lags
shift past values forward (lag()
), helping us compare a
current value to a previous one, while leads shift
future values backward (lead()
), allowing us to see what
comes next. For example, if a song had 5,000 streams on
Monday and 5,500 on Tuesday, a lag would bring
Monday’s value into Tuesday’s row, making comparisons easier. In the
next example, we’ll use lags to track how streams change over time.
streams <- streams %>%
group_by(artist, country) %>%
arrange(date) %>%
mutate(
lag_streams = lag(streams_filled),
daily_change = streams_filled - lag_streams
)
streams %>% arrange(artist,country,date) %>% select(artist, country, date, streams, lag_streams, daily_change) %>% head(10)
## # A tibble: 10 × 6
## # Groups: artist, country [1]
## artist country date streams lag_streams daily_change
## <chr> <chr> <date> <dbl> <dbl> <dbl>
## 1 10 Beats to Phoenix BE 2024-01-01 6983 NA NA
## 2 10 Beats to Phoenix BE 2024-01-02 8451 6983 1468
## 3 10 Beats to Phoenix BE 2024-01-03 0 8451 -8451
## 4 10 Beats to Phoenix BE 2024-01-04 0 0 0
## 5 10 Beats to Phoenix BE 2024-01-05 1710 0 1710
## 6 10 Beats to Phoenix BE 2024-01-06 3069 1710 1359
## 7 10 Beats to Phoenix BE 2024-01-07 NA 3069 -1202.
## 8 10 Beats to Phoenix BE 2024-01-08 666 1868. -1202.
## 9 10 Beats to Phoenix BE 2024-01-09 0 666 -666
## 10 10 Beats to Phoenix BE 2024-01-10 0 0 0
Data is often stored in different formats, depending
on the analysis needs. Sometimes, we need to widen a
dataset to make it easier to read, and other times, we need to
lengthen it to perform certain operations. The
tidyr
package in tidyverse
provides
pivoting functions to reshape data efficiently.
Format | Example Structure | Example Dataset |
---|---|---|
Wide | One row per date, multiple columns for weather indicator by country | weather.csv |
Long | One row per artist-date-country combination, a column for streams | streams.csv |
pivot_longer()
When converting wide data to long format, we gather multiple columns into key-value pairs. This is useful when variables are spread across columns but should be stored as rows.
Key arguments in pivot_longer():
- cols – Selects which columns to pivot. You can list them (e.g., c(BE_temp, DE_temp)), use a range (BE_temp:US_temp), or use -date or !date to pivot everything except certain columns.
- names_to – Defines where column names will be stored, e.g., "country" if columns like DE_temp and BE_temp should be labeled by country.
- values_to – Defines where the actual data values go, e.g., "temperature" if numeric values from DE_temp and BE_temp represent temperature readings.

This transformation makes it easier to filter, group, and analyze structured data.
Making the weather data long by country

The weather dataset stores temperature data by country in wide format (one column per country). We want to convert it into long format, with one row per date-country combination and a single temperature column.
long_weather <- weather %>%
  pivot_longer(cols = ends_with('temp'), names_to = "name", values_to = "temperature") %>%
  separate(name, into = c('country','variable'), sep = '_') %>%
  select(date, country, temperature)
head(long_weather)
## # A tibble: 6 × 3
## date country temperature
## <date> <chr> <dbl>
## 1 2024-01-01 DE 14.3
## 2 2024-01-01 BE 10.5
## 3 2024-01-01 NL 10.6
## 4 2024-01-01 US 17.1
## 5 2024-01-02 DE 8.72
## 6 2024-01-02 BE 17.1
Before pivoting (wide format):
date | BE_temp | DE_temp |
---|---|---|
2024-01-01 | 11.6 | 14.3 |
2024-01-02 | 11.4 | 8.7 |
After pivoting (long format):
date | country | temperature |
---|---|---|
2024-01-01 | BE | 11.6 |
2024-01-01 | DE | 14.3 |
2024-01-02 | BE | 11.4 |
2024-01-02 | DE | 8.7 |
pivot_wider()
When converting long data to wide format, we spread values across multiple columns, making the dataset easier to read and analyze.
Key arguments in pivot_wider():
- id_cols – Specifies the column(s) that should remain unchanged, e.g., "date" if we want to keep dates as row identifiers while spreading other values. If omitted, pivot_wider() treats all remaining columns as identifiers.
- names_from – Defines which column contains the new column names, e.g., "country" if values from "country" (DE, BE, NL) should become separate columns.
- values_from – Specifies which column contains the data to be spread across new columns, e.g., "temperature" if we want each country's temperature as its own column.

This transformation is useful when preparing data for reports, modeling, or visualizations.
Making long_weather wide again

wide_weather <- long_weather %>%
pivot_wider(names_from = 'country', values_from = 'temperature')
head(wide_weather)
## # A tibble: 6 × 5
## date DE BE NL US
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 2024-01-01 14.3 10.5 10.6 17.1
## 2 2024-01-02 8.72 17.1 10.4 21.8
## 3 2024-01-03 11.7 7.41 11.7 17.4
## 4 2024-01-04 12.7 10.5 8.78 15.3
## 5 2024-01-05 12.2 13.0 9.82 29.3
## 6 2024-01-06 10.9 17.0 13.4 13.3
Before pivoting (long format):
date | country | temperature |
---|---|---|
2024-01-01 | BE | 11.6 |
2024-01-01 | DE | 14.3 |
2024-01-02 | BE | 11.4 |
2024-01-02 | DE | 8.7 |
After pivoting (wide format):
date | BE | DE |
---|---|---|
2024-01-01 | 11.6 | 14.3 |
2024-01-02 | 11.4 | 8.7 |
For more details, check out the tidyverse cheatsheet on
pivoting:
tidyr
Pivoting Cheatsheet
In real-world analysis, data is often stored in separate tables. To
get meaningful insights, we need to merge datasets based on shared keys.
The dplyr
package provides efficient functions for this,
making it easy to combine and manipulate data.
Joins allow us to merge two datasets by matching values in a key column. Here, we will focus on the most important join types:
Join Type | Keeps All Rows From | Drops Rows? |
---|---|---|
Left Join (left_join()) | Left table | No |
Inner Join (inner_join()) | Only matching rows | Yes |
Right Join (right_join()) | Right table | No |
Full Join (full_join()) | Both tables (fills missing values) | No |
The left join is the default choice in most cases because it keeps all rows from the left dataset and only adds matching values from the right. Inner joins should be used with caution because they remove rows that do not have a match.
To merge data correctly, we need keys—columns that uniquely identify records in each dataset.
"date"
."date"
and "artist"
.Without defining a proper key, the merge may not work as expected, or it may introduce duplicate rows.
Merging streams with playlists, keeping all rows from streams.
# For this exercise, let's add a week column that we can merge on
streams <- streams %>% mutate(week=ISOweek::ISOweek(date))
merged_data <- streams %>%
left_join(playlists, by = c("artist" = "artist",
"week"="iso_week")) # Keeps all stream data, adds playlist info.
# Note: here we specify that "week" in the LEFT data frame (streams) is matched on the "iso_week" column in the RIGHT data frame (playlists).
head(merged_data)
## # A tibble: 6 × 13
## # Groups: artist, country [6]
## date artist country streams trend week week2 month year streams_filled
## <date> <chr> <chr> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2024-01-01 10 Be… BE 6983 1 2024… 2024… 1 2024 6983
## 2 2024-01-01 10 Be… DE 9523 2 2024… 2024… 1 2024 9523
## 3 2024-01-01 10 Be… NL 6258 3 2024… 2024… 1 2024 6258
## 4 2024-01-01 10 Be… US 2856 4 2024… 2024… 1 2024 2856
## 5 2024-01-01 10 No… BE 7186 1 2024… 2024… 1 2024 7186
## 6 2024-01-01 10 No… DE 4173 2 2024… 2024… 1 2024 4173
## # ℹ 3 more variables: lag_streams <dbl>, daily_change <dbl>, playlists <dbl>
Why use a left join? Because it keeps every row from streams (even artist-weeks without any playlist information) and only adds matching playlist values from the right.

Next, let's try an inner join: it keeps only rows where a match exists in both streams and socials.
tiktok_followers = socials %>% filter(platform=='TikTok' & metric == 'followers')
merged_data <- streams %>%
inner_join(tiktok_followers, by = c("date", "artist")) # Match by both columns
# rename value column
merged_data = merged_data %>% mutate(tiktok_followers = value) %>% select(-platform, -metric, -value)
head(merged_data)
## # A tibble: 6 × 13
## # Groups: artist, country [6]
## date artist country streams trend week week2 month year streams_filled
## <date> <chr> <chr> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2024-01-01 10 Be… BE 6983 1 2024… 2024… 1 2024 6983
## 2 2024-01-01 10 Be… DE 9523 2 2024… 2024… 1 2024 9523
## 3 2024-01-01 10 Be… NL 6258 3 2024… 2024… 1 2024 6258
## 4 2024-01-01 10 Be… US 2856 4 2024… 2024… 1 2024 2856
## 5 2024-01-01 10 No… BE 7186 1 2024… 2024… 1 2024 7186
## 6 2024-01-01 10 No… DE 4173 2 2024… 2024… 1 2024 4173
## # ℹ 3 more variables: lag_streams <dbl>, daily_change <dbl>,
## # tiktok_followers <dbl>
Things to keep in mind:
Note: Before dplyr, R users relied on the merge() function from base R. It works similarly but is less readable and much slower. In other words: please use tidyverse/dplyr.
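For reference, a base R equivalent of the earlier left join would look roughly like this (kept here only for comparison, a sketch):

# base R version of left-joining playlists onto streams
merged_base <- merge(streams, playlists,
                     by.x = c("artist", "week"), by.y = c("artist", "iso_week"),
                     all.x = TRUE)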
For more details, check out the official dplyr
cheatsheet on joins:
Joins
Cheatsheet
Congratulations! You've tackled the fundamentals of working with time in R, from handling messy date formats to computing lags and filling in missing values. Now, it's time to put your skills to the test with a real-world challenge—one that reflects the types of problems you’d face in data analysis.
Before any meaningful analysis can happen, data needs to be cleaned, structured, and prepared—a process that often takes up more time than the actual modeling. In real-world marketing analytics, raw datasets are rarely ready for immediate use. Instead, they come in different formats, with missing values, misaligned timeframes, and inconsistencies that need to be resolved before drawing any conclusions.
In this case study, you'll work with four
datasets—streams.csv
, weather.csv
,
socials.csv
, and playlists.csv
—to assemble a
dataset ready for analysis. This involves aggregating time-based
data, handling missing values, and merging
different data sources in a way that ensures
consistency.
Throughout this process, you’ll need to make critical decisions: Should streams be summed or averaged at the weekly level? How do you ensure all artists and countries are accounted for? What’s the best way to impute missing data? These choices impact the quality and reliability of your final dataset.
Once the data is structured, you’ll take it further—visualizing trends and running artist-specific regressions. This is an opportunity to apply what you’ve learned and gain experience with the kinds of data challenges analysts deal with every day.
💡 A few things to keep in mind:
- Follow the setup-ITO structure to keep your workflow
organized.
- Work in an .R
file and save your final
dataset as final.csv
.
- Think carefully about your approach—there’s no single correct answer,
but some methods are more appropriate than others.
Load the datasets.
💡 Tip: load the files with read_csv(), just like at the beginning of this tutorial.
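The solution chunk itself isn't shown in the extracted page; it presumably mirrors the earlier calls:

library(tidyverse)
streams   <- read_csv('streams.csv')
weather   <- read_csv('weather.csv')
socials   <- read_csv('socials.csv')
playlists <- read_csv('playlists.csv')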
## Rows: 135824 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): artist, country
## dbl (1): streams
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 365 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): DE_temp, BE_temp, NL_temp, US_temp, DE_sun, BE_sun, NL_sun, US_sun
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 267072 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): artist, platform, metric
## dbl (1): value
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 4494 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): iso_week, artist
## dbl (1): playlists
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Convert daily streams into weekly aggregates using
ISO weeks. Take the mean()
(so that you
have a weekly average, which is still valid in case single observations
in a week are missing).
💡 Tip: if you prefer a weekly total that is still robust to missing days, multiply the weekly average by 7, i.e., mean()*7.
.library(lubridate)
streams_aggregated <- streams %>%
mutate(week = ISOweek::ISOweek(date)) %>%
group_by(artist, country, week) %>%
summarize(
weekly_streams = 7*mean(streams, na.rm = TRUE) # or use mean()
)
## `summarise()` has grouped output by 'artist', 'country'. You can override using
## the `.groups` argument.
Add a trend variable that increases from 1 (for the first week) to the last. We can later use this variable in a regression analysis.
streams_aggregated <- streams_aggregated %>%
group_by(artist,country) %>% arrange(artist, country, week) %>%
mutate(trend = row_number()) %>% ungroup()
# Important: group by artist and country; we can then generate row numbers for every week.
# I have built in a safeguard so that the data will be ordered by week (because, if it's not for some reason, the trend variable will be off).
Ensure that every artist-country-week combination is represented, even if no streams were recorded.

💡 Tip: use expand_grid() to create all possible combinations of artists, countries, and weeks. Check R's help to see how the command works (we haven't covered it in the tutorial so far). Just type ?expand_grid in your console.
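A sketch of what that could look like (assuming the streams_aggregated object from Exercise 2; the object name all_combinations is ours):

# all artist-country-week combinations that should exist in the final data
all_combinations <- expand_grid(
  artist  = unique(streams_aggregated$artist),
  country = unique(streams_aggregated$country),
  week    = unique(streams_aggregated$week)
)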
Join the aggregated streams data with the complete dataset from Exercise 4, and call the result complete_data.
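The solution code for this merging step isn't included here. One reasonable approach, sketched under assumptions about object names (all_combinations from the sketch above, streams_aggregated from Exercise 2) and using the reshaping and joining tools from earlier in the tutorial:

# 1. weather: wide -> long, then weekly averages of sun/temp per country
weather_weekly <- weather %>%
  pivot_longer(-date, names_to = c("country", "variable"), names_sep = "_") %>%
  mutate(week = ISOweek::ISOweek(date)) %>%
  group_by(country, variable, week) %>%
  summarize(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = variable, values_from = value)        # columns: sun, temp

# 2. socials: weekly averages per artist, one column per platform-metric combination
socials_weekly <- socials %>%
  mutate(week = ISOweek::ISOweek(date)) %>%
  group_by(artist, platform, metric, week) %>%
  summarize(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = c(platform, metric), values_from = value)  # e.g., TikTok_followers

# 3. join everything onto the complete artist-country-week grid
complete_data <- all_combinations %>%
  left_join(streams_aggregated, by = c("artist", "country", "week")) %>%
  left_join(weather_weekly, by = c("country", "week")) %>%
  left_join(socials_weekly, by = c("artist", "week"))
# (playlists could be joined in the same way, matching "week" to "iso_week")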
Impute missing values for the following variables:
- sun: na.locf()
- temp: na.locf()
- Instagram_comments: na.approx()
- Instagram_followers: na.approx()
- Snapchat_comments: na.approx()
- Snapchat_followers: na.approx()
- TikTok_comments: na.approx()
- TikTok_followers: na.approx()
- X_comments: na.approx()
- X_followers: na.approx()
# note: assign the result back (complete_data <- complete_data %>% ...) if you want to keep the imputed columns
complete_data %>% group_by(artist, country) %>%
arrange(artist, country, week) %>%
mutate(sun = na.locf(sun, na.rm=F),
temp = na.locf(temp, na.rm=F),
Instagram_comments = na.approx(Instagram_comments, na.rm=F),
Instagram_followers = na.approx(Instagram_followers, na.rm=F),
Snapchat_comments = na.approx(Snapchat_comments, na.rm=F),
Snapchat_followers = na.approx(Snapchat_followers, na.rm=F),
TikTok_comments = na.approx(TikTok_comments, na.rm=F),
TikTok_followers = na.approx(TikTok_followers, na.rm=F),
X_comments = na.approx(X_comments, na.rm=F),
X_followers = na.approx(X_followers, na.rm=F)) %>%
ungroup()
## # A tibble: 22,684 × 15
## artist country week weekly_streams trend sun temp Instagram_comments
## <chr> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 10 Beats t… BE 2024… 23582. 1 5.87 12.6 268948.
## 2 10 Beats t… BE 2024… 18361 2 5.82 12.7 272486.
## 3 10 Beats t… BE 2024… 12335. 3 6.72 16.3 276559.
## 4 10 Beats t… BE 2024… 18779. 4 6.43 16.8 280275.
## 5 10 Beats t… BE 2024… 11993. 5 7.29 17.5 284042.
## 6 10 Beats t… BE 2024… 12023. 6 6.95 17.4 287862
## 7 10 Beats t… BE 2024… 17969 7 6.77 17.5 291993
## 8 10 Beats t… BE 2024… 53596. 8 7.26 18.9 296186.
## 9 10 Beats t… BE 2024… 20325. 9 7.26 20.0 300070
## 10 10 Beats t… BE 2024… 11753 10 7.34 20.8 304335.
## # ℹ 22,674 more rows
## # ℹ 7 more variables: Instagram_followers <dbl>, Snapchat_comments <dbl>,
## # Snapchat_followers <dbl>, TikTok_comments <dbl>, TikTok_followers <dbl>,
## # X_comments <dbl>, X_followers <dbl>
Generate line plots for the first 10 artists and
save them using ggsave()
. Use facet plots
(facet_wrap()
, see week 1) to also plot streams by
different countries.
# starting code
library(ggplot2)
dir.create('plots')
complete_data <- complete_data %>% mutate(week_start = ISOweek::ISOweek2date(paste0(week, '-1')))
artists = unique(streams$artist)
for (one_artist in artists[1:10]) {
plot_data <- complete_data %>% filter(artist == one_artist)
# ggplot code here
}
library(ggplot2)
dir.create('plots')
artists <- unique(complete_data$artist)
complete_data <- complete_data %>% mutate(week_start = ISOweek::ISOweek2date(paste0(week, '-1')))
for (one_artist in artists[1:10]) {
plot_data <- complete_data %>% filter(artist == one_artist)
p <- ggplot(plot_data, aes(x = week_start, y = weekly_streams)) +
geom_line() +
facet_wrap(~country) +
labs(title = paste("Streams for", one_artist))
ggsave(paste0("plots/", one_artist, ".png"), plot = p, width = 6, height = 4)
}
Estimate separate regression models for the first 10 artists and summarize results. Use a loop, as in the code snippet below (later on, you could estimate these models for all 107 artists).
In your regression model, use:
- log(weekly_streams+1) as the dependent variable
- country fixed effects (as.factor(country))
- sun and temp (not log-transformed)
- the social media metrics, log-transformed (log(x+1))

# starting code
artists = unique(complete_data$artist)
results = lapply(artists[1:10], function(one_artist) {
filtered_data = complete_data %>% filter(artist==one_artist)
# estimation code here
m <- lm(rnorm(10)~1, data = filtered_data) # just an example
results = data.frame(variable = names(m$coefficients),
summary(m)$coefficients)
colnames(results) <- c('variable','estimate','se','t','pval')
results = tibble(results) %>% mutate(artist=one_artist)
})
all_results = bind_rows(results)
artists = unique(complete_data$artist)
results = lapply(artists[1:10], function(one_artist) {
filtered_data = complete_data %>% filter(artist==one_artist)
m <- lm(log(weekly_streams+1) ~ 1 +
as.factor(country) +
sun + temp +
log(Instagram_comments+1) + log(Instagram_followers) +
log(Snapchat_comments+1) + log(Snapchat_followers) +
log(TikTok_comments+1) + log(TikTok_followers) +
log(X_comments+1) + log(X_followers), data= filtered_data)
results = data.frame(variable = names(m$coefficients),
summary(m)$coefficients)
colnames(results) <- c('variable','estimate','se','t','pval')
results = tibble(results) %>% mutate(artist=one_artist)
return(results)
})
all_results = bind_rows(results)
# all results now contains the results of all individual regressions;
# you could now "filter" through - e.g., to create a histogram of the effects of sun!
all_results %>% filter(grepl('sun', variable)) %>%
  ggplot() + geom_histogram(aes(x = estimate)) + ggtitle('Histogram of the effect of sun hours on streams')
Through these exercises, you’ll gain hands-on experience with:
- Aggregating and transforming time-based data
- Handling missing values appropriately
- Merging datasets with different time
granularities
- Automating plots and regression analysis for multiple
artists
This workflow mirrors real-world data science challenges, where cleaning and structuring data is just as important as modeling.
Good job!