In this tutorial, we'll take data preparation to the next level by considering a variety of common data issues in the Spotify Top 200 rankings. We'll start off with multiple raw big data sets and end up with a single data frame ready for analysis!
For technical issues outside of scheduled classes, please check the support section on the course website.
Spotify Charts has listed the daily top 200 most streamed songs by country since 2017. As you can see below, you can use the dropdown menus to change your selection, and clicking in the top right corner returns the data in CSV-format.
For this exercise we have collected the rankings for 2020 which have been split up into 3 separate files:
Attribute | Definition | Example |
---|---|---|
song_id | An identifier associated with each song | 22403119 |
date | The date on which the ranking was composed | 2020-01-01 00:00:00+00:00 |
popularity | A popularity score between 0 and 100 on a given date | 95 |
streams | The total number of streams on Spotify in a country on a given date | 544851 |
country | Country abbreviation | US |
Attribute | Definition | Example |
---|---|---|
song_id | An identifier associated with each song | 22403119 |
song_name | The name of the song | WAP (feat. Megan Thee Stallion) |
artists | A list of comma-separated artists of the song | Cardi B,Megan Thee Stallion |
genres | A list of comma-separated genres of the song | pop,rap,houston rap,pop,trap queen |
Attribute | Definition | Example |
---|---|---|
continent | The continent associated with the country | North America |
country | The name of the country | United States of America |
country_code_2_letter | The two-letter country code | US |
country_code_3_letter | The three-letter country code | USA |
country_number | The official ISO country code | 840 |
Exercise 1
Download the data
sets and describe the structure of the data sets in your own
words.
# downloading and unzipping the data
download.file('https://github.com/hannesdatta/course-dprep/raw/master/content/docs/modules/week4/tutorial/data_without_duplicates.zip', 'data.zip')
unzip('data.zip', overwrite=TRUE)
Please keep in mind that opening the data into a spreadsheet program such as Excel only shows you the first 1.05 million records or so (R, however, can load in all the data!).
The streams data set contains the top 200 songs on given date in a
country. The popularity and song id columns have missing values, the
ranks are all between 1 and 200, and the number of streams seems is at
least 1000. The country
column can be joined on the
country_code_2_letter
from the country_codes file.
Similarly, the song_id
can be joined on the
song_id
from songs.csv. The song_name
column
has some non-ASCII characters (e.g., 說好不哭
) here and
there. Roughly one sixth of all songs is not associated with any genre
tags, 17 songs don't have any song id, and for 9 songs the artist is
missing.
Exercise 2
Merge all three files into a single data frame df
and
export the result as merged_df.csv
. Keep in mind the
following instructions:
dplyr
and aim for concise
code (tip: look up the Datacamp tutorials for the syntax!).;
). Rather than using
read_csv()
, you can use read_delim()
and
explicitly tell R which separation character (delim=';'
) to
use!streams.csv
and
country_codes.csv
), you may consider adding suffixes.# install.packages("tidyverse") - run this command if you have never used the package before
library(tidyverse)
# input
streams <- read_csv("streams.csv")
songs <- read_csv("songs.csv")
country_codes <- read_delim("country_codes.csv", delim = ";")
# transformation
df <- streams %>%
inner_join(songs, by = "song_id") %>%
left_join(country_codes, by = c("country" = "country_code_2_letter"), suffix = c("_streams", "_cc"))
# Since countries in Oceania are missing in country_codes.csv we use a left join here (otherwise you'd miss out on +/- 139K records)
# output
write_csv(df, "merged_df.csv")
Exercise 3
A track can appear in multiple top 200 lists across countries (e.g.,
both in the Netherlands and in Belgium) for sustained periods of time
(e.g., yesterday and today). For that reason, a song_id
may
occur repeatedly within the data frame. Yet it can also happen that
exactly the same record appears more than once. For example, the scraper
that collected the data in the first place might have run twice which
duplicated some of the records, or we accidentally made a copy-paste
mistake ourselves. Check whether such duplicates are present in
df
and if so remove them. Export the result as
df_no_duplicates.csv
.
#input
df <- read_csv("merged_df.csv")
# transformation
df <- df %>% filter(!duplicated(df))
# output
write_csv(df, "df_no_duplicates.csv")
In roughly half of the cases, the popularity
column is
missing which turns out to be troublesome once you move towards a model
building phase. Therefore, we ask you to impute these missing values by
the median popularity score of the country on a given date. Here's a
simple example to illustrate:
song_name | country | date | popularity | popularity_filled |
---|---|---|---|---|
A | United Kingdom | 01-01-2020 | 91 | 91 |
B | United Kingdom | 01-01-2020 | 66 | 66 |
C | United Kingdom | 01-01-2020 | NA |
66 |
D | United Kingdom | 01-01-2020 | 62 | 62 |
E | United Kingdom | 02-01-2020 | 79 | 79 |
A | Netherlands | 01-01-2020 | 80 | 80 |
The popularity_filled
column has been added which takes
on the value of the popularity
column if present and
otherwise missing data (NA
) are replaced with their imputed
values. In this example, the popularity of song C
in the
United Kingdom has been computed as the median of
{62, 66, 91} = 66
. Note that it does not take into account
song E
in the United Kingdom (different date) and song
A
in the Netherlands (different country).
Exercise 4
Implement this data imputation strategy using the filter
,
groupby
, and summarise
functions from the
dplyr
package and the result as
df_imputed.csv
. We recommend first computing the imputation
values (for each country across the days), and then replacing missing
values in the popularity
column with their matching
counterparts. Evaluate your results in terms of the pros and cons of
this approach and try to come up with an alternative imputation strategy
(incl. arguments why you prefer that approach).
# input
df <- read_csv("df_no_duplicates.csv")
# transformation
imputation_values <- df %>% filter(!is.na(popularity)) %>%
group_by(country, date) %>%
summarise(popularity_filled = median(popularity))
df_joined <- df %>% left_join(imputation_values, by = c("country" = "country", "date" = "date"))
df <- df_joined %>% mutate(popularity_filled = ifelse(is.na(popularity), popularity_filled, popularity))
# The main disadvantage is that missing records will always be assigned a median popularity score (so never abnormally low or high). Given the sheer amount of missing records, this would highly distort the distribution of the scores.
# Since the popularity score is country independent and it is either present for all records or not available for any country (or time period), we should find a way to group songs together. For example, if the artist has other songs for which the popularity score is known we could take the median of those scores, or we could look for songs with similar genre tag words.
# output
write_csv(df, "df_imputed.csv")
More likely than not, you have come across the lm()
(i.e., linear model) function once before during your studies. Then, you
also know that a regression model can only interpret numeric data.
Categorical fields are typically converted into so-called dummy
variables that take on either the value 0
or
1
. In a similar vein, dates are often converted into trend
variables as input for a time-series model. For example, in our dataset
January the 1st, of 2020 becomes 1 and the 15th of December 2020 (the
last entry of df
) becomes 350. Follow the steps below to
create this trend variable for the date
column of
df
:
Exercise 5
1. Study the documentation
of the seq
function to create a sequence of dates (see
examples at the bottom of the page). 2. Create a data frame
dates
with two columns: date
(i.e., the
sequence of dates) and counter
(i.e., a numeric value
between 1 and 350). Tip: you may first need to convert the
dates
column into character format
(as.charcter(date)
). 3. Merge df
and
dates
and inspect your results (which type of join do you
use?)
# Step 1
date <- seq(min(as.Date(df$date)), max(as.Date(df$date)), by = "day")
# Step 2
counter <- 1:length(date)
dates <- data.frame(counter=counter, date=date)
# Step 3
df <- df %>% inner_join(dates, by = "date")
Even though the data frame represents the Spotify top 200 charts,
none of the columns indicates the rank of a given track. We can easily
derive it from the number of streams, the date
, and the
country
column. The built-in R function ranks
returns a vector of ranks for a given input. By default, the lowest
digit gets the highest rank (e.g., 1 becomes rank 1), you can reverse
the order by adding a minus sign (-
) in front of the input
vector:
rank_example = c(4, 9, 3, 2, 1)
# regular
print(rank(rank_example))
## [1] 4 5 3 2 1
# reversed
print(rank(-rank_example))
## [1] 2 1 3 4 5
Exercise 6
Add a ranking
column to df
using the
dplyr
package and export the result as
df_ranking.csv
. Although it is unlikely that some tracks
have exactly the same number of streams, think about how you will deal
with these edge cases (tip: run ?rank
to look up the
documentation). Does your ranking always run from 1 to 200? Why is that?
Investigate the root cause and fix it!
# input
df <- read_csv("df_imputed.csv")
# transformation
df <- df %>% group_by(country, date) %>%
mutate(ranking = rank(-streams, ties.method = "random"))
# The maximum ranking is 400 (rather than 200)
# Once we inspect one of the listings of df[df$ranking >= 200, ] we find that these records typically have redundant data somewhere hidden in between. For example, for df[(df$date == "2020-05-17") & (df$country == "HK"),] we discover that ranking 59 appears 17 times (!) which can be attributed to missing values `song_id`
# remove rows without song_id (if you then recreate the ranking column you'll come to the conclusion the maximum ranking has become 200!)
df <- df %>% filter(!is.na(song_id))
# output
write_csv(df, "df_ranking.csv")
The lion share of tracks are assigned to more than one genre. For
example, the famous song "Señorita" by Shawn Mendes and Camilo Cabello
is categorized as pop
, but also as
canadian pop
, post-teen pop
,
viral pop
, and dance pop
. The data frame
stores these records as a list of comma-separated genres. On a high
level, we like to move from this state:
song_name | genres |
---|---|
Señorita | pop, canadian pop, post-teen pop, viral pop, dance pop |
To a format in which the columns represent genre tag words and the
table values indicate which genre(s) belong to which
song_name
:
song_name | k-pop | pop | canadian pop | post-teen pop | viral pop | dance pop |
---|---|---|---|---|---|---|
Señorita | 0 | 1 | 1 | 1 | 1 | 1 |
Memories | 0 | 1 | 0 | 0 | 0 | 0 |
Psycho | 1 | 0 | 0 | 0 | 0 | 0 |
To this end, we first need to separate the genres
column
of df
into individual genres. We can use the built-in R
function strsplit
for that in the following way (note the
double brackets [[]]
around 1
!):
genres_example <- c("pop", "canadian pop", "post-teen pop", "viral pop", "dance pop")
genres_example_split <- split(genres_example, ",")
genres_example_split[[1]]
## [1] "pop" "canadian pop" "post-teen pop" "viral pop"
## [5] "dance pop"
After some data wrangling we can transform the data into the following format:
song_id | song_genre |
---|---|
13594414 | pop |
13594414 | canadian pop |
13594414 | post-teen pop |
13594414 | viral pop |
13594414 | dance pop |
13977556 | pop` |
Exercise 7
As a last step, we use the pivot_longer()
and
pivot_wider()
functions to change the format of our data
frame (see, e.g., this
overview or these code
snippets). We then store the results in
df_reshaped.csv
.
Following these steps, it's your turn to create a data set, with (a) rows being the dates, (b) columns being the song IDs, and (c) the cells indicating a song's sum of streams.
# load input
df <- read_csv("df_ranking.csv")
song_popularity = df %>% pivot_wider(id_cols = date, names_from = 'song_id', values_from = 'streams', values_fn = sum)
# let us look at the columns (--> these are song IDs!)
colnames(song_popularity)
# let us look at the rows (--> these are dates!)
song_popularity$date[1:10]
# output
write_csv(song_popularity, "df_reshaped.csv")