In this tutorial, we'll take data preparation to the next level by considering a variety of common data issues in the Spotify Top 200 rankings. We'll start off with multiple raw big data sets and end up with a single data frame ready for analysis!


Support Needed?

For technical issues outside of scheduled classes, please check the support section on the course website.


1. Loading and Inspecting the Data

1.1 Merging Data

Spotify Charts has listed the daily top 200 most streamed songs by country since 2017. On the website, you can use the dropdown menus to change your selection, and clicking in the top right corner returns the data in CSV format.

For this exercise, we have collected the rankings for 2020, which have been split into three separate files:

Streams

Attribute    Definition                                                            Example
song_id      An identifier associated with each song                               22403119
date         The date on which the ranking was composed                            2020-01-01 00:00:00+00:00
popularity   A popularity score between 0 and 100 on a given date                  95
streams      The total number of streams on Spotify in a country on a given date   544851
country      Country abbreviation                                                  US

Songs

Attribute    Definition                                       Example
song_id      An identifier associated with each song          22403119
song_name    The name of the song                             WAP (feat. Megan Thee Stallion)
artists      A list of comma-separated artists of the song    Cardi B,Megan Thee Stallion
genres       A list of comma-separated genres of the song     pop,rap,houston rap,pop,trap queen

Country Codes

Attribute              Definition                                  Example
continent              The continent associated with the country   North America
country                The name of the country                     United States of America
country_code_2_letter  The two-letter country code                 US
country_code_3_letter  The three-letter country code               USA
country_number         The official ISO country code               840


Exercise 1
Download the data sets and describe their structure in your own words.

# downloading and unzipping the data
download.file('https://github.com/hannesdatta/course-dprep/raw/master/content/docs/modules/week4/tutorial/data_without_duplicates.zip', 'data.zip')
unzip('data.zip', overwrite=TRUE)
  • What is the target unit of analysis?
  • Do columns seem complete?
  • Are there any values that do not make sense?
  • Are “strings” properly encoded?
  • What are common columns across the three files?

Please keep in mind that opening the data in a spreadsheet program such as Excel only shows you the first 1,048,576 rows (R, however, can load all the data!).

The streams data set contains the top 200 songs on a given date in a country. The popularity and song_id columns have missing values, the ranks all lie between 1 and 200, and the number of streams is at least 1,000. The country column can be joined on country_code_2_letter from the country_codes file. Similarly, song_id can be joined on song_id from songs.csv. The song_name column has some non-ASCII characters (e.g., 說好不哭) here and there. Roughly one sixth of all songs are not associated with any genre tags, 17 songs don't have a song_id, and for 9 songs the artist is missing.
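
These checks are easy to reproduce once the files are loaded into R; a minimal sketch (assuming the same file names and readr calls as in the solution to Exercise 2 below):

library(tidyverse)

streams <- read_csv("streams.csv")
songs <- read_csv("songs.csv")
country_codes <- read_delim("country_codes.csv", delim = ";")

# missing values per column
colSums(is.na(streams))
colSums(is.na(songs))

# plausibility checks: popularity between 0 and 100, streams at least 1,000
range(streams$popularity, na.rm = TRUE)
min(streams$streams, na.rm = TRUE)

# share of songs without genre tags (assuming missing tags are stored as NA)
mean(is.na(songs$genres))

# common columns across the files
intersect(colnames(streams), colnames(songs))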


Exercise 2
Merge all three files into a single data frame df and export the result as merged_df.csv. Keep in mind the following instructions:

  • Use the join functions of dplyr and aim for concise code (tip: look up the Datacamp tutorials for the syntax!).
  • Note that CSV-files are separated by commas (hence the name comma-separated-value!), but they can also be separated by other delimiters (e.g., ;). Rather than using read_csv(), you can use read_delim() and explicitly tell R which separation character (delim=';') to use!
  • Examine the effect of picking a left, right, and inner join, and how that affects the completeness of your data. Strive for a final data frame that captures as much of the original data as possible.
  • To avoid confusion between overlapping column names (e.g., country appears in both streams.csv and country_codes.csv), you may consider adding suffixes.
# install.packages("tidyverse") - run this command if you have never used the package before
library(tidyverse)

# input
streams <- read_csv("streams.csv")
songs <- read_csv("songs.csv")
country_codes <- read_delim("country_codes.csv", delim = ";")

# transformation
df <- streams %>% 
  inner_join(songs, by = "song_id") %>%
  left_join(country_codes, by = c("country" = "country_code_2_letter"), suffix = c("_streams", "_cc"))

# Since countries in Oceania are missing in country_codes.csv we use a left join here (otherwise you'd miss out on +/- 139K records)

# output
write_csv(df, "merged_df.csv")
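
To see why a left join is used for country_codes, you can compare how many rows survive each join type; a quick sketch (run it with the streams and country_codes objects loaded above):

# rows in the original streams data
nrow(streams)

# rows kept with an inner vs. a left join on country_codes; the difference
# (+/- 139K rows) corresponds to countries missing from country_codes.csv,
# such as those in Oceania
streams %>% inner_join(country_codes, by = c("country" = "country_code_2_letter")) %>% nrow()
streams %>% left_join(country_codes, by = c("country" = "country_code_2_letter")) %>% nrow()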


1.2 Duplicates


Exercise 3
A track can appear in multiple top 200 lists across countries (e.g., both in the Netherlands and in Belgium) and for sustained periods of time (e.g., both yesterday and today). For that reason, a song_id may occur repeatedly within the data frame. Yet it can also happen that exactly the same record appears more than once. For example, the scraper that collected the data in the first place might have run twice, duplicating some of the records, or we accidentally made a copy-paste mistake ourselves. Check whether such duplicates are present in df and, if so, remove them. Export the result as df_no_duplicates.csv.


# input
df <- read_csv("merged_df.csv")

# transformation: remove exact duplicate rows, keeping the first occurrence
df <- df %>% distinct()

# output
write_csv(df, "df_no_duplicates.csv")
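
A quick sanity check (sketch) to confirm that the duplicates are gone:

# no exact duplicates should remain after the transformation above
sum(duplicated(df))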


1.3 Data Imputation

In roughly half of the cases, the popularity column is missing, which turns out to be troublesome once you move towards the model-building phase. Therefore, we ask you to impute these missing values with the median popularity score of the country on a given date. Here's a simple example to illustrate:

song_name  country         date        popularity  popularity_filled
A          United Kingdom  01-01-2020  91          91
B          United Kingdom  01-01-2020  66          66
C          United Kingdom  01-01-2020  NA          66
D          United Kingdom  01-01-2020  62          62
E          United Kingdom  02-01-2020  79          79
A          Netherlands     01-01-2020  80          80

The popularity_filled column has been added; it takes on the value of the popularity column when present, and otherwise the missing value (NA) is replaced with its imputed value. In this example, the popularity of song C in the United Kingdom has been computed as the median of {62, 66, 91} = 66. Note that it does not take into account song E in the United Kingdom (different date) or song A in the Netherlands (different country).
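
You can verify the toy example directly in R:

# the median of the three observed UK scores on 01-01-2020
median(c(62, 66, 91))
## [1] 66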

Exercise 4
Implement this data imputation strategy using the filter, group_by, and summarise functions from the dplyr package and export the result as df_imputed.csv. We recommend first computing the imputation values (for each country on each day), and then replacing missing values in the popularity column with their matching counterparts. Evaluate your results in terms of the pros and cons of this approach and try to come up with an alternative imputation strategy (incl. arguments why you prefer that approach).

# input
df <- read_csv("df_no_duplicates.csv")

# transformation
imputation_values <- df %>%
  filter(!is.na(popularity)) %>%
  group_by(country, date) %>%
  summarise(popularity_filled = median(popularity), .groups = "drop")

df_joined <- df %>% left_join(imputation_values, by = c("country", "date"))

df <- df_joined %>% mutate(popularity_filled = ifelse(is.na(popularity), popularity_filled, popularity))

# The main disadvantage is that missing records are always assigned a median popularity score (so never an abnormally low or high one). Given the sheer number of missing records, this can strongly distort the distribution of the scores.
# Since the popularity score is country-independent and is either present for all of a song's records or missing everywhere (for every country and time period), we should find a way to group similar songs together. For example, if the artist has other songs for which the popularity score is known, we could take the median of those scores, or we could look for songs with similar genre tags.

# output
write_csv(df, "df_imputed.csv")
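
A rough sketch of the artist-based alternative mentioned in the comments above (it assumes the artists column from songs.csv survived the merge; treat it as one possible variant, not the reference solution):

# impute missing popularity scores with the artist-level median instead
artist_medians <- df %>%
  filter(!is.na(popularity)) %>%
  group_by(artists) %>%
  summarise(artist_median = median(popularity), .groups = "drop")

df_alt <- df %>%
  left_join(artist_medians, by = "artists") %>%
  mutate(popularity_filled = coalesce(popularity, artist_median))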

2. Feature Engineering / Operationalizing Variables


2.1 Trend Variable

More likely than not, you have come across the lm() (i.e., linear model) function before during your studies. If so, you also know that a regression model can only interpret numeric data. Categorical fields are typically converted into so-called dummy variables that take on either the value 0 or 1. In a similar vein, dates are often converted into trend variables as input for a time-series model. For example, in our data set 1 January 2020 becomes 1 and 15 December 2020 (the last entry of df) becomes 350. Follow the steps below to create this trend variable for the date column of df:


Exercise 5
1. Study the documentation of the seq function to create a sequence of dates (see the examples at the bottom of the help page).
2. Create a data frame dates with two columns: date (i.e., the sequence of dates) and counter (i.e., a numeric value between 1 and 350). Tip: you may first need to convert the date column into character format (as.character(date)).
3. Merge df and dates and inspect your results (which type of join do you use?)
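
To get a feel for seq() with dates before you start, here is a tiny illustration (see ?seq.Date):

seq(as.Date("2020-01-01"), as.Date("2020-01-05"), by = "day")
## [1] "2020-01-01" "2020-01-02" "2020-01-03" "2020-01-04" "2020-01-05"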

# Step 1
date <- seq(min(as.Date(df$date)), max(as.Date(df$date)), by = "day")

# Step 2
counter <- 1:length(date)
dates <- data.frame(counter=counter, date=date)

# Step 3 (ensure both date columns are of type Date before joining)
df <- df %>%
  mutate(date = as.Date(date)) %>%
  inner_join(dates, by = "date")
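
A quick check (sketch) that the trend variable behaves as described above:

# counter should run from 1 (1 January 2020) to 350 (15 December 2020)
range(df$counter)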


2.2 Ranks

Even though the data frame represents the Spotify top 200 charts, none of the columns indicates the rank of a given track. We can easily derive it from the number of streams, the date, and the country column. The built-in R function rank returns a vector of ranks for a given input. By default, the lowest value gets rank 1 (e.g., in the example below, the value 1 becomes rank 1); you can reverse the order by adding a minus sign (-) in front of the input vector:

rank_example = c(4, 9, 3, 2, 1)
# regular 
print(rank(rank_example))
## [1] 4 5 3 2 1
# reversed
print(rank(-rank_example)) 
## [1] 2 1 3 4 5
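
Ties deserve a closer look for Exercise 6 below: the ties.method argument of rank() (see ?rank) controls how equal values are ranked. A small illustration:

ties_example <- c(10, 20, 20, 30)
rank(-ties_example)                          # default "average": ties share the mean rank
## [1] 4.0 2.5 2.5 1.0
rank(-ties_example, ties.method = "min")     # ties share the lowest rank
## [1] 4 2 2 1
rank(-ties_example, ties.method = "random")  # ties broken at random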


Exercise 6
Add a ranking column to df using the dplyr package and export the result as df_ranking.csv. Although it is unlikely that some tracks have exactly the same number of streams, think about how you will deal with these edge cases (tip: run ?rank to look up the documentation). Does your ranking always run from 1 to 200? Why is that? Investigate the root cause and fix it!

# input
df <- read_csv("df_imputed.csv")

# transformation
df <- df %>% group_by(country, date) %>%
  mutate(ranking = rank(-streams, ties.method = "random"))

# The maximum ranking is 400 (rather than 200).
# Once we inspect one of the listings in df[df$ranking >= 200, ], we find that these records typically contain redundant data hidden in between. For example, for df[(df$date == "2020-05-17") & (df$country == "HK"), ] we discover that ranking 59 appears 17 times (!), which can be attributed to missing song_id values.

# remove rows without a song_id (if you then recreate the ranking column -- see the quick check below -- the maximum ranking becomes 200!)
df <- df %>% filter(!is.na(song_id))
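
# quick check (sketch): recreate the ranking after the filter -- the maximum
# should now be 200 (df is still grouped by country and date from above)
df %>%
  mutate(ranking = rank(-streams, ties.method = "random")) %>%
  pull(ranking) %>%
  max()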

# output
write_csv(df, "df_ranking.csv")


2.3 Reshaping / pivoting longer or wider

The lion's share of tracks are assigned to more than one genre. For example, the famous song "Señorita" by Shawn Mendes and Camila Cabello is categorized as pop, but also as canadian pop, post-teen pop, viral pop, and dance pop. The data frame stores these records as a list of comma-separated genres. On a high level, we would like to move from this state:

song_name  genres
Señorita   pop, canadian pop, post-teen pop, viral pop, dance pop


To a format in which the columns represent genre tag words and the table values indicate which genre(s) belong to which song_name:

song_name  k-pop  pop  canadian pop  post-teen pop  viral pop  dance pop
Señorita   0      1    1             1              1          1
Memories   0      1    0             0              0          0
Psycho     1      0    0             0              0          0


To this end, we first need to separate the genres column of df into individual genres. We can use the built-in R function strsplit for that in the following way (note the double brackets in [[1]]!):


genres_example <- "pop,canadian pop,post-teen pop,viral pop,dance pop"
genres_example_split <- strsplit(genres_example, ",")
genres_example_split[[1]]
## [1] "pop"           "canadian pop"  "post-teen pop" "viral pop"    
## [5] "dance pop"

After some data wrangling we can transform the data into the following format:

song_id   song_genre
13594414  pop
13594414  canadian pop
13594414  post-teen pop
13594414  viral pop
13594414  dance pop
13977556  pop
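
One way to carry out that wrangling is sketched below; it uses tidyr's separate_rows() (an strsplit() plus unnest() route works just as well), and the trimws() and distinct() calls are simply there to tidy up stray whitespace and repeated tags:

genres_long <- df %>%
  select(song_id, genres) %>%
  distinct() %>%
  filter(!is.na(genres)) %>%
  separate_rows(genres, sep = ",") %>%      # one row per song-genre combination
  mutate(song_genre = trimws(genres)) %>%
  distinct(song_id, song_genre)

# from this long format, pivot_wider() yields the 0/1 genre table shown earlier
genre_dummies <- genres_long %>%
  mutate(value = 1) %>%
  pivot_wider(id_cols = song_id, names_from = song_genre,
              values_from = value, values_fill = 0)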


Exercise 7
As a last step, we use the pivot_longer() and pivot_wider() functions to change the format of our data frame (see, e.g., this overview or these code snippets). We then store the results in df_reshaped.csv.

Following these steps, it's your turn to create a data set, with (a) rows being the dates, (b) columns being the song IDs, and (c) the cells indicating a song's sum of streams.


# load input
df <- read_csv("df_ranking.csv")

song_popularity <- df %>% pivot_wider(id_cols = date, names_from = 'song_id', values_from = 'streams', values_fn = sum)

# let us look at the columns (--> these are song IDs!)
colnames(song_popularity)

# let us look at the rows (--> these are dates!)
song_popularity$date[1:10]

# output
write_csv(song_popularity, "df_reshaped.csv")