In this tutorial, we'll take data preparation to the next level by considering a variety of common data issues in the Spotify Top 200 rankings. We'll start off with multiple raw big data sets and end up with a single data frame ready for analysis!
Ready to start? Here's what we recommend you do before going through this tutorial.
Spotify Charts has listed the daily top 200 most streamed songs by country since 2017. As you can see below, you can use the dropdown menus to change your selection, and clicking in the top right corner returns the data in CSV format.
For this exercise we have collected the rankings for 2020 which have been split up into 3 separate files:
Attribute | Definition | Example |
---|---|---|
song_id | An identifier associated with each song | 22403119 |
date | The date on which the ranking was composed | 2020-01-01 00:00:00+00:00 |
popularity | A popularity score between 0 and 100 on a given date | 95 |
streams | The total number of streams on Spotify in a country on a given date | 544851 |
country | Country abbreviation | US |
Attribute | Definition | Example |
---|---|---|
song_id | An identifier associated with each song | 22403119 |
song_name | The name of the song | WAP (feat. Megan Thee Stallion) |
artists | A list of comma-separated artists of the song | Cardi B,Megan Thee Stallion |
genres | A list of comma-separated genres of the song | pop,rap,houston rap,pop,trap queen |
Attribute | Definition | Example |
---|---|---|
continent | The continent associated with the country | North America |
country | The name of the country | United States of America |
country_code_2_letter | The two-letter country code | US |
country_code_3_letter | The three-letter country code | USA |
country_number | The official ISO country code | 840 |
Create a new project in RStudio via File -> New Project -> New Directory -> New Project. Name the project `data-preparation` and save it in a location of your choice. Then store the dataset in a `data` folder: click here to manually download the dataset. Alternatively, run the following R code to download and unzip the file automatically:
# downloading and unzipping the data
download.file('https://github.com/hannesdatta/course-dprep/raw/master/content/docs/modules/week4/tutorial/data_without_duplicates.zip', 'data.zip')
unzip('data.zip', overwrite=TRUE)
Now, load the data into R using readr's `read_csv` and `read_delim` functions:

Tip: CSV files are separated by commas (hence the name comma-separated values!), but they can also be separated by other delimiters (e.g., `;`). Rather than using `read_csv()`, you can use `read_delim()` and explicitly tell R which separation character (`delim=';'`) to use!
#install.packages("readr") # uncomment and run this if you have not installed the `readr` package in R
library(readr)

streams <- read_csv("streams.csv")
songs <- read_csv("songs.csv") # file name assumed; adjust to where you stored the data
country_codes <- read_delim("country_codes.csv", delim = ";") # this file uses ';' as delimiter
## Rows: 278724 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (3): song_id, popularity, streams
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 50268 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): song_name, artists, genres
## dbl (1): song_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 258 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (4): continent, country, country_code_2_letter, country_code_3_letter
## dbl (1): country_number
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Troubleshooting: avoid `setwd()`; hardcoding directories can cause problems!

Your task is to examine the datasets and identify the following:
Tips:
Take a look at the data using the function
head(streams)
, or View(streams)
. If you'd like
to view more rows with head
, use
head(streams, 100)
(for, e.g., the first 100 rows).
(tail()
gives you the last rows of the data).
The command summary(songs)
or
glimpse(songs)
generates descriptive statistics for all
variables in the data. You can also use this command on individual
columns (e.g., summary(streams$date)
).
Character or factor columns are best inspected using the
table()
command. These will create frequency
tables.
head(streams)
## # A tibble: 6 × 5
## song_id date country popularity streams
## <dbl> <date> <chr> <dbl> <dbl>
## 1 12968774 2020-01-01 BE 100 61421
## 2 16667372 2020-01-01 BE 100 32758
## 3 17236587 2020-01-01 BE 99 31336
## 4 17235820 2020-01-01 BE NA 28651
## 5 13586280 2020-01-01 BE NA 28342
## 6 13977556 2020-01-01 BE NA 26884
head(songs)
## # A tibble: 6 × 4
## song_id song_name artists genres
## <dbl> <chr> <chr> <chr>
## 1 10139587 Sunflower - Spider-Man: Into the Spider-Verse Post M… dfw r…
## 2 10001151 thank u, next Ariana… dance…
## 3 10138723 Everything I Need (From Aquaman: Original Motion Pict… Skylar… piano…
## 4  9920215 有一種悲傷 - 電影《比悲傷更悲傷的故事》主題曲 A-Lin mando…
## 5  9893795 Without Me Halsey dance…
## 6 10175311 Wow. Post M… dfw r…
head(country_codes)
## # A tibble: 6 × 5
## continent country country_code_2_letter country_code_3_letter country_number
## <chr> <chr> <chr> <chr> <dbl>
## 1 Asia       Afghani… AF                    AFG                                4
## 2 Europe     Albania… AL                    ALB                                8
## 3 Antarctica Antarct… AQ                    ATA                               10
## 4 Africa     Algeria… DZ                    DZA                               12
## 5 Oceania    America… AS                    ASM                               16
## 6 Europe     Andorra… AD                    AND                               20
# 1. Identify target unit of analysis
# Ans: The streams data set contains the top 200 songs on a given date in a country.
# 2. Missing values
colSums(is.na(streams)) # popularity and song_id have missing values
## song_id date country popularity streams
## 37 0 0 125553 0
colSums(is.na(songs))
## song_id song_name artists genres
## 1 1 0 7491
colSums(is.na(country_codes))
## continent country country_code_2_letter
## 0 0 1
## country_code_3_letter country_number
## 0 0
# 3. Check summary statistics
summary(streams)
## song_id date country popularity
## Min. : 3361 Min. :2020-01-01 Length:278724 Min. : 1.00
## 1st Qu.:13594414 1st Qu.:2020-03-28 Class :character 1st Qu.: 25.00
## Median :18113962 Median :2020-06-23 Mode :character Median : 50.00
## Mean :16652602 Mean :2020-06-23 Mean : 50.43
## 3rd Qu.:20942905 3rd Qu.:2020-09-19 3rd Qu.: 75.00
## Max. :24385062 Max. :2020-12-15 Max. :100.00
## NA's :37 NA's :125553
## streams
## Min. : 4788
## 1st Qu.: 24504
## Median : 70712
## Mean : 146753
## 3rd Qu.: 224519
## Max. :4175284
##
summary(songs)
## song_id song_name artists genres
## Min. : 3304 Length:50268 Length:50268 Length:50268
## 1st Qu.:10422553 Class :character Class :character Class :character
## Median :15012393 Mode :character Mode :character Mode :character
## Mean :15099792
## 3rd Qu.:20836410
## Max. :24448557
## NA's :1
summary(country_codes)
## continent country country_code_2_letter
## Length:258 Length:258 Length:258
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## country_code_3_letter country_number
## Length:258 Min. : 4.0
## Class :character 1st Qu.:215.0
## Mode :character Median :432.0
## Mean :432.3
## 3rd Qu.:645.2
## Max. :894.0
# You could check more specifically, e.g., whether popularity scores are between 0 and 100 (based on the data description)
summary(streams$popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 25.00 50.00 50.43 75.00 100.00 125553
# 4. Check string formatting
library(dplyr) # for the pipe (%>%) and filter()

# Find song names with non-ASCII characters
non_ascii_songs <- songs %>%
  filter(!grepl("^[ -~]+$", song_name))
print(non_ascii_songs)
## # A tibble: 10,436 × 4
## song_id song_name artists genres
## <dbl> <chr> <chr> <chr>
## 1  9920215 有一種悲傷 - 電影《比悲傷更悲傷的故事》主題曲 A-Lin mando…
## 2 10146959 百年樹木 Hins Cheung c-pop…
## 3 10136753 龍舌蘭 Eason Chan; ea… c-pop…
## 4 10011572 (一個男人) 一個女人 和浴室 Louis Koo; Kay… c-pop…
## 5 10136746 漸漸 Eason Chan; ea… c-pop…
## 6 10124408 ่ช็ๅฆๅ Jason Chan c-pop…
## 7 10136664 ๆๅ่ฌๆญฒ Eason Chan; ea… c-pop…
## 8 10064688 永遠飛行模式 Dear Jane c-pop…
## 9  9772540 未來見 RubberBand c-pop
## 10 10137325 勇悍・17 Juno Mak c-pop…
## # ℹ 10,426 more rows
# 5. Common columns
common_cols1 <- intersect(names(streams), names(songs))
print(common_cols1) # song_id
## [1] "song_id"
## [1] "country"
In this section, we want to prepare the final dataset by merging the
three datasets. The "join" functions from dplyr
come in
handy for this.
Examine the effect of picking a left_join
,
right_join
, and inner_join
, and how that
affects the completeness of your data. Strive for a final data frame
that captures as much of the original data as possible.
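To see what is at stake, here is a minimal base R sketch on made-up toy data (`dplyr`'s `left_join()`, `right_join()`, and `inner_join()` behave analogously to `merge()` with the flags below):

```r
# Two tiny data frames that only partially overlap on song_id
left  <- data.frame(song_id = c(1, 2), streams = c(10, 20))
right <- data.frame(song_id = c(2, 3), genre = c("pop", "rap"))

merge(left, right, by = "song_id")                # inner join: 1 row (only song_id 2 matches)
merge(left, right, by = "song_id", all.x = TRUE)  # left join: 2 rows, genre is NA for song_id 1
merge(left, right, by = "song_id", all.y = TRUE)  # right join: 2 rows, streams is NA for song_id 3
```

An inner join silently drops non-matching rows on both sides, which is why checking row counts before and after each merge is a good habit.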
Export the result as `merged_df.csv` after merging all 3 datasets.

Tips: when two datasets share column names beyond the join key (e.g., both `streams.csv` and `country_codes.csv` contain country information), you may consider adding suffixes.

df <- streams %>%
inner_join(songs, by = "song_id") %>%
left_join(country_codes, by = c("country" = "country_code_2_letter"), suffix = c("_streams", "_cc"))
# Since countries in Oceania are missing in country_codes.csv we use a left join here (otherwise you'd miss out on +/- 139K records)
write_csv(df, "merged_df.csv")
A track can appear in multiple top 200 lists across countries (e.g.,
both in the Netherlands and in Belgium) for sustained periods of time
(e.g., yesterday and today). For that reason, a song_id
may
occur repeatedly within the data frame. Yet it can also happen that
exactly the same record appears more than once. For example, the scraper
that collected the data in the first place might have run twice which
duplicated some of the records, or we accidentally made a copy-paste
mistake ourselves. Check whether such duplicates are present in
df
and if so remove them. Export the result as
df_no_duplicates.csv
.
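As a minimal sketch on made-up toy data, base R's `duplicated()` flags rows that are exact copies of an earlier row (`dplyr`'s `distinct()` achieves the same within a pipe):

```r
# Toy data with one exact duplicate row (the first two rows are identical)
toy <- data.frame(song_id = c(1, 1, 2),
                  date    = c("2020-01-01", "2020-01-01", "2020-01-01"),
                  streams = c(10, 10, 5))

sum(duplicated(toy))                    # number of exact duplicate rows: 1
toy_no_duplicates <- toy[!duplicated(toy), ]
nrow(toy_no_duplicates)                 # 2 rows remain
```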
## Rows: 278724 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): country, song_name, artists, genres, continent, country_cc, countr...
## dbl (4): song_id, popularity, streams, country_number
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In roughly half of the cases, the popularity
column is
missing which turns out to be troublesome once you move towards a model
building phase. Therefore, we ask you to impute these missing values by
the median popularity score of the country on a given date. Here's a
simple example to illustrate:
song_name | country | date | popularity | popularity_filled |
---|---|---|---|---|
A | United Kingdom | 01-01-2020 | 91 | 91 |
B | United Kingdom | 01-01-2020 | 66 | 66 |
C | United Kingdom | 01-01-2020 | NA | 66 |
D | United Kingdom | 01-01-2020 | 62 | 62 |
E | United Kingdom | 02-01-2020 | 79 | 79 |
A | Netherlands | 01-01-2020 | 80 | 80 |
The popularity_filled
column has been added which takes
on the value of the popularity
column if present and
otherwise missing data (NA
) are replaced with their imputed
values. In this example, the popularity of song C
in the
United Kingdom has been computed as the median of
{62, 66, 91} = 66
. Note that it does not take into account
song E
in the United Kingdom (different date) and song
A
in the Netherlands (different country).
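On toy data, this imputation logic looks as follows (a base R sketch with made-up values; the actual exercise additionally groups by country and date):

```r
# popularity scores for a single country-date combination, with one missing value
popularity <- c(91, 66, NA, 62)

imputation_value <- median(popularity, na.rm = TRUE)   # median of {62, 66, 91} = 66
popularity_filled <- ifelse(is.na(popularity), imputation_value, popularity)
popularity_filled                                      # 91 66 66 62
```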
Implement this data imputation strategy using the
filter
, group_by
, and summarise
functions from the dplyr
package and export the result as
df_imputed.csv
. We recommend first computing the imputation
values (for each country across the days), and then replacing missing
values in the popularity
column with their matching
counterparts. Evaluate your results in terms of the pros and cons of
this approach and try to come up with an alternative imputation strategy
(incl. arguments why you prefer that approach).
## Rows: 278724 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): country, song_name, artists, genres, continent, country_cc, countr...
## dbl (4): song_id, popularity, streams, country_number
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# transformation
imputation_values <- df %>% filter(!is.na(popularity)) %>%
group_by(country, date) %>%
summarise(popularity_filled = median(popularity))
## `summarise()` has grouped output by 'country'. You can override using the
## `.groups` argument.
df_joined <- df %>% left_join(imputation_values, by = c("country" = "country", "date" = "date"))
df <- df_joined %>% mutate(popularity_filled = ifelse(is.na(popularity), popularity_filled, popularity))
# The main disadvantage is that missing records will always be assigned a median popularity score (so never abnormally low or high). Given the sheer amount of missing records, this would highly distort the distribution of the scores.
# Since the popularity score is country independent and it is either present for all records or not available for any country (or time period), we should find a way to group songs together. For example, if the artist has other songs for which the popularity score is known we could take the median of those scores, or we could look for songs with similar genre tag words.
# output
write_csv(df, "df_imputed.csv")
More likely than not, you have come across the lm()
(i.e., linear model) function once before during your studies. Then, you
also know that a regression model can only interpret numeric data.
Categorical fields are typically converted into so-called dummy
variables that take on either the value 0
or
1
. In a similar vein, dates are often converted into trend
variables as input for a time-series model. For example, in our dataset
January 1st, 2020 becomes 1 and December 15th, 2020 (the
last entry of df
) becomes 350. Follow the steps below to
create this trend variable for the date
column of
df
:
1. Use the `seq` function to create a sequence of dates (see examples at the bottom of the page).
2. Create a data frame `dates` with two columns: `date` (i.e., the sequence of dates) and `counter` (i.e., a numeric value between 1 and 350). Tip: you may first need to convert the `dates` column into character format (`as.character(date)`).
3. Merge `df` and `dates` and inspect your results (which type of join do you use?)

Even though the data frame represents the Spotify top 200 charts,
none of the columns indicates the rank of a given track. We can easily
derive it from the number of streams, the date
, and the
country
column. The built-in R function ranks
returns a vector of ranks for a given input. By default, the lowest
digit gets the highest rank (e.g., 1 becomes rank 1), you can reverse
the order by adding a minus sign (-
) in front of the input
vector:
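For example, an input vector like the one below (values chosen purely for illustration) reproduces the two output lines shown:

```r
x <- c(40, 50, 30, 20, 10)
rank(x)    # lowest value gets rank 1
rank(-x)   # minus sign reverses the order: highest value gets rank 1
```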
## [1] 4 5 3 2 1
## [1] 2 1 3 4 5
Add a ranking
column to df
using the
dplyr
package and export the result as
df_ranking.csv
. Although it is unlikely that some tracks
have exactly the same number of streams, think about how you will deal
with these edge cases (tip: run ?rank
to look up the
documentation). Does your ranking always run from 1 to 200? Why is that?
Investigate the root cause and fix it!
# input
df <- read_csv("df_imputed.csv")
# transformation
df <- df %>% group_by(country, date) %>%
mutate(ranking = rank(-streams, ties.method = "random"))
# The maximum ranking is 400 (rather than 200)
# Once we inspect one of the listings of df[df$ranking >= 200, ] we find that these records typically contain redundant rows. For example, for df[(df$date == "2020-05-17") & (df$country == "HK"),] we discover that ranking 59 appears 17 times (!), which can be attributed to missing values in `song_id`
# remove rows without song_id (if you then recreate the ranking column you'll come to the conclusion the maximum ranking has become 200!)
df <- df %>% filter(!is.na(song_id))
# output
write_csv(df, "df_ranking.csv")
Most tracks are assigned to more than one genre. For example, the
famous song "Señorita" by Shawn Mendes and Camila Cabello is categorized
as pop
, but also as canadian pop
,
post-teen pop
, viral pop
, and
dance pop
. The data frame stores these records as a list of
comma-separated genres. On a high level, we'd like to move from this
state:
song_name | genres |
---|---|
Señorita | pop, canadian pop, post-teen pop, viral pop, dance pop |
To a format in which the columns represent genre tag words and the
table values indicate which genre(s) belong to which
song_name
:
song_name | k-pop | pop | canadian pop | post-teen pop | viral pop | dance pop |
---|---|---|---|---|---|---|
Señorita | 0 | 1 | 1 | 1 | 1 | 1 |
Memories | 0 | 1 | 0 | 0 | 0 | 0 |
Psycho | 1 | 0 | 0 | 0 | 0 | 0 |
To this end, we first need to separate the genres
column
of df
into individual genres. We can use the built-in R
function strsplit
for that in the following way (note the
double brackets [[]]
around 1
!):
genres_example <- "pop, canadian pop, post-teen pop, viral pop, dance pop"
genres_example_split <- strsplit(genres_example, ", ")
genres_example_split[[1]]
## [1] "pop" "canadian pop" "post-teen pop" "viral pop"
## [5] "dance pop"
After some data wrangling we can transform the data into the following format:
song_id | song_genre |
---|---|
13594414 | pop |
13594414 | canadian pop |
13594414 | post-teen pop |
13594414 | viral pop |
13594414 | dance pop |
13977556 | pop |
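One possible route to this long format is sketched below in base R (the toy data mirrors the table above; `tidyr`'s `separate_rows()` offers a one-line alternative):

```r
# Toy data: one row per song, genres stored as a comma-separated string
songs_toy <- data.frame(
  song_id = c(13594414, 13977556),
  genres  = c("pop,canadian pop,post-teen pop,viral pop,dance pop", "pop")
)

# Split each genre string into a character vector, then repeat each
# song_id once per genre to obtain the long song_id/song_genre format
genre_list <- strsplit(songs_toy$genres, ",")
song_genre <- data.frame(
  song_id    = rep(songs_toy$song_id, lengths(genre_list)),
  song_genre = unlist(genre_list)
)
song_genre   # six rows: five genres for the first song, one for the second
```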
As a last step, we use the pivot_longer()
and
pivot_wider()
functions to change the format of our data
frame (see, e.g., this
overview or these code
snippets). We then store the results in
df_reshaped.csv
.
Following these steps, it's your turn to create a data set, with (a) rows being the dates, (b) columns being the song IDs, and (c) the cells indicating a song's sum of streams.
# load input
library(tidyr) # pivot_wider() and pivot_longer() live in the tidyr package
df <- read_csv("df_ranking.csv")

song_popularity <- df %>% pivot_wider(id_cols = date, names_from = 'song_id', values_from = 'streams', values_fn = sum)
# let us look at the columns (--> these are song IDs!)
colnames(song_popularity)
# let us look at the rows (--> these are dates!)
song_popularity$date[1:10]
# output
write_csv(song_popularity, "df_reshaped.csv")
Congratulations! 🎉 You have successfully navigated the key steps of data preparation and gained hands-on experience with essential data wrangling techniques in R. Here are the key takeaways from this tutorial:
✅ Data Inspection & Cleaning
✅ Data Merging & Joins
✅ Handling Missing Values & Data Imputation
✅ Feature Engineering & Transformations
✅ Reshaping Data for Analysis
Great job completing this tutorial! Keep practicing!
For technical issues outside of scheduled classes, please check the support section on the course website.