Engineering data sets (in-class tutorial)

Hannes Datta

Before we get started

  • Check exam dates on the course page
    • Thursday, 4 April (main sit)
    • Friday, 28 June (resit)
  • GitHub way of working
    • I commented on/added issues to your team repositories!
  • Structure
    • Back to R, but: maintain way of working using Git/GitHub

Structure of this tutorial

Get your cheat sheets ready!!!

  • After class
    • Please go through the entire tutorial on the dprep site.

Why this tutorial

  • Final data set rarely available
  • Requires wrangling/transformations (can you name a few examples? there are many!)
  • In projects, structure your scripts in four blocks – required to automate the workflow later (see the skeleton after this list)
    • Setup/prerequisites/loading packages (library())
    • Input (e.g., read_csv()) - always relative filenames!
    • Transformation (i.e., operations on your data)
    • Output (e.g., write_csv() – also w/ relative filenames)
  • Zoom in on ways to transform (while respecting setup-I-T-O building blocks)
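
A minimal skeleton of such a four-block script could look like this (the file names and the transformation step are only placeholders, not part of the assignment):

# Setup: load packages
library(tidyverse)

# Input: read raw data (relative filenames!)
streams <- read_csv('streams.csv')

# Transformation: operations on the data (placeholder example)
streams <- streams %>% filter(!is.na(streams))

# Output: write results (again, relative filenames)
write_csv(streams, 'streams_clean.csv')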

Dos and don'ts

A few observations from your team projects.

  • It is absolutely important to work in local repositories, with issue-based branches and pull requests.
  • Code files need to be structured as setup-ITO – required for automation!
  • Choose the appropriate file format: R Markdown or Quarto for reports/PDFs/HTML, .R for regular scripts
  • Adhere to tidyverse functions (e.g., read_csv(), not read.csv())
  • Incorporate programming concepts: looping, using functions

DO: Let's get today's data and start exploring

Run this snippet to download today's data (it's zipped - contains a couple of .csv files)

library(tidyverse)
download.file('https://github.com/hannesdatta/course-dprep/raw/master/content/docs/modules/week4/tutorial/data_without_duplicates.zip', 'data.zip')
unzip('data.zip')

We split the group into three parts – focusing on streams.csv, songs.csv, and country_codes.csv, respectively.

  • Load the data set using read_csv()
  • Then, use the head(), View(), summary(), and dim() commands and try to describe the data set
  • For categorical columns, use the table() command to get frequencies (a small exploration sketch follows below)
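
A minimal exploration sketch for one of the files (here streams.csv; the same commands apply to the other two):

library(tidyverse)
streams <- read_csv('streams.csv')  # load the data
head(streams)                       # first few rows
dim(streams)                        # number of rows and columns
summary(streams)                    # per-column summary statistics
table(streams$country)              # frequencies of a categorical column
# View(streams)                     # opens a spreadsheet-like viewer in RStudio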

Common questions we can "ask" of a dataset (I)

  • What are the columns in the data?
  • Do values in these columns seem complete? Do they all make sense? Are they properly encoded?
  • What are common columns across (multiple) files? (potential for merging!)
  • What uniquely defines a row?


    Does not require “additional” preparation of your data.

Common questions we can "ask" of a dataset (II)

  • Useful information can be obtained if we make use of extra wrangling…
    • We covered the tidyverse in our R Bootcamp at the beginning of the course (remember?)
    • Load it with library(tidyverse) – this loads a whole collection of packages
    • It has so-called “verbs” – commands that execute data wrangling operations
    • You can combine these verbs using %>% (the pipe operator)
  • Can you name a few of these verbs? What do they do? (a small example follows below)
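
For example – assuming the streams data we load on the next slide (with columns country and streams) – a few verbs chained with the pipe could look like this:

streams %>%
  filter(country == 'NL') %>%              # keep only Dutch observations
  arrange(desc(streams)) %>%               # sort by streams, descending
  mutate(log_streams = log(1 + streams))   # add a transformed column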

Do: Better questions using tidyverse

  • Let's first load all datasets
library(tidyverse)
streams <- read_csv('streams.csv')
songs <- read_csv('songs.csv')
countries <- read_delim('country_codes.csv', ';')

Now, use the tidyverse verbs filter(), arrange(), and mutate() to answer the following questions.

1) How many obs. do we have for Belgium (country == 'BE')?

2) Which song ID has most streams (in which country, and when)?

3) What's the name of that song?

4) Add a new column to streams, showing the approx. revenue for each song/day/country (tip: Spotify pays about 0.0038 EUR per stream)

Answers to previous questions (I)

  • Let's first load all datasets
library(tidyverse)
streams <- read_csv('streams.csv')
songs <- read_csv('songs.csv')
countries <- read_delim('country_codes.csv', ';')

We then use the tidyverse verbs filter(), arrange(), and mutate() to answer the questions.

Answers to previous questions (II)

1) How many observations do we have for Belgium (country == 'BE')?

streams %>% filter(country=="BE") %>% count()
# A tibble: 1 × 1
      n
  <int>
1 69900

2) Which song ID has most streams in the data (in which country, and when)?

streams %>% arrange(desc(streams))
# A tibble: 278,724 × 5
    song_id date       country popularity streams
      <dbl> <date>     <chr>        <dbl>   <dbl>
 1 21831898 2020-07-24 US              NA 4175284
 2 21831405 2020-07-24 US              NA 3937822
 3 21831540 2020-07-24 US              99 3829002
 4 17274280 2020-01-17 US              NA 3781452
 5 24344394 2020-12-11 US             100 3645354
 6 17274280 2020-01-14 US              NA 3644167
 7 17274280 2020-01-18 US             100 3636720
 8 17274280 2020-01-15 US             100 3624169
 9 17274280 2020-01-16 US              NA 3577849
10 17274280 2020-01-13 US             100 3483960
# ℹ 278,714 more rows

Answers to previous questions (III)

3) What's the name of that song?

songs %>% filter(song_id==21831898)
# A tibble: 1 × 4
   song_id song_name artists      genres                     
     <dbl> <chr>     <chr>        <chr>                      
1 21831898 the 1     Taylor Swift dance pop,pop,post-teen pop

4) Add a new column to streams, showing the approx. revenue for each song/day/country (tip: Spotify pays about 0.0038 EUR per stream)

streams$revenue <- 0.0038 * streams$streams # the old way

streams <- streams %>% mutate(revenue = streams * 0.0038) # better

Grouping and summarizing

  • Grouping and summarizing (also called aggregation) is another powerful way to generate extra stats
streams %>% group_by(country) %>% summarise(N=n())
# A tibble: 4 × 2
  country     N
  <chr>   <int>
1 BE      69900
2 DE      69499
3 NL      69800
4 US      69525
  • Alternatively… (in other words: sooo many functions that make code “look” better)
streams %>% count(country)

Do: Using grouping and summaries

Let's keep on generating stats, this time using group_by() and summarise() (plus verbs we have seen earlier)

1) Create a new dataset, called streams2, summing up the streams by song.

2) Create a new dataset, called artists, showing the count of songs by artist.

3) Look at artists, and spit out the row that lists the number of songs for Ellie Goulding (tip: use ==)

4) Refine your search from (3) to include also collaborations (tip: grepl()!)

Answers to previous questions (I)

Let's keep on generating stats, this time using group_by() and summarise() (plus verbs we have seen earlier)

1) Create a new dataset, called streams2, summing up the streams by song.

streams2 <- streams %>% group_by(song_id) %>% summarise(totstreams = sum(streams))

2) Create a new dataset, called artists, showing the count of songs by artist.

artists <- songs %>% count(artists)

Answers to previous questions (II)

3) Look at artists, and spit out the row that lists the number of songs for Ellie Goulding (tip: use ==)

songs %>% count(artists) %>% filter(artists=='Ellie Goulding')
# A tibble: 1 × 2
  artists            n
  <chr>          <int>
1 Ellie Goulding     7

4) Refine your search from (3) to include also collaborations (tip: grepl()!)

songs %>% count(artists) %>% filter(grepl('Ellie Goulding', artists, ignore.case=T))
# A tibble: 9 × 2
  artists                               n
  <chr>                             <int>
1 Calvin Harris; Ellie Goulding         1
2 Ellie Goulding                        7
3 Ellie Goulding; Diplo; Red Velvet     1
4 Ellie Goulding; Diplo; Swae Lee       1
5 Ellie Goulding; Juice WRLD            2
6 Ellie Goulding; Juice WRLD; R3HAB     1
7 Ellie Goulding; Lauv                  2
8 Ellie Goulding; Lauv; Syn Cole        1
9 Ellie Goulding; blackbear             1

Merging/joining

  • Great job so far!
  • But, hasn't it been cumbersome to work w/ three separate datasets?
  • We can join them
    • Why join / why not join? Mostly a memory issue – think about it wisely.
    • Syntax covered in “Joining data w/ dplyr” on Datacamp
  • Recap
    • inner_join() – keeps only observations that occur in both tables (risk of losing observations without noticing!)
    • my favorite is left_join() (“keep everything in the left table, add the right table where you can”)
    • think about the order of your merge (see the sketch after this list)
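
A sketch of what such a join could look like, assuming song_id is the common column (always check the actual column names in your data!):

streams %>%
  left_join(songs, by = 'song_id')    # keep all stream records, add song info where a match exists
# inner_join(songs, by = 'song_id') would instead drop stream records without a matching song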

Do: Let's join!

  1. First, join streams (left) with songs (right)
  2. Then, join countries to the result from (1).
  3. Turn your entire code into a self-contained module, w/ four blocks (a possible sketch follows after this list):
    • loading libraries,
    • input of raw data
    • transformation
    • output (here: save as streams_merged.csv)
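
One possible sketch of such a module (the join column for countries is an assumption – check the actual column names in country_codes.csv):

# Setup
library(tidyverse)

# Input
streams <- read_csv('streams.csv')
songs <- read_csv('songs.csv')
countries <- read_delim('country_codes.csv', ';')

# Transformation
streams_merged <- streams %>%
  left_join(songs, by = 'song_id') %>%
  left_join(countries, by = 'country')  # adjust 'by' if the column is named differently

# Output
write_csv(streams_merged, 'streams_merged.csv')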

Dealing with missing values

  • Many reasons for values to be missing - they could be actual 0s even!
  • In our example, we would like to replace values with the median value of popularity in a given country at a given date.
  • Like with many things in data preparation, we proceed in steps.
    Can you come up with a step-wise procedure to achieve this?

Do: Impute some values

  1. First, create a new data set with the median popularity by country and date (group_by() and summarize()!)
  2. Merge this data to streams_merged (left_join())
  3. Add a new column, called popularity_filled (mutate())
  4. Wrap it into an ITO building block :) (a sketch of steps 1–3 follows below)
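
A possible sketch of steps 1–3 (the names medians, median_popularity, and popularity_filled are only illustrative):

# 1. Median popularity by country and date
medians <- streams_merged %>%
  group_by(country, date) %>%
  summarise(median_popularity = median(popularity, na.rm = TRUE), .groups = 'drop')

# 2. Merge the medians back in
streams_merged <- streams_merged %>%
  left_join(medians, by = c('country', 'date'))

# 3. Fill missing values with the median
streams_merged <- streams_merged %>%
  mutate(popularity_filled = ifelse(is.na(popularity), median_popularity, popularity))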

Feature engineering

  • Fancy term for “variable operationalization”
  • Simply means: measure “stuff”
  • Can you come up with a few new variables that characterize music listening at the country/date level? At the country level?
  • Invest time in thinking about which variables you want to create - you're in full control!
  • At this stage, also think about external data sets you may want to link (e.g., holidays at the country/day level); a small sketch of country/date-level features follows below
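
A small sketch of country/date-level features, assuming the merged data from before (the variable names are only suggestions):

country_date <- streams_merged %>%
  group_by(country, date) %>%
  summarise(total_streams = sum(streams),                    # total streams per country/day
            n_songs = n_distinct(song_id),                   # number of distinct songs streamed
            top_song_share = max(streams) / sum(streams),    # concentration: share of the most-streamed song
            .groups = 'drop')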

Final data operation for today...!

  • Run the code snippet below, which generates some example data (in wide format).
  • Describe the structure of the data.
df <- songs %>% top_n(10)  # take a small example subset of songs (top_n() selects by the last column when none is given)
genres <- unique(unlist(strsplit(df$genres, ',')))  # all unique genres occurring in that subset
for (genre in genres) df[[paste0('genre_', genre)]] <- as.numeric(grepl(genre, df$genres))  # one 0/1 column per genre
wide <- df %>% select(-genres)  # drop the original genres column: this is the wide format
  • Now, let's convert it to a long format.
long <- wide %>% pivot_longer(cols = starts_with('genre'))  # one row per song/genre column
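
To go back from long to wide (the reverse operation), a sketch could look like this – it relies on the remaining columns uniquely identifying each song:

wide_again <- long %>% pivot_wider(names_from = name, values_from = value)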

Summary of today's tutorial

  • Work in “ITO building blocks” (+ loading libraries, of course!)
  • Wrangling is necessary to get data into shape and generate meaning/understanding
  • Pivoting (from long to wide, and wide to long) is difficult… it's enough if you get the concept now - you will have to fiddle around with it a bit anyway
  • Have your cheat sheets available always – especially the dplyr one is useful (and very visual!)

Next steps (I)

  • This tutorial

    • selected solutions available in the .Rpres file (available on GitHub)
    • Work through the tutorial and complete exercises (all solutions available)
  • We will continue with our coaching session

Coaching session (I)

  • Watch out for stuff from previous weeks
    • .Rmd exploration done?
    • adhering to GitHub workflows when collaborating?
    • getting the project board into shape & becoming proficient in “the new way of working”
    • adhering to setup-ITO structure

Coaching session (II)

  • Continue working on projects
  • The target today is to make progress on section 2.2 of the grading guidelines
    • variety of common data operations (involving dplyr, tidyverse, etc.): merging, aggregating, de-duplication, reshaping, converting dates, regular expressions, etc.
    • basic programming concepts: looping, vectorization, writing functions, handling errors/bugs
    • feature engineering/variable operationalization