Engineering data sets (in-class tutorial)

Hannes Datta

Before we get started (I)

A few observations from your team projects.

  • It's absolutely important to work in local repositories, with issue-based branches and pull requests.
  • Code files need to follow the setup-ITO structure (setup, Input, Transformation, Output) – required for automation!
  • Choose the appropriate file format: R Markdown or Quarto for reports (PDF/HTML), .R for regular scripts
  • Prefer tidyverse functions (e.g., read_csv(), not read.csv())
  • Incorporate programming concepts: loops and functions

Before we get started (II)

  • We asked the scheduling department to move next week's tutorials to the computer lab
  • There was a request to host a “catch-up” session: let's discuss topics and dates.
  • I printed cheat sheets!

Structure of this tutorial

Get your cheat sheets ready!!!

  • After class
    • Please go through the entire tutorial on the dprep site.

Why this tutorial

  • The final data set is rarely available as-is
  • Requires wrangling/transformations (can you name a few examples? there are many!)
  • In projects, structure scripts in four blocks – required to automate the workflow later (see the skeleton after this list)
    • Setup/prerequisites/loading packages (library())
    • Input (e.g., read_csv()) - always relative filenames!
    • Transformation (i.e., operations on your data)
    • Output (e.g., write_csv() – also w/ relative filenames)
  • Today, we zoom in on ways to transform (while respecting the setup-ITO building blocks)
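
A minimal skeleton of such a script (a sketch – file names and the transformation are placeholders):

# Setup: load packages
library(tidyverse)

# Input: read raw data (relative filenames!)
df <- read_csv('raw_data.csv')    # placeholder filename

# Transformation: operations on the data
df <- df %>% distinct()           # placeholder operation: drop duplicate rows

# Output: write results (relative filenames again)
write_csv(df, 'clean_data.csv')   # placeholder filename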

DO: Let's get today's data and start exploring

Run this snippet to download today's data (it's zipped and contains a couple of .csv files):

library(tidyverse)   # not strictly needed for the download itself, but we'll use it throughout

# download the zipped data and extract the .csv files into the working directory
download.file('https://github.com/hannesdatta/course-dprep/raw/master/content/docs/modules/week4/tutorial/data_without_duplicates.zip', 'data.zip')
unzip('data.zip')

We split the group into three parts – focusing on streams.csv, songs.csv, and country_codes.csv, respectively.

  • Load the data set using read_csv()
  • Then, use the head(), View(), summary(), dim() commands and try to describe the data set
  • For categorical columns, use the table() command to get frequencies
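
For example, for streams.csv (a sketch, assuming the zip extracted into your working directory):

streams <- read_csv('streams.csv')
head(streams)            # first six rows
dim(streams)             # number of rows and columns
summary(streams)         # per-column summary statistics
table(streams$country)   # frequency table of a categorical column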

Common questions we can "ask" to a dataset (I)

  • What are the columns in the data?
  • Do values in these columns seem complete? Do they all make sense? Are they properly encoded?
  • Which columns are shared across (multiple) files (potential for merging!)?
  • What uniquely defines a row? (see the check below)


    Answering these does not require “additional” preparation of your data.
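
For instance, to check whether song_id, date, and country uniquely define a row in streams:

sum(duplicated(streams[, c('song_id', 'date', 'country')]))   # 0 means: yes, they do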

Common questions we can "ask" to a dataset (II)

  • Useful information can be obtained if we make use of extra wrangling…
    • We covered the tidyverse in our R bootcamp at the beginning of the course (remember?)
    • loading it: library(tidyverse) – loads a whole family of packages
    • it has so-called “verbs” – commands that execute data wrangling operations
    • you can combine these verbs using %>% (the pipe operator)
  • Can you name a few of these verbs? What do they do?
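
For example, two verbs chained with the pipe (a sketch using today's streams data):

streams %>%
  filter(country == 'NL') %>%    # filter(): keep rows that match a condition
  arrange(desc(streams)) %>%     # arrange(): sort rows (here: most streams first)
  head()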

Do: Better questions using tidyverse

  • Let's first load all datasets
library(tidyverse)
streams <- read_csv('streams.csv')
songs <- read_csv('songs.csv')
countries <- read_delim('country_codes.csv', ';')

Now, use the tidyverse verbs filter(), arrange(), and mutate() to answer the following questions.

1) How many obs. do we have for Belgium (country == 'BE')?

2) Which song ID has the most streams (in which country, and when)?

3) What's the name of that song?

4) Add a new column to streams, showing the approx. revenue for each song/day/country (tip: Spotify pays about 0.0038 EUR per stream)

Answers to previous questions (I)

  • As a reminder, here's how we loaded all datasets
library(tidyverse)
streams <- read_csv('streams.csv')
songs <- read_csv('songs.csv')
countries <- read_delim('country_codes.csv', ';')

We then used the tidyverse verbs filter(), arrange(), and mutate() to answer the questions.

Answers to previous questions (II)

1) How many observations do we have for Belgium (country == 'BE')?

streams %>% filter(country=="BE") %>% count()
# A tibble: 1 × 1
      n
  <int>
1 69900

2) Which song ID has the most streams in the data (in which country, and when)?

streams %>% arrange(desc(streams))
# A tibble: 278,724 × 5
    song_id date       country popularity streams
      <dbl> <date>     <chr>        <dbl>   <dbl>
 1 21831898 2020-07-24 US              NA 4175284
 2 21831405 2020-07-24 US              NA 3937822
 3 21831540 2020-07-24 US              99 3829002
 4 17274280 2020-01-17 US              NA 3781452
 5 24344394 2020-12-11 US             100 3645354
 6 17274280 2020-01-14 US              NA 3644167
 7 17274280 2020-01-18 US             100 3636720
 8 17274280 2020-01-15 US             100 3624169
 9 17274280 2020-01-16 US              NA 3577849
10 17274280 2020-01-13 US             100 3483960
# ℹ 278,714 more rows

Answers to previous questions (III)

3) What's the name of that song?

songs %>% filter(song_id==21831898)
# A tibble: 1 × 4
   song_id song_name artists      genres                     
     <dbl> <chr>     <chr>        <chr>                      
1 21831898 the 1     Taylor Swift dance pop,pop,post-teen pop

4) Add a new column to streams, showing the approx. revenue for each song/day/country (tip: Spotify pays about 0.0038 EUR per stream)

streams$revenue <- 0.0038 * streams$streams # the old way

streams <- streams %>% mutate(revenue = streams * 0.0038) # better

Grouping and summarizing

  • Grouping and summarizing (also called aggregation) is another powerful way to generate extra stats
streams %>% group_by(country) %>% summarise(N=n())
# A tibble: 4 × 2
  country     N
  <chr>   <int>
1 BE      69900
2 DE      69499
3 NL      69800
4 US      69525
  • Alternatively… (in other words: there are many convenience functions that make code “look” better)
streams %>% count(country)

Do: Using grouping and summaries

Let's keep on generating stats, this time using group_by() and summarise() (plus verbs we have seen earlier)

1) Create a new dataset, called streams2, summing up the streams by song.

2) Create a new dataset, called artists, showing the count of songs by artist.

3) Look at artists, and extract the row that lists the number of songs for Ellie Goulding (use ==)

4) Refine your search from (3) to include also collaborations (tip: grepl()!)

Answers to previous questions (I)

Let's keep on generating stats, this time using group_by() and summarise() (plus verbs we have seen earlier)

1) Create a new dataset, called streams2, summing up the streams by song.

streams2 <- streams %>% group_by(song_id) %>% summarise(totstreams = sum(streams))

2) Create a new dataset, called artists, showing the count of songs by artist.

artists <- songs %>% count(artists)

Answers to previous questions (II)

3) Look at artists, and extract the row that lists the number of songs for Ellie Goulding (use ==)

songs %>% count(artists) %>% filter(artists=='Ellie Goulding')
# A tibble: 1 × 2
  artists            n
  <chr>          <int>
1 Ellie Goulding     7

4) Refine your search from (3) to include also collaborations (tip: grepl()!)

songs %>% count(artists) %>% filter(grepl('Ellie Goulding', artists, ignore.case=TRUE))
# A tibble: 9 × 2
  artists                               n
  <chr>                             <int>
1 Calvin Harris; Ellie Goulding         1
2 Ellie Goulding                        7
3 Ellie Goulding; Diplo; Red Velvet     1
4 Ellie Goulding; Diplo; Swae Lee       1
5 Ellie Goulding; Juice WRLD            2
6 Ellie Goulding; Juice WRLD; R3HAB     1
7 Ellie Goulding; Lauv                  2
8 Ellie Goulding; Lauv; Syn Cole        1
9 Ellie Goulding; blackbear             1

Merging/joining

  • Great job so far!
  • But, hasn't it been cumbersome to work w/ three separate datasets?
  • We can join them
    • Why join / why not join? Mostly a memory issue (joined tables store redundant copies of information) – think about it wisely.
    • Syntax covered in “Joining data w/ dplyr” on Datacamp
  • Recap
    • inner_join() – keeps only observations that occur in both tables (you may silently lose observations!)
    • my favorite is left_join() (“keep everything in the left table, add the right table where you can”) – see the contrast below
    • think about the order of your merge
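
A minimal contrast (streams and songs share the song_id column):

inner_join(streams, songs, by = 'song_id')   # drops streams whose song_id is not in songs
left_join(streams, songs, by = 'song_id')    # keeps all streams; adds song info where available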

Do: Let's join!

  1. First, join streams (left) with songs (right)
  2. Then, join countries to the result from (1).
  3. Turn your entire code into a self-contained module, w/ four blocks (one possible sketch follows this list):
    • loading libraries,
    • input of raw data
    • transformation
    • output (here: save as streams_merged.csv)
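
One possible sketch (the join key for countries is an assumption – check the actual column names in country_codes.csv):

# Setup
library(tidyverse)

# Input
streams <- read_csv('streams.csv')
songs <- read_csv('songs.csv')
countries <- read_delim('country_codes.csv', ';')

# Transformation
streams_merged <- streams %>%
  left_join(songs, by = 'song_id') %>%
  left_join(countries, by = 'country')   # assumes countries has a column named 'country'

# Output
write_csv(streams_merged, 'streams_merged.csv')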

Dealing with missing values

  • Many reasons for values to be missing - they could be actual 0s even!
  • In our example, we would like to replace missing values of popularity with the median popularity in the same country on the same date.
  • Like with many things in data preparation, we proceed in steps.
    Can you come up with a step-wise procedure to achieve this?

Do: Impute some values

  1. First, create a new data set with the median popularity by country and date (group_by() and summarise()!)
  2. Merge this data to streams_merged (left_join())
  3. Add a new column, called popularity_filled (mutate())
  4. Wrap it into an ITO building block :) – a sketch follows below
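
A possible sketch (assuming streams_merged from the join exercise, with columns country, date, and popularity):

# Transformation: impute missing popularity with the country/date median
medians <- streams_merged %>%
  group_by(country, date) %>%
  summarise(median_pop = median(popularity, na.rm = TRUE), .groups = 'drop')

streams_merged <- streams_merged %>%
  left_join(medians, by = c('country', 'date')) %>%
  mutate(popularity_filled = ifelse(is.na(popularity), median_pop, popularity))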

Feature engineering

  • aka “variable operationalization”
  • Simply means: measure “stuff”
  • Can you come up with a few new variables that characterize music listening at the country/date level? At the country level? (an example follows this list)
  • Invest time in thinking about which variables you want to create - you're in full control!
  • At this stage, think also about external data sets you may want to link (e.g., holidays on a country/day)
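
For example, two country/date-level features (a sketch – which features to build is up to you):

features <- streams %>%
  group_by(country, date) %>%
  summarise(total_streams = sum(streams),    # total listening volume
            n_songs = n_distinct(song_id),   # variety: number of distinct songs streamed
            .groups = 'drop')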

Final data operation for today...!

  • Run the code snippet below, which generates some example data (in wide format).
  • Describe the structure of the data.
df = songs %>% top_n(10)  # grab a small subset of songs
genres = unique(unlist(strsplit(df$genres, ',')))  # all genres occurring in the subset
for (genre in genres) df[[paste0('genre_', genre)]] = as.numeric(grepl(genre, df$genres))  # one dummy column per genre
wide = df %>% select(-genres)  # drop the original genres column
  • Now, let's convert it to a long format.
long = wide %>% pivot_longer(cols = starts_with('genre'))
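
And back again, from long to wide (name and value are pivot_longer()'s default column names):

wide_again = long %>% pivot_wider(names_from = name, values_from = value)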

Summary of today's tutorial

  • Work in “ITO building blocks” (+ loading libraries, of course!)
  • Wrangling is necessary to get data into shape and generate meaning/understanding
  • Pivoting (from long to wide, and wide to long) is difficult… it's enough if you get the concept now – you will have to fiddle around with it a bit anyway
  • Have your cheat sheets available always – especially the dplyr one is useful (and very visual!)

Next steps

  • Complete exercises in this tutorial

  • You will continue with the coaching session