Hannes Datta
A few observations from your team projects.
.R
for regular scriptstidyverse
functions (e.g., read_csv()
, not read.csv()
)Get your cheat sheets ready!!!
library()
)read_csv()
) - always relative filenames!write_csv()
– also w/ relative filenames)Run this snippet to download today's data (it's zipped - contains a couple of .csv
files)
library(tidyverse)
download.file('https://github.com/hannesdatta/course-dprep/raw/master/content/docs/modules/week4/tutorial/data_without_duplicates.zip', 'data.zip')
unzip('data.zip')
We split this group in three parts – focusing on streams.csv
, songs.csv
, and country_codes.csv
.
read_csv()
head()
, View()
, summary()
, dim()
commands and try to describe the data settable()
command for frequenciestidyverse
in our R Bootcamp at the beginning of the class (remember?)library(tidyverse)
– will cover many packages%>%
(the piping character)library(tidyverse)
streams <- read_csv('streams.csv')
songs <- read_csv('songs.csv')
countries <- read_delim('country_codes.csv', ';')
Now, use the tidyverse
verbs filter()
, arrange()
, and mutate()
to answer the following questions.
1) How many obs. do we have for Belgium (country == 'BE')?
2) Which song ID has most streams (in which country, and when)?
3) What's the name of that song?
4) Add a new column to streams
, showing the approx. revenue for each song/day/country (tip: Spotify pays about 0.0038
EUR per stream)
library(tidyverse)
streams <- read_csv('streams.csv')
songs <- read_csv('songs.csv')
countries <- read_delim('country_codes.csv', ';')
Now, use the tidyverse
verbs filter()
, arrange()
, and mutate()
to answer the following questions.
1) How many observations do we have for Belgium (country == 'BE')?
streams %>% filter(country=="BE") %>% count()
# A tibble: 1 × 1
n
<int>
1 69900
2) Which song ID has most streams in the data (in which country, and when)?
streams %>% arrange(desc(streams))
# A tibble: 278,724 × 5
song_id date country popularity streams
<dbl> <date> <chr> <dbl> <dbl>
1 21831898 2020-07-24 US NA 4175284
2 21831405 2020-07-24 US NA 3937822
3 21831540 2020-07-24 US 99 3829002
4 17274280 2020-01-17 US NA 3781452
5 24344394 2020-12-11 US 100 3645354
6 17274280 2020-01-14 US NA 3644167
7 17274280 2020-01-18 US 100 3636720
8 17274280 2020-01-15 US 100 3624169
9 17274280 2020-01-16 US NA 3577849
10 17274280 2020-01-13 US 100 3483960
# ℹ 278,714 more rows
3) What's the name of that song?
songs %>% filter(song_id==21831898)
# A tibble: 1 × 4
song_id song_name artists genres
<dbl> <chr> <chr> <chr>
1 21831898 the 1 Taylor Swift dance pop,pop,post-teen pop
4) Add a new column to streams
, showing the approx. revenue for each song/day/country (tip: Spotify pays about 0.0038
EUR per stream)
streams$revenue <- 0.0038 * streams$streams # the old way
streams <- streams %>% mutate(revenue = streams * 0.0038) # better
streams %>% group_by(country) %>% summarise(N=n())
# A tibble: 4 × 2
country N
<chr> <int>
1 BE 69900
2 DE 69499
3 NL 69800
4 US 69525
streams %>% count(country)
Let's keep on generating stats, this time using group_by()
and summarise()
(plus verbs we have seen earlier)
1) Create a new dataset, called streams2
, summing up the streams by song.
2) Create a new dataset, called artists
, showing the count of songs by artist.
3) Look at artists
, and spit out the row that lists the number of songs for Ellie Goulding (==
)
4) Refine your search from (3) to include also collaborations (tip: grepl()
!)
Let's keep on generating stats, this time using group_by()
and summarise()
(plus verbs we have seen earlier)
1) Create a new dataset, called streams2
, summing up the streams by song.
streams2 <- streams %>% group_by(song_id) %>% summarise(totstreams = sum(streams))
2) Create a new dataset, called artists
, showing the count of songs by artist.
artists <- songs %>% count(artists)
3) Look at artists
, and spit out the row that lists the number of songs for Ellie Goulding (==
)
songs %>% count(artists) %>% filter(artists=='Ellie Goulding')
# A tibble: 1 × 2
artists n
<chr> <int>
1 Ellie Goulding 7
4) Refine your search from (3) to include also collaborations (tip: grepl()
!)
songs %>% count(artists) %>% filter(grepl('Ellie Goulding', artists, ignore.case=T))
# A tibble: 9 × 2
artists n
<chr> <int>
1 Calvin Harris; Ellie Goulding 1
2 Ellie Goulding 7
3 Ellie Goulding; Diplo; Red Velvet 1
4 Ellie Goulding; Diplo; Swae Lee 1
5 Ellie Goulding; Juice WRLD 2
6 Ellie Goulding; Juice WRLD; R3HAB 1
7 Ellie Goulding; Lauv 2
8 Ellie Goulding; Lauv; Syn Cole 1
9 Ellie Goulding; blackbear 1
inner_join()
– but keeps only observations that occur in both tables (potential of losing control!)left_join()
(“keep everything in the left table, add the right table where you can)streams
(left) with songs
(right)countries
to the result from (1).streams_merged.csv
)median
value of popularity in a given country at a given date.group_by()
and summarize()
!)streams_merged
(left_join()
)popularity_filled
(mutate()
)df = songs %>% top_n(10)
genres = unique(unlist(strsplit(df$genres, ',')))
for (genre in genres) df[[paste0('genre_', genre)]] = as.numeric(grepl(genre,df$genres))
wide = df %>% select(-genres)
long = wide %>% pivot_longer(cols = starts_with('genre'))
dplyr
one is useful (and very visual!)Complete exercises in this tutorial
.Rpres
file (available on GitHub)You will continue with the coaching session