Automation with make (in-class tutorial)

Hannes Datta

Before we get started

  • Last “content” week
  • Exam
    • registration via Osiris!
    • check example questions on the course website
  • Today's material…

Feedback on team projects (I)

  • working together in teams
    • working with Git locally (vs. in the cloud)
    • being able to push & pull; work in branches
    • scrum/project board & issues vs. individual work
  • data on GitHub
    • do not store any raw data on GitHub but download programmatically
    • start having a proper directory structure

Feedback on team projects (II)

  • coding conventions
    • absolute vs. relative file names
    • generalizable – no “amsterdam”-specific variable names when possible
    • install packages: can be done in one script – see here
  • preparation: be clear about unit of analysis (e.g., date & city, movie, etc.)

Where we are heading today

Let's describe our toolstack…

  • We use Git/GitHub to collaborate on code
    • versioning of files - frequent commits
    • issues, branches and pull requests
    • management via project board
  • We have learnt how to explore data and engineer datasets
    • Rmarkdown & plotting
    • tidyverse & data preparation, setup + ITO
  • But… process is still quite manual
    • how to “run” code? (e.g., order, all code files?)

Today, we learn how to automate projects.

Automation tools

  • Many automation tools on the market
  • Originally designed to automate software builds
  • Launches “other” software programs (e.g., R), mostly on the terminal/command prompt
  • We use make
    • probably one of the oldest ones on the market
    • free/open source, widely used

Demonstration

I'll show you the basic functions of make in one of my projects.

  1. Directory structure
  2. Look at “makefile” and its rules (more later)
  3. Run make -n, make

DO: Verify make and R are properly installed

If you can't run make and Rscript from your terminal, you won't be able to proceed.

  1. Verify make is properly installed
    • open the terminal/command prompt and type make (enter).
    • See this? make: *** No targets specified – then it is correctly installed!
  2. Verify you can run Rscript from the terminal/command prompt
    • type Rscript (enter).
    • See output starting with Usage: Rscript [options] file [args]…? – then you are all set!

Do: Getting familiar with today's data/practice project

Today's project is based on a prototype project using Airbnb data. Please download the source code now (run code below).

download.file('https://gist.githubusercontent.com/hannesdatta/1f764090ae3ebfaabe7a91ea4b971c3d/raw/2df12443cd144141dedb866967cbe985b97052f5/run_antwerp.R', 'run_antwerp.R')
  1. Open the file run_antwerp.R in RStudio to check out the code - can you run the first few lines? Does it work?
  2. Try running the file from the terminal, by typing Rscript run_antwerp.R. What happens?
  3. Try using this command instead: R --vanilla < run_antwerp.R - what difference does it make?

Understanding `run_antwerp.R`

  • One script that prepares and analyzes data
  • But: in reality, our project will consist of multiple code snippets
    • …and each of these code snippets depends on other code snippets and data
    • …and feature nicely structured ITO-blocks (remember?)

We will now gradually “modularize” (split in multiple code files) and automate our project, using make.

Writing our first make recipe

  • To automate projects, we will write a makefile and put it in the root of our project directory
  • A makefile consists of one or multiple “recipes”
    • what to build? (filenames!)
    • what are prerequisites, dependencies? (also files)
    • how to build it? (a command to execute stuff)

DO: Writing our first make recipe

  • Create a new file, called makefile (no extension)
  • Paste the code below and save (use a tab to indent R --vanilla or Rscript…)
  • Then type make on terminal. What happens?
plot_all.pdf: run_Antwerp.R
  R --vanilla < run_Antwerp.R

DO: Running make again

Let us now change something in the source code of run_antwerp.R, e.g., change Universiteitsbuurt to Stadspark to experience amake workflow.

  • Type make -n (preview what will have to be done)
  • Type make (to run the pipeline)
  • Type make again (verify everything up-to-date & makefile has been correctly specified)

Last step really important and often forgotten in projects.

DO: Cleaning up the project directory

When working with make, it's handy to keep our project tidy (i.e., get right of “old” files automatically).

Try implementing such a clean rule in make:

clean:
  R -e "unlink('*.pdf')"
  1. Run the rule by typing make clean. What happens?
  2. Do we need to add more file extensions? Just add some lines and try again!
  3. Does this code snippet feature ITO building blocks?
  4. What does R -e do?

Towards modularization

  • We ultimately will have to split our project in separate files. (Why?)

How? split up code, implement setup + ITO, write makefile rule, test rule. Step-by-step.

Demonstration: ITO and make for download.R

  • Dependencies: only itself, download.R
  • Targets: listings.csv and reviews.csv
  • How to execute: Rscript download.R

  • Define one global rule on top of the makefile to be able to run it.

Do: Let's continue splitting up our project and trying to automate it

How? split up code, implement ITO, write makefile rule, test rule. Step-by-step.

Getting the directories right

  • We now should have a fully automated pipeline.
  • But… we have not cleanly separated src files from generated output files. See also here for full description on how to name directories.

  • Do:

    • Create a src code directory and move source code files there
    • Use code to create a gen directory in an R file (preferably run early on)
    • Change directory names in output files in .R files
    • At each step, verify whether make still runs.

If we have time left...

  • I could clone one of your team repositories and
    • try automating the first few steps…
    • contributing my code via a pull request.

Let's take a step back...

  • Make is the final software package we cover in this course
  • Like always in coding, take one step at a time.
    • Do not write “one large code chunk” that doesn't work
    • Instead, work in small steps from a working code snippet to another working code snippet.
  • Make “accompanies” your coding, and constantly allows you to check whether you've done things right.
  • Can become a great timesaver in your career!

Reflection

  • You've covered it all - congrats!
  • Now it's time to let it sink in. And practice.
    • GitHub for code and project management (not data!). Work locally.
    • Work according to scrum method in feature branches.
    • Explore, prepare and analyze data using R. Focus on workflow, proper setup + ITO building blocks
    • Automate entire pipeline - ensure it runs on other computers
    • Do all steps at the same time. Not separately!
  • This is a state-of-the-art toolkit that will put you ahead of the curve in productivity. In your Master program, and beyond.

Next steps

  • Three more coaching sessions
    • one today (on campus)
    • two more (week 6 and end of week 7 - online; attendance optional)
    • will create breakout rooms
  • Exam preparation
    • in final lecture (beginning of week 7)
  • Any questions?

Coaching session

  • Let's get started to identify 'broad' issues - can try addressing these issues specifically
  • Work on grading criteria 3.1 (source code quality), and 3.2 (automation using make)