When working on large-scale, data-intensive projects, it’s important to define a project pipeline and automate it end to end, so that you don’t have to produce your research results manually and can let the computer do the job for you. In this tutorial, you learn how to organize your files and directories and set up a makefile that executes your whole project with a single command!
Updated 26 September 2023
The solutions to the exercises and an FAQ are at the end of this document.
We will use make to automate the execution of our projects with a “single click”, so that our entire workflow is reproducible. Before you proceed, make sure you have fulfilled the prerequisites and installed the required software:
* Head over to Tilburg Science Hub and follow the make installation for either Windows, Mac or Linux. Verify whether you installed make successfully by typing make in the terminal.
* Also, make sure you can call R from the command prompt. On Mac and Linux, R is available from the command line by default. Windows users can follow the steps at the bottom of this page.
In this tutorial, we’re going to create a makefile that downloads, loads, transforms, and exports data from Inside Airbnb. The website provides a visual overview of the amount, availability, and spread of rooms listed on Airbnb across a city, as well as an approximation of the number of bookings and occupancy rate, and the number of listings per host. Moreover, the website publicly shares listings, calendar, neighbourhood and reviews data of various popular Airbnb destinations.
Exercise 1
As a starting point for this tutorial, we use the source code of an empirical project on the Airbnb housing market in Antwerp.
1. Open the file run_Antwerp.R in RStudio and run it. What output files does it return?
2. Type R --vanilla < run_Antwerp.R to execute the same file, but this time from the terminal (depending on your installation you may need to run Rscript run_Antwerp.R instead).
The lesson learned here is that we can run R scripts not only from within RStudio but also from the terminal. We’ll use this knowledge when creating our first makefile.
Makefiles are instructions (“rules”) for a computer on how to build “stuff”. Think of makefiles as a recipe you may know from cooking (“Baking a cake: First, take some flour, then add milk […]”) - but then for computers.
Makefiles originate in software development, where they have been used to convert source code into software programs that can then be distributed to users.
Researchers can use makefiles to define rules for how individual components (e.g., cleaning the data, running an analysis, producing tables) are run. When dependencies (e.g., to run the analysis, the data set first has to be cleaned) are well-defined, researchers can completely automate the project. When making changes to code, researchers can then easily “re-run” the entire project, and see how final results (potentially) change.
A rule in a makefile generally looks like this:
targets: prerequisites
	commands to build
In our example:
* The target is plot_all.pdf. make automatically checks whether any of the dependencies has changed (e.g., a change in the source code), so it can figure out which rules need to be run and which ones do not (saving you a lot of computation time!).
* The prerequisite is run_Antwerp.R (before building plot_all.pdf, the R script needs to exist).
* The command R --vanilla < run_Antwerp.R opens R and runs the script run_Antwerp.R.
Taken together, the makefile could look like this:
plot_all.pdf: run_Antwerp.R
	R --vanilla < run_Antwerp.R
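As a minimal sketch, the same rule can also be written with Rscript, the alternative way of running a script mentioned in Exercise 1. The automatic variable $< is standard make syntax for the first prerequisite; it is not used elsewhere in this tutorial, so treat it as optional. Note that the recipe line has to start with a tab character.

```makefile
# Target: plot_all.pdf. Prerequisite: run_Antwerp.R.
# $< expands to the first prerequisite, so the recipe runs "Rscript run_Antwerp.R".
plot_all.pdf: run_Antwerp.R
	Rscript $<
```

With this rule in place, running make builds plot_all.pdf only if the file is missing or older than run_Antwerp.R.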
Exercise 2
1. Create a new file called makefile (Mac and Linux users can run touch makefile in the terminal; Windows users can open a new file in an editor and save it, without file extension, as makefile). Open the makefile in a text editor.
2. Add the two lines above and save the changes. Open your terminal and run make. What happens?
3. Run make again. What does it say, and why is that?
4. In the R script, change the Universiteitsbuurt column to Stadspark and save the file. Run make -n (to preview the changes) and then type make to actually run it. Does it run this time, and why?
So far, we have defined a simple makefile that checks whether plot_all.pdf is present and up to date and, if not, runs the R script.
For each minor change, such as changing one of the areas in the plot, it
runs the entire script from top to bottom. This costs unnecessary
compute resources because the listings and reviews data have already
been downloaded and preprocessed. In practice, you therefore often want
to split up a large file into multiple modules so that you only
need to run those modules that contain changes.
In the figure below we impose a flow diagram structure in which the run_Antwerp script has been split up into five separate R modules:
* download.R = downloads the raw listings and review data from Inside Airbnb.
* clean.R = merges, preprocesses, and aggregates the data into a dataframe and writes it to aggregated_df.csv.
* pivot_table.R = generates a pivot table in which the columns are the city neighborhoods.
* plot_all.R = aggregates the number of reviews across all neighborhoods and creates a plot.
* plot_Antwerp.R = creates a line chart for three neighborhoods in Antwerp (Universiteitsbuurt, Sint Andries, Centraal Station).
Exercise 3
1. Split up the full R script into the 5 separate scripts according to
the diagram and definitions above. The sections indicated by comments
(e.g., ### CLEAN DATA) serve as potential split-off points.
2. Check if all scripts run without issues. If not, make the necessary
changes (tip: check if the libraries are imported at the top of each
file).
3. Redefine the dependencies in make for the targets reviews.csv, listings.csv, aggregated_df.csv, pivot_table.csv, plot_Antwerp.pdf, and plot_all.pdf so that workflow dependencies are triggered like a chain reaction. For example, plot_all.pdf requires aggregated_df.csv to be there, which in turn requires listings.csv, reviews.csv and clean.R. Use the figure above to identify the prerequisites and command files.
4. Now run make -n and make and see what happens. Does it work as expected? Indeed, make only builds the first target by default; next we show you how to use phony targets so that it executes all commands!
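The chain from the example in step 3 can be sketched for two of the targets as follows. This is only a partial illustration: it assumes that all scripts still sit in a single folder and that listings.csv and reviews.csv already exist. The remaining rules are left for the exercise.

```makefile
# plot_all.pdf needs the aggregated data and its plotting script.
plot_all.pdf: aggregated_df.csv plot_all.R
	R --vanilla < plot_all.R

# aggregated_df.csv needs both raw files and the cleaning script.
aggregated_df.csv: listings.csv reviews.csv clean.R
	R --vanilla < clean.R
```

If clean.R changes, make first rebuilds aggregated_df.csv and then plot_all.pdf, because the first rule lists aggregated_df.csv as a prerequisite.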
The targets we have seen so far refer to output files (e.g., aggregated_df.csv). Sometimes, however, a target does not correspond to an output file. We call these targets “phony targets”. Let’s consider the following makefile:
all: one two
one:
	touch one.txt
two:
	touch two.txt
Neither one nor two is an output file; they are phony targets that create two text files. The phony target all at the top calls both targets. Think of it as a “meta rule” to build it all!
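GNU make also lets you declare such targets explicitly with the special target .PHONY. This declaration is not part of the makefile above; the sketch below adds it as an optional safeguard, so that make runs these rules even if a file named all, one, or two happens to exist in the directory.

```makefile
# Tell make that these names are not files to be checked for freshness.
.PHONY: all one two

all: one two

one:
	touch one.txt

two:
	touch two.txt
```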
Exercise 4
Create another directory and add a makefile with the same contents as above. Run make and see what happens. Then remove all text files and run make one; what happens this time?
We can apply the same principle and add the following all target to our makefile:
all: plot_Antwerp.pdf plot_all.pdf aggregated_df.csv pivot_table.csv listings.csv reviews.csv
Exercise 5
Switch back to the other directory, add the all phony target to your makefile, and run make. What happens? Does it create all targets this time? Would it make a difference if you left the four csv files out of the all target, and why is that?
Part of building a proper data pipeline is having a directory structure in place. Up to this point, we did not explicitly tell you where or how to save your files so you probably stored all source and output files in a single folder. Especially for larger projects this tends to become messy, so instead we recommend having a structured mechanism in place for your file storage, which we explain below.
To start, let’s assume we’re working on a project called my_project. Let us create that directory somewhere on our computer, preferably not in the cloud (i.e., not in a Dropbox or Google Drive folder). Inside this folder you find the following subdirectories:
data
Raw data gets downloaded into the data folder of your project (my_project/data) from either a network drive or a remote file storage that is securely backed up.
src
Source code is made available in the src folder of your main project: my_project/src/. Create subdirectories for each stage of your pipeline:
* data-preparation: clean a dataset
* analysis: analyze the data cleaned in the previous step
* paper: produce tables and figures for the final paper.
The directory structure for the src directory thus becomes:
/src/data-preparation/
/src/analysis/
/src/paper/
gen
Generated files are stored in the gen folder of your main project: my_project/gen/. You can use subdirectories that match your pipeline stages to further bring structure to your project:
* input: This subdirectory contains any required input files to run this step of the pipeline. Think of this as a directory that holds files from preceding modules (e.g., the analysis uses the file exchange to pull in the dataset from its preceding stage in the pipeline, /data-preparation).
* temp: These are temporary files, like an Excel dataset that needed to be converted to a CSV data set before reading it in your statistical software package.
* output: This subdirectory stores the final result of the module. For example, in the case of a data preparation module, you would expect this subdirectory to hold the final dataset. In the case of the analysis module, you would expect this directory to contain a document with the results of your analysis (e.g., some tables or figures).
Exercise 6
1. Add a src folder in which you create two subdirectories: data-preparation and analysis. Note that you don’t need to create a data and gen folder as the makefile will take care of that later.
2. Store the five R scripts you created earlier in the correct
subdirectory:
/src/data-preparation/
- download.R
- clean.R
/src/analysis/
- pivot_table.R
- plot_all.R
- plot_Antwerp.R
3. Update the file paths in each script:
* download.R should store the data inside the data folder (even though it has not been created yet).
* clean.R should load the review and listings data from the data folder and store the output in gen/temp.
* pivot_table.R should load aggregated_df.csv from gen/temp and export the pivot table to gen/temp.
* plot_all.R should load aggregated_df.csv from gen/temp and export the plot into gen/output.
* plot_Antwerp.R should load pivot_table.csv from gen/temp and export the plot into gen/output.
Tip: keep in mind that all file paths are relative to the current storage location. For example, going from download.R inside src/data-preparation to the data folder requires two steps back and one change of directories (i.e., ../../data).
Exercise 7
Moving the files to different directories changed not only the file paths in the R scripts; the relative paths in the makefile are no longer up to date either. Make the necessary changes to the targets, commands, and prerequisites and then try running make -n to see whether it works as expected.
Tip: add Rscript -e "dir.create('data')" as a command above the one that runs download.R in the makefile, so that it first creates the data folder before it tries to store the downloaded data there. If you forget to do this and you didn’t create the folder manually, R will throw an error once it attempts to save a file in the unknown folder. Do the same for the gen/temp and gen/output subfolders (for nested folders like these, pass recursive = TRUE to dir.create so that the gen folder is created as well).
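One way to keep these folder-creating commands together is sketched below. The dirs target is a made-up name for illustration and not part of the tutorial's makefile; the sketch assumes that make is run from the project root, so that the paths resolve relative to that folder.

```makefile
# Hypothetical helper target: creates the folders the scripts write to.
.PHONY: dirs

dirs:
	Rscript -e "dir.create('data', showWarnings = FALSE)"
	Rscript -e "dir.create('gen/temp', recursive = TRUE, showWarnings = FALSE)"
	Rscript -e "dir.create('gen/output', recursive = TRUE, showWarnings = FALSE)"
```

Running make dirs creates the three folders; showWarnings = FALSE keeps R quiet when a folder already exists.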
Writing long file paths, such as ../../gen/temp/aggregated_df.csv, is cumbersome and prone to errors. Therefore, you can clean up your makefiles using variables like this:
TEMP = ../../gen/temp
DATA = ../../data
$(TEMP)/aggregated_df.csv: $(DATA)/listings.csv $(DATA)/reviews.csv clean.R
	R --vanilla < clean.R
In other words, you specify the file path as a variable at the top (e.g., TEMP = ...), after which you can reference it with $(VARIABLE).
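The same pattern extends to other paths. As a sketch, assuming a makefile that sits in the same folder as plot_all.R and an OUTPUT variable chosen here for illustration, a plotting rule could look like this:

```makefile
TEMP = ../../gen/temp
OUTPUT = ../../gen/output

# $(OUTPUT) and $(TEMP) are replaced by the paths defined above.
$(OUTPUT)/plot_all.pdf: $(TEMP)/aggregated_df.csv plot_all.R
	R --vanilla < plot_all.R
```

If the folder layout changes later, only the two variable definitions at the top need to be updated.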
Exercise 8
Adapt your makefile from exercise 7 such that repetitive paths are
replaced by variables.
Up to now, we have created a single makefile stored in the root directory that triggers the entire workflow. For clarity and structure reasons, however, it is recommended to have a makefile in both the data-preparation and analysis subfolder. This also makes referencing file paths to the files more straightforward. Therefore, we create two makefiles so that you end up with the following file structure:
/src/data-preparation/
- download.R
- clean.R
- makefile
/src/analysis/
- pivot_table.R
- plot_all.R
- plot_Antwerp.R
- makefile
Exercise 10
Create the makefile in src/analysis and src/data-preparation, copy the related make commands from the makefile in the root directory, and add an all phony target to both makefiles. Don’t forget to include the R -e "dir.create('directory_name')" commands!
Exercise 11
Run make -n in src/analysis and src/data-preparation separately and see whether it works as expected. Then, manually remove the data and gen folders and try to regenerate all files. How many times do you have to run make?
While having multiple makefiles creates structure and clarity, you don’t want to run make in each and every folder (especially if the number of subfolders grows). Fortunately, you can have a makefile in your root directory that triggers other makefiles. Here’s what the syntax looks like in our case:
all: analysis data-preparation
data-preparation:
	make -C src/data-preparation
analysis: data-preparation
	make -C src/analysis
There are two phony targets (data-preparation and analysis), each with the make -C command and a reference to the folder in which the makefile is stored.
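A common variation, recommended in the GNU make manual for recursive calls, is to write $(MAKE) instead of make. That way the sub-makefiles are run by the same make program and inherit command-line flags such as -n. The sketch below also declares the targets as phony; both changes are optional additions to the makefile above.

```makefile
.PHONY: all data-preparation analysis

all: analysis data-preparation

# $(MAKE) expands to the make program that is currently running.
data-preparation:
	$(MAKE) -C src/data-preparation

analysis: data-preparation
	$(MAKE) -C src/analysis
```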
Exercise 12
Create another makefile in the root directory, copy the contents from above, and run make again. Does it work as expected? How many times do you have to run make this time?
Just as the all phony target generates all files, another convention is to have a clean target that removes all generated folders and files. This way, you can easily clean up your directory and run make again to build all files from scratch. The code snippet below removes the data and gen folders as well as their contents.
clean:
	R -e "unlink('data', recursive = TRUE)"
	R -e "unlink('gen', recursive = TRUE)"
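On Mac and Linux, the same effect can be had with the shell's rm command. This is an alternative sketch, not the tutorial's solution: it assumes a Unix-like shell and will not work in the standard Windows command prompt, which is why the R-based version above is the more portable choice.

```makefile
.PHONY: clean

# Remove the data and gen folders together with everything inside them.
clean:
	rm -rf data gen
```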
Exercise 13
Add the clean phony target to the makefile located in the root directory. Then, run make clean to trigger it. Does it work as expected? Finally, run make again to generate the temporary files and plots again.
If you’ve worked through the entire set of exercises, this is what the final makefiles should look like:
root folder
all: analysis data-preparation
data-preparation:
	make -C src/data-preparation
analysis: data-preparation
	make -C src/analysis
clean:
	R -e "unlink('data', recursive = TRUE)"
	R -e "unlink('gen', recursive = TRUE)"
src/data-preparation
DATA = ../../data
TEMP = ../../gen/temp
all: $(DATA)/listings.csv $(DATA)/reviews.csv $(TEMP)/aggregated_df.csv
$(DATA)/reviews.csv: download.R
	R --vanilla < download.R
$(DATA)/listings.csv: download.R
	R -e "dir.create('../../data', recursive = TRUE)"
	R --vanilla < download.R
$(TEMP)/aggregated_df.csv: $(DATA)/listings.csv $(DATA)/reviews.csv clean.R
	R -e "dir.create('../../gen/temp', recursive = TRUE)"
	R --vanilla < clean.R
src/analysis
DATA = ../../data
TEMP = ../../gen/temp
OUTPUT = ../../gen/output
all: $(TEMP)/pivot_table.csv $(OUTPUT)/plot_Antwerp.pdf $(OUTPUT)/plot_all.pdf
$(TEMP)/pivot_table.csv: $(TEMP)/aggregated_df.csv pivot_table.R
	R -e "dir.create('../../gen/output', recursive = TRUE)"
	R --vanilla < pivot_table.R
$(OUTPUT)/plot_Antwerp.pdf: $(TEMP)/pivot_table.csv plot_Antwerp.R
	R --vanilla < plot_Antwerp.R
$(OUTPUT)/plot_all.pdf: $(TEMP)/aggregated_df.csv plot_all.R
	R --vanilla < plot_all.R
Creating a make workflow that runs from start to finish is an iterative process of trial and error. Here are a few pointers to debug your makefiles.
1. Is make properly installed?
To run makefiles, you have to have make installed on your system. So if you see a message like this…
'make' is not recognized as an internal or external command, operable program or batch file.
…head over to Tilburg Science Hub to see how to install make.
2. Does your code run without make?
There are two things that can go wrong when working with makefiles. Either there’s a problem with the makefile itself, or there’s a problem with the code that the makefile executes.
So, a first check is to copy-paste the commands to build (e.g., python script.py or R --vanilla < script.R) into the console, and see whether the Python or R code runs as expected.
Does the script run? Then you need to debug your makefile (see next). If it doesn’t run, first try to debug your source code in R/Python etc.
For example, it may happen that you get the following error:
dyld: Library not loaded: @rpath/libreadline.6.2.dylib
which means that make could not find the R installation. Using Rscript as opposed to R --vanilla < may solve the issue. Alternatively, specify the full path to your R installation to make sure it looks for the R library in the right place (e.g., define Rscript = /Library/Frameworks/R.framework/Versions/4.0/Resources/bin/Rscript at the top and use $(Rscript) script.R to execute the code thereafter).
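Put together, that could look like the sketch below. The path is the example quoted above and the rule is the one from the first makefile in this tutorial; both are placeholders to adapt to your own machine and project.

```makefile
# Full path to Rscript; adjust this to your own R installation.
Rscript = /Library/Frameworks/R.framework/Versions/4.0/Resources/bin/Rscript

plot_all.pdf: run_Antwerp.R
	$(Rscript) run_Antwerp.R
```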
3. Is your makefile structured properly?
One of the most common mistakes in a makefile is to not adhere to the syntax:
target: prerequisites [separated by spaces]
	COMMANDS TO RUN
Each command line needs to start with a tab character. Try to open the makefile in another editor, and verify that you have correctly made use of the tab. Note that some editors replace tabs by spaces, and you don’t want that here, or make will raise the error *** missing separator.
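As a small illustration, the first rule below uses the usual layout, where the recipe line starts with a single tab. The second uses the one-line form that make also accepts, with the command after a semicolon, which avoids the tab altogether. The targets are borrowed from the phony-target example earlier in this tutorial.

```makefile
# Recipe on its own line: the line below starts with one tab character.
one:
	touch one.txt

# Recipe on the rule line after a semicolon: no tab needed.
two: ; touch two.txt
```

If the first rule still triggers a missing separator error after copying it, the tab was most likely converted to spaces by the editor.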