dPrep - Course Summary & Exam Preparation

Hannes Datta

Welcome to the final lecture in dPrep!



If you haven't done so, please explore the exam page & example questions at https://dprep.hannesdatta.com/docs/exam.

The course evaluation is live at https://app.evalytics.nl. Please voice your opinions!

Agenda

  • Course summary
  • From here onwards
    • During your studies
    • After your studies
  • Course evaluation
  • Exam preparation
  • Remaining questions / Q&A

Course framework

dprep

Positioning in the study program

dprep

Lessons learnt #1: Versioning and Project Management with Git(Hub) (I)

  • Scrum & the project board as a way to collaborate with one another
    • Writing issues is hard (!)
    • Seems to put a lot of structure on stuff, did it work out?
    • Recall how to work together – team <-> individual
  • Versioning can be powerful!
    • git log to see recent commits
    • delete stuff that isn't needed (“housekeeping”), roll back when you want (git checkout COMMITID -b branch)
    • Oh no, not authenticated! (browser authentication works best)
    • Oh no! Can't push! (pull first!)
    • Use GUIs (e.g., in R or VS Code), complemented with Git Bash

Lessons learnt #1: Versioning and Project Management with Git(Hub) (II)

  • Ready to use Git/versioning in a business setting
    • clone repositories, switch brandes (git switch BRANCHNAME or git checkout BRANCHNAME)
    • Know what to version, and what not
    • Use .gitignore (and know what it CANNOT do)
    • Work together using issues, feature branches & pull requests (PRs)/merge requests
  • Collaborate on open source projects
    • You know what forks are!
    • You may even actually contributed to public projects!

Lessons learnt #2: Data Exploration

  • RMarkdown! (mixing code with reporting)
    • ability to quickly produce clean documents to share!
    • uh… but how to run it with make? (tip: R -e 'rmarkdown::render(input = "test_render.Rmd", output_file = "test_render.html")')
    • verify you can render markdown documents to HTML and PDF!
  • Doing data quality checks?
    • getting back to your data supplier if needed!
    • reporting summary statistics (summary, table, dplyr for summary stats by group)
    • plotting with ggplot2
  • Use RMarkdown when it is useful (i.e., to produce a doc, slides; NOT for data cleaning)

Lessons learnt #3: Data Engineering

  • Why do we actually have to clean up data?!
  • Data prep code can be complex
    • so… use cheatsheets (also for the exam!)
    • know & apply common data operations (what are those?)!
  • Make setup-ITO blocks
    • setup to load libraries, then input, transformation, output!
    • super crucial for make, too! (know why?)
  • Modularize code, loop, and use functions

Lessons learnt #4: Pipeline building and automation

  • Clarity about where to store stuff
    • review! (components - src, gen, data; modules - e.g., analysis, data-prep, …)
  • Run the work others have done
    • my work, your work, somebody else's work
  • Discover mistakes in code
    • e.g., packages that weren't loaded, order of cells/code
  • Save time!
    • but, yeah, NOT at the beginning when you still learn
  • Continuously improve documentation & code

Common mistakes w/ make

  • Wrong: One “all” rule with all targets as prerequisites (correct: final products/files to be built)
  • Wrong: Incomplete dependencies (correct: all input files + own source code per recipe)
  • Wrong: Multiple Rscript execution commands per recipe (correct: write separate rules)
  • Gradually build your makefile. Go in baby steps!
  • Happy to take a look at your repositories now - anyone?

Any other questions about your projects?

We can address them now!

Reflection about the course

  • it's amazing to see you learn & grow!
    • responded very well to overhauled engineering tutorial :)
    • can talk to you as “members of my own development team”!
    • many started R from scratch; some could program but still were challenged with make and GitHub!
  • we've seen a lot: different projects, data sets, tools, “way of doing things”, way of coding, videos, websites

My recommendation: take time to let it sink in. And trust in my choice of teaching you this. Your hard work will pay off.

Looking ahead: During your studies (I)

  • Selectively implement across classes or projects
    • e.g., some projects just benefit from better directory structure, while others may need “more”
    • risk of “losing” the skill (!)
  • Realize it takes time to learn (and debug!)
    • it took me years to become proficient
    • onboard team members to “your” way of working

Looking ahead: During your studies (II)

  • Use for thesis!
    • but, be aware your skills are so fresh, your professor likely has never heard about it! –> use, teach, show how productive you are!
  • Build your job market profile
    • have a compelling about page on GitHub
    • “pin” your best repositories (but don't call them dprep-team-X; remember)

Looking ahead: During your studies (III)

Looking ahead: After your studies

Next steps: Submissions and preparing for the exam

  • Deadline for project coming up next week!
    • remember: test it on different computers and operating systems
    • do not use absolute paths!
    • verify data can be downloaded smoothly
    • test make thoroughly
    • avoid testing any statistical assumptions - this is about a running workflow, not about research methods!

Exam planning (I)

  • Organization
    • 3 April (time tba; 3 hours)
    • On campus, using TestVision
  • Software & materials
    • access to R/RStudio, Git, make
    • no access to the internet
    • I'm making selecting resources available on the instruction page - check them out here
    • package installation: I will test the exam on Thu next week; if a package is not installed (yet), I will include installation instructions on the exam.
  • familiarize yourselves with how TestVision works (link on Canvas + will show here) + review example questions (next slides…)

Learning goals, distribution of points + tips (I)

  • 100 points in total, about 9 questions
  • Mix of open and closed questions

  • LG 1: Use R to clean and transform data for analysis [synthesis; 20% of points]

    • know core concepts and corresponding code in tidyverse/dplyr: e.g., aggregation, merging, de-duplication, reshaping, data conversions, regular expressions, dealing with time, pivoting)
    • become fast in coding - no time to “look up” these commands elaborately
    • open .Rdata, submit code and/or submit saved .Rdata file.
    • several “smaller” questions
    • I simulated the data

Learning goals, distribution of points + tips (II)

  • LG 2: Use GitHub for managing empirical research projects [evaluation; 10% of points]
    • conceptual questions on GitHub issues, project boards, Scrum, etc.
  • LG 3: Use Git/GitHub for versioning files and collaboration on privately-shared and publicly-available (open science) GitHub repositories [application; 30% of points]
    • question type: download a zipped repository
    • Be prepared to work with Git Bash (e.g., git checkout, git merge, git add, git status, git log, git branch, etc.)
    • download git repository, unzip, do your commits, zip again and submit as a zipfile
    • know how to make commits with commit messages, create branches, switch branches, etc.

Learning goals, distribution of points + tips (III)

  • LG 4: Use R for generating automatic reports (e.g., to assess data quality, to report research findings in a paper) and deploying research findings in novel ways (e.g., apps) [comprehension; 15% of points]
    • inspect “new”/unseen data
    • covering ggplot2 for data visualization
    • When handing in documents, check what I require (for .Rmd, I sometimes ask for rendered .pdf documents - does it work on your computer?)
  • LG 5: Make
    • Use Workflow Management Tools to create and run portable, automated, and reproducible research pipelines [application; 25% of points]
    • Expect a zip file containing a make pipeline
    • Complete a few smaller tasks: run, correct and develop new make workflow

Learning goals, distribution of points + tips (IV)

  • Work on a joint cheat sheet (link has been shared on Canvas) - deadline one week before the exam
  • Be prepared to download data sets from Testvision (.Rdata) - on the Cover page of the exam or in a specific question on the exam.

Next steps: Official course evaluation

  • Course evaluation has been immensely important to this course

    • this semester: overhauled all tutorials (e.g., ggplot in week 2, advanced engineering in week 4)
    • last semester: switched order of sessions, revised GitHub tutorial/onboarding, new data sets, coaching sessions with breakout groups on campus and online
  • Course evaluation has been critical to my career

    • Without my past evaluations, I wouldn't be teaching to you today
    • I will look at all comments. Based on student feedback, we now have Q&A for setups & additional Q&As towards the end of the course.
    • Scores are most important to show importance of this course
  • You will be invited via Evalytics at the end of the week

Next steps: Self- and peer assessment

  • You will be invited via e-mail
  • To be filled in using Google Forms

Informal feedback

  • Coaching sessions? (Online, offline)
  • How was it for beginners?
  • What's are three things you'd like me to change?

Stay in touch!