dPrep - Course Summary & Exam Preparation

Hannes Datta

Welcome to the final lecture in dPrep!



If you haven't done so, please explore the exam page & example questions at https://dprep.hannesdatta.com/docs/exam.

Agenda

  • Course summary
  • From here onwards
    • During your studies
    • After your studies
  • Course evaluation
  • Exam preparation
  • Remaining questions / Q&A

Course framework

Positioning in the study program

Lessons learnt #1: Versioning and Project Management with Git(Hub) (I)

  • Scrum & the project board as a way to collaborate with one another
    • Writing issues is hard (!)
    • Seems to put a lot of structure on stuff, did it work out?
    • Recall how to work together – team <-> individual
  • Versioning can be powerful!
    • delete stuff that isn't needed, roll back when you want
    • Oh no, not authenticated!
    • Oh no! Can't push! (pull first!)
    • GUIs (e.g., in R or VS Code) are available, too!
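
A minimal sketch of "roll back when you want", assuming Git is installed and working in a throwaway directory. The file name clean.R and the user identity are hypothetical, and git revert is only one of several ways to roll back (it adds a new commit that undoes an earlier one, so the history stays intact).

```shell
set -eu

# Throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo Student"        # hypothetical identity, local to this repo
git config user.email "demo@example.com"

# First version of a (hypothetical) cleaning script.
echo 'print("v1")' > clean.R
git add clean.R
git commit -q -m "Add first version of clean.R"

# Second version that we later regret.
echo 'print("v2, broken")' > clean.R
git commit -q -am "Change clean.R"

# Roll back: create a new commit that undoes the most recent one.
git revert --no-edit HEAD > /dev/null

cat clean.R          # file content is back at the first version
git log --oneline    # history shows three commits, including the revert
```

On a shared repository, the "can't push" situation usually means the remote has commits you do not have yet; running git pull before git push is the typical fix.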

Lessons learnt #1: Versioning and Project Management with Git(Hub) (II)

  • Ready to use Git/versioning in a business
    • Know what to version, and what not
    • Purpose of .gitignore
    • How to work together (issues, feature branches, pull requests/PRs)
  • Collaborate on open source projects
    • You know what forks are!
    • You may even have contributed to public projects!
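
A small sketch of what .gitignore does, assuming the src/gen/data directory layout used in this course and the common convention that source code is versioned while raw data and generated files are not. All file names are hypothetical.

```shell
set -eu

proj="$(mktemp -d)"
cd "$proj"
git init -q

# Hypothetical project layout: source code, raw data, generated files.
mkdir -p src data gen
echo 'x <- 1' > src/prep.R
echo 'raw'    > data/raw.csv
echo 'tmp'    > gen/clean.csv

# Tell Git to skip raw data, generated files, and R's history file.
printf 'data/\ngen/\n.Rhistory\n' > .gitignore

# "git add ." stages everything except what .gitignore excludes.
git add .
git ls-files         # lists only .gitignore and src/prep.R

# check-ignore exits with status 0 when a path is ignored.
git check-ignore -q data/raw.csv && echo "data/raw.csv is ignored"
```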

Lessons learnt #2: Data Exploration

  • RMarkdown! (mixing code with reporting)
    • ability to quickly produce clean docs to share!
    • uh… but how to run it w/ make?
    • verify you can render markdown documents!
  • Doing data quality checks?
    • getting back to your data supplier if needed!
    • reporting summary stats
  • Use it when it is useful (e.g., produce a doc, slides, etc., NOT data cleaning)

Lessons learnt #3: Data Engineering

  • Why do we actually have to clean up data?!
  • Data prep code can be sooo complex
    • so… use cheatsheets (also for the exam!)
    • know & apply common data operations (what are those?)!
  • Make ITO blocks
    • setup to load libraries, then input, transformation, output!
    • super crucial for make, too!
  • Modularize code, loop, and use functions

Lessons learnt #4: Pipeline building and automation

  • Clarity about where to store stuff
    • review! (components - src, gen, data; modules - e.g., analysis, data-prep, …)
  • Run the work others have done
    • my work, your work, somebody else's work
  • Discover mistakes in code
    • e.g., packages that weren't loaded, order of cells/code
  • Save time!
    • but, yeah, NOT at the beginning, when you are still learning
  • Continuously improve documentation & code

Common mistakes w/ make

  • Multiple targets, not multiple rules
  • The rule on top! That's where everything starts.
  • Gradually build your makefile. Go in baby steps!
  • Happy to take a look at your repositories now - anyone?
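
A sketch of "the rule on top", assuming GNU make is installed. When make is run without naming a target, it builds the first rule in the makefile and works backwards through that rule's prerequisites. The file names and the two-step pipeline are hypothetical; the makefile is written with printf so that the tab required before each recipe line is explicit.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# Rule on top: "all" only names the final product of the pipeline.
printf 'all: gen/report.txt\n\n' > makefile
# The report is built from the cleaned data.
printf 'gen/report.txt: gen/clean.txt\n\tcp gen/clean.txt gen/report.txt\n\n' >> makefile
# The cleaned data is built from the raw data.
printf 'gen/clean.txt: data/raw.txt\n\tmkdir -p gen\n\ttr a-z A-Z < data/raw.txt > gen/clean.txt\n' >> makefile

# Hypothetical raw input.
mkdir -p data
echo 'hello' > data/raw.txt

make                 # no target given, so make starts at "all"
cat gen/report.txt
```

Building a makefile in baby steps could mean adding one rule at a time and running make after each addition, so that an error can only come from the rule just written.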

Any other questions about your projects?

We can address them now!

Reflection about the course

  • it's amazing to see you learn & grow!
    • that feeling you get when your code runs…!
    • I can talk to you as “members of my own dev team”
    • I've witnessed you “seeing code and feeling what it does!”
  • different levels
    • many started R from scratch; some could already program but were still challenged!
  • saw so much: different projects, data sets, tools, “way of doing things”, way of coding, videos, websites

My recommendation: take time to let it sink in. And trust in my choice of teaching you this. Your hard work will pay off.

Looking ahead: During your studies (I)

  • Gradually implement across classes or projects
    • e.g., some projects just benefit from better directory structure, while others may need “more”
    • risk of “losing” the skill (!)
  • Realize it takes time to learn
    • it took me years to become proficient

Looking ahead: During your studies (II)

  • Use for thesis!
    • but be aware: your skills are so fresh that your professor has likely never heard of these tools! -> use them, teach them, show how productive you are!
  • Build your job market profile
    • e.g., have a Hugo website with your CV
    • e.g., “pin” your best repositories (but don't call them dprep-team-X; remember, you're a marketer!)
    • have a compelling about page on GitHub

Looking ahead: During your studies (III)

Looking ahead: After your studies

Next steps: Submissions and preparing for the exam

Exam planning

  • Organization
    • 4 April (time tba; 3 hours)
    • On campus, using TestVision
  • Software & materials
    • access to R/RStudio, Git, make
    • access to github.com/course-dprep and classroom.github.com; no access to ChatGPT or other AI tools
    • I'm making selected resources available on the instruction page - check them out!
  • How to prepare?

Some tips for your exam

  • Expect an unexpected data set & data wrangling
  • Know common data operations in dplyr & become fast!
  • When handing in documents, check what I require (for .Rmd, I sometimes ask for rendered .pdf documents - does rendering work on your computer?)
  • Be prepared to work with Git Bash
    • know how to make commits with commit messages, create branches, switch branches, etc.
    • download the Git repository, unzip it, make your commits, zip it again, and submit it as a zip file
  • Be prepared to work with GitHub
    • know how to clone, fork, write issues, do PRs
    • know how to roll back to a previous version
  • Be prepared to use make
    • run, correct and develop new make workflows
  • Be prepared to download data sets from TestVision (.Rdata) - on the cover page of the exam or in a specific question on the exam.
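
A practice sketch for the Git Bash tips above (commit messages, creating branches, switching branches), assuming Git is installed. The repository, branch name, and file names are hypothetical; git checkout -b is used because it also works on older Git versions, and git switch -c is the newer equivalent.

```shell
set -eu

exam="$(mktemp -d)"
cd "$exam"
git init -q
git config user.name "Demo Student"        # hypothetical identity, local to this repo
git config user.email "demo@example.com"

# First commit on the default branch.
echo '# analysis' > README.md
git add README.md
git commit -q -m "Add README"
main_branch="$(git symbolic-ref --short HEAD)"   # "main" or "master", depending on setup

# Create a feature branch and switch to it in one step.
git checkout -q -b feature-summary-stats
echo 'summary(cars)' > summary.R
git add summary.R
git commit -q -m "Add summary statistics script"

# Switch back: summary.R disappears from the working directory,
# because it only exists on the feature branch.
git checkout -q "$main_branch"
ls
git branch           # lists both branches; the current one is marked with *
```

When zipping a repository for submission, the hidden .git folder presumably needs to be included, since that is where the commits and branches are stored.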

Learning goals + distribution of points

  • 100 points in total, about 25 questions
  • Mix of open and closed questions
  • Learning goals & question weights
    1. Use R to clean and transform data for analysis (e.g., aggregation, merging, de-duplication, reshaping, data conversions, regular expressions) [synthesis; 20% of points]
    2. Use GitHub for managing empirical research projects (e.g., GitHub Issues and Project Boards) [evaluation; 10% of points]
    3. Use Git/GitHub for versioning files and collaborating on privately-shared and publicly-available (open science) GitHub repositories [application; 30% of points]
    4. Use R for generating automatic reports (e.g., to assess data quality, to report research findings in a paper) and deploying research findings in novel ways (e.g., apps) [comprehension; 15% of points]
    5. Use Workflow Management Tools to create and run portable, automated, and reproducible research pipelines [application; 25% of points]

Next steps: Official course evaluation

  • Course evaluation has been immensely important to this course

    • this semester: switched order of sessions, revised GitHub tutorial/onboarding
    • last semester: new data set, coaching sessions with breakout groups on campus and online
  • Course evaluation has been critical to my career

    • Without my past evaluations, I wouldn't be teaching you today
    • I will look at all comments.
    • Scores matter most for demonstrating the importance of this course
  • You will be invited via Evalytics at the end of the week

Next steps: Self- and peer assessment

  • You will be invited via e-mail
  • To be filled in using Google Forms

Informal feedback

  • Coaching sessions? (Online, offline)
  • How was it for beginners?
  • What are three things you'd like me to change?

Stay in touch!