dPrep - Opening lecture

Hannes Datta

Welcome to dPrep!

We're about to start with the first lecture of this class.

If you haven't done so, please

Agenda

  • Part 1 (10.45 to about 12.15)
    • Meeting each other!
    • Motivation for the course
    • Course framework and learning goals
    • Agenda and practical arrangements
  • Break
  • Part 2: R Bootcamp on your laptops (about 12.30 - 14.30)

Disclaimer

  • Minimal (use of) slides - use Ctrl/Cmd + or - to zoom in/out in your browser
  • Recording sessions (no editing, do remind me to post, please)
  • You will have to invest a lot of time and energy
  • Mix of lectures, team work and self-study – all are necessary
  • Consider me your coach, not your distant professor
  • Slow me down (yes, that's needed!)

Research and teaching (all openly available)

Getting to know your programming skills...

  • Who's ever used R? (Share your coolest R experience!)
  • Experience with (data-intensive) research? How did you “tame” the data? How did you collaborate?
  • Fast forward to the future… where would you like to work? What's your passion?

Motivation for course (I)

  • did my PhD in quantitative marketing
  • coded a lot (data preparation, statistical modeling), but didn't learn how to structure my work
  • created a complete chaos (but still got published)

Motivation for course (II)

  • Cannot find code that prepped the data set
  • Cannot find code of the econometric model that eventually got published

What was so bad about it?

  • Replicability
    • I couldn't replicate results whenever I wanted to
  • Reproducibility
    • My peers didn't really understand how I did things - so they also couldn't implement similar designs to test effects
  • Efficiency
    • when making changes to data, I had to go back to the beginning, repeating all steps
    • a colleague asked me for the data years after; it wasn't properly documented!

Why should you care?

  • You will soon work on data-intensive projects - whether at university (e.g., thesis) or in business
  • You will change code continuously before a project is final
  • Team members or colleagues will look at and use your code
    • to help you
    • to take over your work
  • Learning 'good coding practices' is a costly investment (time/effort), but…
  • Small efficiency gains will pay off soon!

What's efficient?

  • develop and prototype the “final pipeline” of your project, refine later
  • reduce setup costs to return to a project (e.g., finding relevant files, getting your code to work)
  • reduce coding mistakes (or catch them at all!)
  • rotate and collaborate on tasks (e.g., team members taking over)
  • share/reuse code (e.g., in so-called “packages” or “libraries”)
  • receive feedback from others

What's not efficient?

  • Waiting (e.g., for results, for data to be prepped)
  • Getting distracted while waiting
  • Forgetting how things were done/implemented (and why)
  • Losing data
  • Using code which isn't properly documented (“don't know how to use it”)
  • Becoming frustrated, feeling lost

Course objectives (I): Develop your coding skills

1) Use R to clean and transform data for analysis (e.g., aggregation, merging, de-duplication, reshaping, data conversions, regular expressions)

2) Use R for generating automatic reports (e.g., to assess data quality, to report research findings in a paper) and deploying research findings in novel ways (e.g., apps)
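To give a feel for the cleaning steps named in objective 1, here is a minimal sketch in base R. The two small tables (`reviews`, `businesses`) and all column names are made up for illustration; they are not the course data, and the course tutorials may use other packages for the same steps.

```r
# Hypothetical toy data: reviews with two duplicated rows, and messy names
reviews <- data.frame(
  business_id = c("b1", "b1", "b2", "b2", "b2"),
  stars       = c(4, 4, 5, 3, 3),
  date        = c("2020-01-05", "2020-01-05", "2020-02-10",
                  "2020-03-01", "2020-03-01"),
  stringsAsFactors = FALSE
)
businesses <- data.frame(
  business_id = c("b1", "b2"),
  name        = c("  Cafe   Uno ", "Bar Due"),
  stringsAsFactors = FALSE
)

# De-duplication: drop rows that are exact copies of an earlier row
reviews_unique <- reviews[!duplicated(reviews), ]

# Data conversion: turn the character column into a proper Date
reviews_unique$date <- as.Date(reviews_unique$date)

# Regular expressions: trim names and collapse repeated whitespace
businesses$name <- gsub("\\s+", " ", trimws(businesses$name))

# Aggregation: mean rating and number of reviews per business
avg_stars <- aggregate(stars ~ business_id, data = reviews_unique, FUN = mean)
names(avg_stars)[2] <- "avg_stars"
n_reviews <- aggregate(stars ~ business_id, data = reviews_unique, FUN = length)
names(n_reviews)[2] <- "n_reviews"

# Merging: combine both aggregates with the business names
summary_tbl <- merge(businesses,
                     merge(avg_stars, n_reviews, by = "business_id"),
                     by = "business_id")
print(summary_tbl)
```

Each step writes to a new object instead of overwriting the raw data, so the raw input stays untouched and the script can be rerun from the top.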

Course objectives (II): Collaborate on research projects using GitHub (1/2)

1) collaborate with others on projects

2) document your own code, and use unlimited version history

  • e.g., check out what a project looked like a while ago
  • e.g., clean up stuff (because you can roll back anyways!)

a few examples

Course objectives (III): Automate your research pipeline

  • create and run portable, automated, and reproducible data pipelines
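The core idea behind an automated pipeline can be sketched in a few lines of base R: only rerun a step when its output is missing or older than its input. The helper `build_if_stale()` and the file names are hypothetical and only serve as illustration; a dedicated build tool handles this far more thoroughly.

```r
# Hypothetical helper (not part of any package): rebuild `target` only when
# it is missing or older than the `source` file it depends on.
build_if_stale <- function(target, source, build) {
  stale <- !file.exists(target) || file.mtime(target) < file.mtime(source)
  if (stale) build(source, target)
  invisible(stale)
}

# Toy stages, using a temporary folder so that nothing is left behind
work_dir <- tempfile("pipeline_")
dir.create(work_dir)
raw_file   <- file.path(work_dir, "raw.csv")
clean_file <- file.path(work_dir, "clean.csv")

# Stand-in for "download data": write a small raw file with one duplicate
write.csv(data.frame(id = c(1, 2, 2), stars = c(5, 3, 3)),
          raw_file, row.names = FALSE)

# "Prepare for analysis": remove duplicated rows and save the result
prepare <- function(source, target) {
  raw <- read.csv(source)
  write.csv(raw[!duplicated(raw), ], target, row.names = FALSE)
}

# First call builds the clean file; second call finds it up to date and skips
first_run  <- build_if_stale(clean_file, raw_file, prepare)
second_run <- build_if_stale(clean_file, raw_file, prepare)
```

Because the step is skipped when nothing changed, rerunning the whole script is cheap, which is what makes it practical to rebuild a project from scratch at any time.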

a few examples

Tips & tricks

  • Take time to become acquainted with the format of the course (e.g., public website for course material, Canvas for submissions)
  • Can be tough at first, but you will become more experienced rapidly!
  • Start preparing early on! (The first weeks will be the most challenging!)
  • Have the same group members across courses, mix skill levels
  • Collaborate with each other and try to help one another!

Course structure

  • Skill building (weeks 1-5)
    • On-campus tutorial (Hannes), followed by coaching sessions for feedback & activities (Roshini)
    • Self-study (readings, tutorials)
  • Project phase (weeks 2-7)
    • Apply skills in a team project
    • Evolves gradually as your skills improve
    • Building blocks / code snippets used to customize projects

Course framework

Course website

Visit https://dprep.hannesdatta.com!

  • it's freely available - spread the word
  • Course website is your #1 resource, Canvas is only used for
    • posting important announcements,
    • signing up for teams, and
    • submitting data challenges/projects
  • do all students have Canvas access?

Project

  • build a research pipeline
    • download data - prepare for analysis - run analysis - report
    • based on public data from Yelp and IMDb, specifics see course page
  • team requirements
    • five persons (mix expertise if possible; no fewer, as the chance of dropout is high!), register on Canvas by early next week
    • repeaters can keep their previous grade, but still need to register on Canvas (!)
  • evaluation
    • self- and peer assessment (own versus team performance)
    • grading rubric available on the course website

Project coaching

  • teams meet on campus
    • during scheduled coaching class hours
    • except in the last two weeks (online feedback in breakout rooms)
  • instructor walks around and addresses questions

Tutorials

  • make use of courses by datacamp.com (get access - see links on site and Canvas) and Software Carpentry
  • will require a lot of effort & feedback
  • four tutorials
    • explore new data using RMarkdown
    • learn how to collaborate on code using GitHub
    • prepare datasets for analysis
    • automate workflows using make
  • all skills are highly valued on the job market

Grading

  • Team project (40%)
    • with self- and peer assessment
    • hand in via GitHub
  • Computer exam (60%) on campus, no internet
    • mix of open (free-flowing answers) and closed questions (multiple choice)

My commitment

  • Boost your own efficiency by making your work reproducible & transparent
  • Help others by sharing work and collaborating on projects
  • Start learning R, but becoming an expert requires years of practice
  • Open software only (usable right away, no admin rights required)
  • Bring in your own ideas, don't be afraid to study topics outside the mainstream!
  • Discuss own work and ideas, but this requires interaction & working hard

Brain-dead by coding

  • Coding can be extremely frustrating if you're starting out
  • I tend to become semi-“brain-dead” after a day of coding
  • Take breaks! Stop coding. Go for a run. Start again.
  • You will learn from your mistakes
  • Use cheat sheets and our support section.

→ quick feedback loops in first few weeks

Getting support

  • Course chatbot in collaboration with tilburg.ai (see https://dprep.tilburgai.nl)
  • Text me instead? Please use WhatsApp: +31 13 466 8938.

  • Email is usually much slower
  • Check out the course's support section

What's in it for you?

  • Enhanced data analysis abilities: knowing the latest packages
  • Competitive advantage: everyone learns R, barely anyone learns GitHub and automation skills (but they are so useful)
  • Efficient collaboration: we also focus on project management - making you more productive in the future
  • Reproducible workflows: saves tim