dPrep - Opening lecture

Hannes Datta

Welcome to dPrep!

We're about to start with the first lecture of this class.

If you haven't done so, please

Agenda

  • Part 1 (8.45 to about 9.45)
    • Meeting each other!
    • Motivation for the course
    • Course framework and learning goals
    • Agenda and practical arrangements
  • Break
  • Part 2: R Bootcamp on your laptops (about 10.00 - 12.30)

Disclaimer

  • Minimal (use of) slides - use Ctrl/Cmd + or - to zoom in/out in your browser
  • Recording sessions (no editing, do remind me to post, please)
  • You will have to invest a lot of time and energy
  • Mix of lectures, team work and self-study – all are necessary
  • Consider me your coach, not your distant prof.
  • Slow me down (yes, that's needed)

Research and teaching (all openly available)

Getting to know your "science" skills...

  • Who's ever used R? (Share your coolest R experience!)
  • Experience with (data-intensive) research? How did you “tame” the data? How did you collaborate?
  • Fast forward to the future… where would you like to work? What's your passion?

Motivation for course (I)

  • did my PhD in quantitative marketing
  • coded a lot (data preparation, statistical modeling), but didn't learn how to structure my work
  • created a complete chaos (but still got published)

Motivation for course (II)

  • Cannot find code that prepped the data set
  • Cannot find code of the econometric model that eventually got published

What was so bad about it?

  • Reproducibility
    • I couldn't reproduce results whenever I wanted to
  • Replicability
    • My peers didn't really understand how I did things - so they also couldn't check whether I did it correctly
  • Efficiency
    • when making changes to data, I had to go back to the beginning and repeat all steps
    • a colleague asked me for the data years later; it wasn't properly documented!

Why should you care?

  • You will soon work on data-intensive projects - whether at University (e.g., thesis) or in business
  • You will change code continuously before a project is final
  • Team members or colleagues will look at and use your code
    • to help you
    • to continue your work
  • Costly investment in terms of time and effort, but…
  • Small efficiency gains will pay off soon!

What's efficient?

  • develop and prototype the “final pipeline” of your project, refine later
  • reduce setup costs to return to a project (e.g., finding relevant files, getting your code to work)
  • reduce coding mistakes (or catch them at all!)
  • rotate and collaborate on tasks (e.g., team members taking over)
  • share/reuse code (e.g., in packages)
  • receive feedback from others

What's not efficient?

  • Waiting (e.g., for results, for estimation)
  • Getting distracted while waiting
  • Forgetting how things were done/implemented (and why)
  • Losing data
  • Using code which isn't properly documented (“don't know how to use it”)
  • Becoming frustrated, feeling lost

Course objectives (I): Develop your coding skills

a few examples

Course objectives (II): Collaborate on research projects using GitHub (1/2)

a few examples

1) view, run, learn from, and extend the work of others

Course objectives (II): Collaborate on research projects using GitHub (2/2)

2) document your own code, and use unlimited version history

  • e.g., check out what a project looked like a while ago
  • e.g., clean up stuff (because you can roll back anyway!)

3) collaborate with others on projects

Course objectives (III): Automate your research pipeline

Course objectives (IV)

Apply coding, collaboration and automation in a data- and/or computation-intensive project

  • Data intensity
    • “big data”: volume, variety, veracity, velocity
    • “Where prototyping on small data makes sense”
    • e.g., Inside Airbnb
  • Computation intensity
    • small data, but long computation time
    • potential for running things in parallel
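
To make "running things in parallel" concrete, here is a minimal sketch in R. It assumes the `parallel` package that ships with base R and uses the built-in `mtcars` data as a stand-in for a real project; the function `fit_degree` and the choice of two workers are purely illustrative, not part of the course material.

```r
library(parallel)  # ships with base R, no installation needed

# Hypothetical "expensive" task: fit a polynomial regression of a given
# degree on a small built-in data set and return its R-squared.
fit_degree <- function(degree) {
  model <- lm(mpg ~ poly(hp, degree), data = mtcars)
  summary(model)$r.squared
}

degrees <- 1:4

# Sequential baseline: one model after the other
r2_sequential <- sapply(degrees, fit_degree)

# Same work, spread across two worker processes
cl <- makeCluster(2)
r2_parallel <- parSapply(cl, degrees, fit_degree)
stopCluster(cl)  # always release the workers when done

# Both approaches should return the same numbers
print(all.equal(r2_sequential, r2_parallel))
```

With a task this small, the parallel version is unlikely to be faster, because starting the workers has a cost of its own. The pattern tends to pay off only when each individual task takes a noticeable amount of time.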

Tips & tricks

  • Take time to become acquainted with the format of the course (e.g., public website, no Canvas)
  • Can be tough at first, but you will become more experienced rapidly!
  • Start preparing early on! (The first weeks will be the most challenging!)
  • Have the same group members across courses, mix skill levels
  • Collaborate with each other and try to help one another!

Course structure

  • Skill building (weeks 1-5)
    • Kicking off the week (video)
    • On-campus tutorial, followed by coaching sessions for feedback & activities
    • Self-study (readings, tutorials)
  • Project phase (weeks 2-7)
    • Apply skills in a team project
    • Evolves gradually as your skills improve
    • Building blocks / code snippets used to customize projects

Course framework

Course website

Visit https://dprep.hannesdatta.com!

  • it's freely available - spread the word
  • the course website is your #1 resource; Canvas is only used for
    • posting important announcements,
    • signing up for teams, and
    • submitting data challenges/projects
  • do all students have Canvas access?

Project

  • build a research pipeline
    • download data - prepare for analysis - run analysis - report
    • based on public data from IMDb; alternatively, you can use Inside Airbnb
    • specifics on the course site
  • team requirements
    • five persons (mix expertise if possible); no fewer, as the chance of dropout is high! Register on Canvas by early next week
    • repeaters can keep their previous grade, but must still register on Canvas (!)
  • evaluation
    • self- and peer assessment (own versus team performance)
    • grading rubric available on the course website
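
The four pipeline stages (download data, prepare for analysis, run analysis, report) can be sketched as one small R script. This is only an illustration under assumptions: the built-in `airquality` data stands in for a real download, and all function names (`get_data`, `prepare_data`, `run_analysis`, `write_report`) are hypothetical. The actual project setup is described on the course site.

```r
# Stage 1: "download" the data.
# Assumption: a built-in data set stands in for a real download.
get_data <- function() {
  datasets::airquality
}

# Stage 2: prepare for analysis (here: drop rows with missing values)
prepare_data <- function(raw) {
  raw[complete.cases(raw), ]
}

# Stage 3: run the analysis (here: a simple linear regression)
run_analysis <- function(clean) {
  lm(Ozone ~ Temp + Wind, data = clean)
}

# Stage 4: report (here: write the model summary to a text file)
write_report <- function(model, path) {
  writeLines(capture.output(summary(model)), path)
  invisible(path)
}

# Run the stages in order; each stage only uses the previous stage's output
report_file <- file.path(tempdir(), "report.txt")
raw   <- get_data()
clean <- prepare_data(raw)
model <- run_analysis(clean)
write_report(model, report_file)
```

Keeping each stage in its own function (or, later, its own script) means a change to the data preparation can be followed by rerunning the downstream stages, rather than repeating every step by hand.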

Project coaching

  • teams meet on campus
    • during scheduled coaching class hours
    • except in the last two weeks (online feedback in breakout rooms)
  • instructor walks around and addresses questions
  • coaching sessions may require preparation - check on the course website

Tutorials

  • make use of courses by datacamp.com (get access - see links on site and Canvas) and Software Carpentry
  • will require a lot of effort & feedback
  • four tutorials
    • explore new data using RMarkdown
    • learn how to collaborate on code using GitHub
    • prepare datasets for analysis
    • automate workflows using make
  • all skills are highly valued on the job market

Grading

  • Team project (40%)
    • with self- and peer assessment
    • hand in via GitHub
  • Computer exam (60%) on campus
    • mix of open (free-flowing answers) and closed questions (multiple choice)
  • Potential to get bonus points (max. 0.5 on your final grade!)
    • substantial enrichment of existing course material
    • contributing to a Tilburg-based open source project (e.g., Tilburg Science Hub, music-to-scrape.org, course websites)
    • talk to me before starting to work on it

My commitment

  • Boost your own efficiency by making your work reproducible & transparent
  • Help others by sharing work and collaborating on projects
  • Start learning R, but becoming an expert requires years of practice
  • Open software only (usable right away, no admin rights required)
  • Bring in your own ideas; don't be afraid to study topics outside the mainstream!
  • Discuss your own work and ideas, but this requires interaction & working hard

Brain-dead by coding

  • Coding can be extremely frustrating if you're starting out
  • I tend to become semi-“brain-dead” after a day of coding
  • Take breaks! Stop coding. Go for a run. Start again.
  • You will learn from your mistakes
  • Use cheat sheets and our support section.

→ quick feedback loops in first few weeks

Use of WhatsApp

  • Please use WhatsApp: +31 13 466 8938.