

Please pick one of these datasets as the primary data source for your team project.

Dataset 1: Yelp Reviews


Yelp is a widely used platform for discovering local businesses, including restaurants, shops, and services. It hosts a vast collection of user-generated reviews, photos, and detailed business information. The Yelp dataset is a curated subset of this data, released for academic research and educational use.

The Yelp dataset can be used to explore research questions and practical applications such as:

  • What factors influence the ratings and reviews that businesses receive?
  • Can we predict business success based on review sentiment?
  • What are the key characteristics of businesses in different metropolitan areas?
  • How do visual elements in user-uploaded photos relate to review sentiment or business ratings?
  • What are the trends in customer feedback across different industries and regions?
  • How can mobile apps leverage user-generated content to enhance local business discovery?
  • What are the characteristics of businesses that receive a high volume of tips?
  • What role do tips play in influencing other users’ perceptions and decisions?
  • What role do review photos play in influencing user decisions and conversion rates?

Download the data

Multiple datasets are available for download. The Yelp files are provided as JSON; to convert one to CSV:

  1. Obtain the conversion script by downloading or copying the code from this Gist.
  2. Save the script in your working directory.
  3. Run the script from your terminal, passing the JSON file as its argument:
python <script> yelp_academic_dataset.json

Replace <script> with the filename under which you saved the script, and yelp_academic_dataset.json with the path to the JSON file you wish to convert.

The script will generate a CSV file with the same name as your input JSON file, located in the same directory. For example, the input yelp_academic_dataset.json will produce yelp_academic_dataset.csv.
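
To show what such a conversion involves, here is a minimal sketch in Python. It is not the Gist script referenced above, and it rests on a few assumptions: the input is newline-delimited JSON (one object per line), nested objects and lists can be stored as JSON strings in a single CSV cell, and the function name json_lines_to_csv is hypothetical.

```python
import csv
import json
from pathlib import Path


def json_lines_to_csv(json_path):
    """Convert a newline-delimited JSON file to a CSV file in the same directory.

    Assumption: each non-empty line of the input holds one JSON object.
    """
    src = Path(json_path)
    dst = src.with_suffix(".csv")  # same name and directory, .csv extension

    # First pass: collect the union of top-level keys in first-seen order,
    # because not every record is guaranteed to carry every field.
    columns = {}
    with src.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                for key in json.loads(line):
                    columns.setdefault(key, None)

    # Second pass: write one CSV row per record.
    with src.open(encoding="utf-8") as f, \
            dst.open("w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=list(columns), restval="")
        writer.writeheader()
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            # Nested dicts and lists are kept as JSON strings in one cell.
            row = {
                key: json.dumps(value) if isinstance(value, (dict, list)) else value
                for key, value in record.items()
            }
            writer.writerow(row)
    return dst
```

Under these assumptions, calling json_lines_to_csv("yelp_academic_dataset.json") would write yelp_academic_dataset.csv next to the input file. Reading the file twice keeps memory use low for large files, at the cost of a second pass.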

Dataset 2: IMDb


IMDb (Internet Movie Database) is a go-to online platform for information about movies, TV shows, actors, directors, and more. It offers details like titles, release dates, cast info, ratings, and reviews, making it a popular resource for entertainment enthusiasts and professionals. Subsets of IMDb data can be accessed for personal/non-commercial purposes.

The IMDb data can help you answer a variety of research questions, such as:

  • How have trends in genre popularity evolved over time in the entertainment industry?
  • Is there a connection between the format of a title (movie, TV series, etc.) and its audience reception?
  • Does the involvement of specific creators (directors, writers) impact the success of their projects?
  • Which types of titles generate the most online word of mouth (as indicated by user-generated content like reviews and ratings)?
  • What factors most strongly influence the average rating of a movie or a TV show?
  • Are there differences in audience engagement between content targeted at adults and non-adults?
  • How does the popularity of crew members influence a title’s viewer engagement (e.g., votes, ratings)?
  • Is an individual’s fame related to their birth year?
  • Are there patterns in viewer engagement with TV series episodes based on release timing?

Download the data

Multiple datasets are available for download.
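
Once a file is downloaded, it needs to be loaded before analysis. The sketch below is one possible way to do that in Python, assuming the file is a gzip-compressed, tab-separated text file with a header row and that missing values are written as \N, which is how the IMDb non-commercial datasets are typically distributed. The function name read_imdb_tsv and the limit parameter are hypothetical.

```python
import csv
import gzip


def read_imdb_tsv(path, limit=None):
    """Read a gzipped, tab-separated file into a list of dicts.

    Assumptions: the first line is a header row and missing values are
    written as the two-character string \\N.
    """
    rows = []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters in titles as ordinary text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break  # stop early when only a preview is wanted
            # Map the missing-value marker to None.
            rows.append({k: (None if v == r"\N" else v) for k, v in row.items()})
    return rows
```

Passing a small limit is a convenient way to preview the columns of a large file before loading all of it.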