Michael Lan

Regression Analysis on Video Game Sales using Unstructured Data from Reddit

Updated: Feb 2, 2020

Non-Technical Explanation


Reddit is a popular social media app with a video game community called “r/gaming” that has over 20 million subscribers. The main objective of this project was to assess whether a game’s popularity on r/gaming was a good predictor of that game’s sales revenue in North America. Our analysis found that, of all the metrics we looked at, the r/gaming popularity metric we developed was the strongest predictor of video game sales revenue in North America.


The results of this analysis can help determine how much a gaming company should invest in marketing on Reddit. Social media is a powerful tool for enhancing customer engagement, so marketing through apps like Reddit can prove invaluable for customer acquisition.



Executive Summary


Reddit is a popular social media app with a video game community called “r/gaming” that has over 20 million subscribers. The main objective of this project was to assess how well a game’s popularity on r/gaming predicted that game’s sales revenue in North America. To do this, we fit a regression model that used several metrics, including metrics built from r/gaming data, to predict North American video game sales.


Web scraping was done with Python, while data analysis and regression were performed using R.


Web Scraping: The Pushshift API was used to scrape posts on r/gaming, identifying game discussions via post titles. Video game sales data is hard to obtain for gaming industry outsiders, so an existing dataset containing sales and other metrics was used. To fill in data and metrics that were missing from this dataset but needed for our analysis, we expanded it with information obtained through web APIs that scraped Wikipedia infoboxes and Google Search results. Various techniques were required to handle the inconsistent formatting of Wikipedia’s semi-structured infoboxes. Additionally, posts about games did not always contain the full name of the game in their titles, so heuristics were developed to identify games in these posts. Hash tables were used to optimize the scraping of information required for calculating r/gaming metrics.
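As a rough illustration of how a hash table can speed up matching post titles to games, here is a minimal Python sketch. It is not the project's actual code: the function names, the keyword lists, and the sample post titles are all made up for this example, and the n-gram lookup is only one possible heuristic. The idea is to store every keyword phrase in a dictionary once, so that each post title is checked with constant-time lookups instead of being compared against every game in the dataset.

```python
import re
from collections import Counter

def normalise(text):
    """Lowercase the text and keep only word characters, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def build_keyword_index(games):
    """Hash table mapping each normalised keyword phrase to its canonical game title."""
    index = {}
    for title, keywords in games.items():
        for phrase in [title, *keywords]:
            index[normalise(phrase)] = title
    return index

def games_mentioned(post_title, index, max_words=5):
    """Return the games whose keywords occur as word n-grams in a post title."""
    words = normalise(post_title).split()
    found = set()
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            phrase = " ".join(words[start:start + size])
            if phrase in index:  # constant-time dictionary lookup
                found.add(index[phrase])
    return found

# Hypothetical games and shortened keywords (assumptions, not the project's data)
games = {
    "The Witcher 3: Wild Hunt": ["witcher 3"],
    "Grand Theft Auto V": ["gta v", "gta 5"],
}
index = build_keyword_index(games)

# Made-up post titles standing in for scraped r/gaming posts
posts = [
    "Just finished Witcher 3 and I'm speechless",
    "GTA V still looks great in 2020",
    "My cat sat on my keyboard",
]
mention_counts = Counter(
    game for post in posts for game in games_mentioned(post, index)
)
```

With this layout, the cost of processing a post depends on the length of its title rather than on the number of games being tracked, which is the kind of saving a hash table is meant to provide.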


Data Exploration: We analyzed how r/gaming’s interests compared to sales in various categories, and used discrepancies in this analysis to discover errors in our dataset. The Reddit metrics and Sales had right-skewed distributions, so log transforms were taken to shift their distributions towards a Gaussian shape. A correlation matrix heatmap for numerical variables was created to examine variable interactions.
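To show why a log transform helps with right-skewed data, here is a small Python sketch. The project's analysis was done in R, so this is only an illustration: the mention counts are invented, and the use of log(1 + x) rather than a plain log is an assumption made here so that games with zero mentions remain defined.

```python
import math

def skewness(values):
    """Sample skewness: third central moment divided by variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Invented mention counts: most games get few mentions, one gets very many
mentions = [0, 1, 1, 2, 3, 5, 8, 13, 40, 250]

# log1p(x) = log(1 + x); an assumption here so that zero counts stay finite
log_mentions = [math.log1p(v) for v in mentions]

raw_skew = skewness(mentions)
log_skew = skewness(log_mentions)
print(f"skewness before: {raw_skew:.2f}, after log transform: {log_skew:.2f}")
```

The transformed values are still somewhat right-skewed, but far less so, because the log compresses the long upper tail much more than it compresses the small values.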


Model Fitting: The data was split for cross validation. Using LASSO, we selected a final model with a Residual Standard Error of 0.544 and an adjusted R^2 of 0.345; on the testing data, it had an MSE of 0.209. Let mentions_before be the popularity of a game 28 days before release. The final model was:

However, two of the four regression assumptions were violated. Thus, this model is susceptible to serious inaccuracies, and its predictions should be treated with caution.


Of all the variables in our final model, the r/gaming popularity metric had the highest predictive ability, indicating that it is a strong predictor of video game sales revenue in North America.
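For readers unfamiliar with how LASSO performs variable selection, the following Python sketch shows the general mechanism on synthetic data. It is not the model fitted in this project, which was fitted in R: the data is randomly generated, the names critic_score and noise_feature are hypothetical, and the penalty value is chosen only for illustration. The sketch uses plain coordinate descent with soft thresholding, which shrinks weak coefficients to exactly zero and so drops those variables from the model.

```python
import random

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def standardise(column):
    """Centre a column to mean 0 and scale it to standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]

def lasso_coordinate_descent(columns, y, lam, n_iter=100):
    """Minimise (1/2n) * sum of squared residuals + lam * sum of |beta_j|."""
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Correlation of feature j with the residual that excludes feature j
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            z = sum(c * c for c in col) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            residual = [r - c * delta for c, r in zip(col, residual)]
            beta[j] = new_beta
    return beta

# Synthetic data (an assumption): only the first feature truly drives y
random.seed(0)
n = 200
names = ["mentions_before", "critic_score", "noise_feature"]  # last two hypothetical
columns = [standardise([random.gauss(0, 1) for _ in range(n)]) for _ in names]
y = [x0 + random.gauss(0, 0.5) for x0 in columns[0]]
y_mean = sum(y) / n
y = [v - y_mean for v in y]  # centre the response so no intercept is needed

beta = lasso_coordinate_descent(columns, y, lam=0.3)
selected = [name for name, b in zip(names, beta) if b != 0.0]
```

In this toy setup, the features that have no real relationship with the response fall below the penalty threshold and are dropped, while the informative one is kept with a shrunken coefficient. In practice the penalty is usually chosen by cross-validation rather than fixed by hand.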


This report was divided into 2 parts. Click on a link below to go to that part:




The code for this project can be found on GitHub:

 

Table of Contents


0.1) Problem Description

0.2) Data Description and Assumptions

1) Data Cleaning and Web Scraping

  • 1.1 Improving Web Scraping by Post Title via Keyword Generation

  • 1.2 Web Scraping from Wikipedia using Google Search Results

  • 1.3 Cleaning Data before Scraping from Reddit

  • 1.4 Optimizing Web Scraping with Hash Tables

  • 1.5 Summary of Code Pipeline

2) Categorical Analysis of r/gaming’s Interests vs Sales


3) Data Exploration

  • 3.1 Cleaning Data and Identifying Outliers for Model Fitting

  • 3.2 Distributions and Transformations to Reduce Skewness

  • 3.3 Numerical Correlations and Categorical Associations

  • 3.4 Descriptive Statistics and Scatterplots for Selected Variables

4) Regression Results and Interpretation

  • 4.1 Variable Selection using LASSO

  • 4.2 Checking Regression Assumptions of the Final Model

  • 4.3 Summary of Important Final Model Metrics

  • 4.4 Interpretation of Final Model

5) Conclusion


