Regression Analysis on Video Game Sales using Unstructured Data from Reddit

Michael Lan
Jan 4, 2020
Updated: Feb 2, 2020

Non-Technical Explanation

Reddit is a popular social media app with a video game community called “r/gaming” that has over 20 million subscribers. The main objective of this project was to assess whether a game’s popularity on r/gaming was good at predicting that game’s sales revenue in North America. Our analysis found that the r/gaming popularity metric we developed was the strongest predictor of video game sales revenue in North America out of all the metrics we looked at.

The results of this analysis can help determine how much a gaming company should invest in marketing on Reddit. Social media is a powerful tool for enhancing customer engagement, so marketing through apps like Reddit can prove invaluable to customer acquisition.

Executive Summary

Reddit is a popular social media app with a video game community called “r/gaming” that has over 20 million subscribers. The main objective of this project was to assess how well a game’s popularity on r/gaming predicted that game’s sales revenue in North America. We investigated this by fitting a regression model, using metrics that included r/gaming data, to predict North American video game sales.

Web scraping was done with Python, while data analysis and regression were performed using R.

Web Scraping: The Pushshift API was used to scrape posts on r/gaming, identifying game discussions via post titles. Video game sales data is hard to obtain for gaming industry outsiders, so an existing dataset containing sales and other metrics was used. To fill in data and metrics that were missing from this dataset but needed for our analysis, we expanded it with information obtained through web APIs that scraped Wikipedia infoboxes and Google Search results. Various techniques were required to handle the inconsistent formatting of Wikipedia’s semi-structured infoboxes. Additionally, posts about games did not always contain the full name of the game in their titles, so heuristics were developed to identify games in these posts. Hash tables were used to speed up the scraping of information required for calculating r/gaming metrics.
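To make the title-matching idea concrete, here is a minimal Python sketch of how a hash table (a Python dict) could map short names to full game titles and count mentions in post titles. It is an illustration under assumptions, not the project's actual code: the function names, the alias lists, and the sample post titles are all made up, and it works on an in-memory list of titles rather than calling the Pushshift API.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()

def build_alias_table(games):
    """Hash table mapping each alias (as a tuple of tokens) to its canonical game name."""
    table = {}
    for canonical, aliases in games.items():
        for alias in [canonical, *aliases]:
            table[tuple(normalize(alias))] = canonical
    return table

def games_in_title(title, table, max_len=6):
    """Return the games whose alias appears as a run of consecutive tokens in the title."""
    tokens = normalize(title)
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            # Each n-gram lookup is an average O(1) dict access.
            game = table.get(tuple(tokens[i:i + n]))
            if game is not None:
                found.add(game)
    return found

def count_mentions(titles, table):
    """Count how many post titles mention each game (at most once per title)."""
    counts = Counter()
    for title in titles:
        counts.update(games_in_title(title, table))
    return counts

# Hypothetical games, aliases, and post titles, for illustration only.
GAMES = {
    "The Witcher 3: Wild Hunt": ["Witcher 3"],
    "Red Dead Redemption 2": ["RDR2", "Red Dead 2"],
}
TITLES = [
    "Just finished Witcher 3, what a game",
    "RDR2 looks incredible at sunset",
    "My cat sat on my keyboard",
    "Witcher 3 vs Red Dead 2?",
]

alias_table = build_alias_table(GAMES)
mention_counts = count_mentions(TITLES, alias_table)
print(mention_counts)
```

The dict lookup is what keeps this fast: each title is checked against all aliases at once instead of being compared to every game name in turn. A real heuristic would also need to handle ambiguous short names, which this sketch ignores.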

Data Exploration: We analyzed how r/gaming’s interests compared to sales in various categories, and used discrepancies in this analysis to discover errors in our dataset. The Reddit metrics and sales had right-skewed distributions, so log transforms were applied to bring their distributions closer to a Gaussian shape. A correlation matrix heatmap for numerical variables was created to examine relationships between variables.
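The effect of a log transform on right-skewed data can be sketched in a few lines of Python. The numbers below are invented for illustration and are not from the project's dataset. The use of log(1 + x) for mention counts, so that games with zero mentions remain defined, is an assumption; the report does not say how zeros were handled.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Made-up, right-skewed samples: a few large values dominate.
sales = [0.05, 0.08, 0.1, 0.12, 0.2, 0.3, 0.45, 0.9, 2.5, 9.0]
mentions = [0, 1, 1, 2, 3, 5, 8, 20, 75, 400]

log_sales = [math.log(v) for v in sales]           # sales are strictly positive
log_mentions = [math.log1p(v) for v in mentions]   # log(1 + x) tolerates zero counts

for name, raw, logged in [("sales", sales, log_sales),
                          ("mentions", mentions, log_mentions)]:
    print(f"{name}: skew before = {skewness(raw):.2f}, after = {skewness(logged):.2f}")
```

In both samples the skewness drops substantially after the transform, which is the sense in which the distributions move toward a Gaussian shape.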

Model Fitting: The data was split for cross validation. Using LASSO, we selected a final model with a Residual Standard Error of 0.544 and an adjusted R^2 of 0.345; on the testing data, it had an MSE of 0.209. Let mentions_before be the popularity of a game 28 days before release. The final model was:

However, 2 out of 4 regression assumptions were violated. Thus, this model is susceptible to serious inaccuracies, and its predictions should be treated with caution.

Of all the variables in our final model, the r/gaming popularity metric had the highest predictive ability, indicating that it is a strong predictor of video game sales revenue in North America.
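As a rough illustration of how LASSO selects variables in the Model Fitting step, the Python sketch below fits a LASSO regression by coordinate descent on synthetic data and reports the mean squared error on a held-out test split. It is not the project's model, which was fit in R: the data, the penalty value `lam`, the 160/40 split, and all function names are assumptions made for this example, and the intercept is omitted because the synthetic data is generated around zero.

```python
import random

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, starting from beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def predict(x, beta):
    return sum(b * v for b, v in zip(beta, x))

# Synthetic data: only the first of three predictors actually drives y.
rng = random.Random(0)
TRUE_BETA = [1.0, 0.0, 0.0]
rows = []
for _ in range(200):
    x = [rng.gauss(0, 1) for _ in TRUE_BETA]
    y = predict(x, TRUE_BETA) + rng.gauss(0, 0.5)
    rows.append((x, y))

train, test = rows[:160], rows[160:]  # simple hold-out split
X_train = [x for x, _ in train]
y_train = [y for _, y in train]

beta = lasso_cd(X_train, y_train, lam=0.2)
test_mse = sum((y - predict(x, beta)) ** 2 for x, y in test) / len(test)

print("coefficients:", [round(b, 3) for b in beta])
print("test MSE:", round(test_mse, 3))
```

The L1 penalty sets the coefficients of uninformative predictors to exactly zero, which is why LASSO doubles as a variable selection method. In practice the penalty strength would be chosen by cross-validation rather than fixed by hand as it is here.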

This report was divided into 2 parts. Click on a link below to go to that part:

Part 1: Web Scraping and Community Analysis

Part 2: Data Exploration and Model Fitting

The code for this project can be found on GitHub:

https://github.com/wlg1/video-game-sales-regression