This post is Part 2 of 2 of “Regression Analysis on Video Game Sales using Unstructured Data from Reddit”. Readers should go through the Introduction and Part 1 of the series first to fully understand the project.
Table of Contents
3) Data Exploration
3.1 Cleaning Data and Identifying Outliers for Model Fitting
3.2 Distributions and Transformations to Reduce Skewness
3.3 Numerical Correlations and Categorical Associations
3.4 Descriptive Statistics and Scatterplots for Selected Variables
4) Regression Results and Interpretation
4.1 Variable Selection using LASSO
4.2 Checking Regression Assumptions of the Final Model
4.3 Summary of Important Final Model Metrics
4.4 Interpretation of Final Model
5) Conclusion
Recap of Part 1
In Part 1, we discussed how we expanded the video game dataset by scraping from Reddit and Wikipedia and cleaning our data. In Part 2, we will further explore that data and use it to fit a model. Let us review the variables we will use for our analysis:
- Dependent variable: North American Sales ($ in millions)
- Independent variables:
r/gaming metrics (each metric is between 0 and 1):
1. Mentions_b: Number of post titles mentioning the video game name in the 28 days before release (normalized by the total number of posts in that time period)
2. Mentions_a: Number of post titles mentioning the video game name in the 28 days after release (normalized by the total number of posts in that time period)
3. Upvotes_before: Number of upvotes on the ‘before’ posts, similarly normalized by the total number of posts
4. Upvotes_after: Number of upvotes on the ‘after’ posts, similarly normalized by the total number of posts
Non-Reddit metrics, Numerical (each metric is between 0 and 10, in increments of 0.1):
5. Critic Score
6. User Score
Non-Reddit metrics, Categorical:
7. Genre (12 levels)
8. Rating (6 levels)
9. Publisher (94 levels)
10. Platform (12 levels)
11. Release month (12 levels)
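As a rough illustration of how a metric like Mentions_b (item 1 above) could be computed, here is a minimal sketch. The function name, the case-insensitive substring matching, and the toy post titles are all assumptions made for this example; the actual scraping and matching logic is described in Part 1.

```python
from datetime import date, timedelta

def mentions_before(posts, game_name, release, window_days=28):
    """Share of post titles in the pre-release window that mention the game.

    posts is a list of (day, title) pairs. Matching is a simple
    case-insensitive substring test, which is an assumption of this sketch.
    """
    start = release - timedelta(days=window_days)
    in_window = [title for day, title in posts if start <= day < release]
    if not in_window:
        return 0.0
    hits = sum(game_name.lower() in title.lower() for title in in_window)
    return hits / len(in_window)

# Hypothetical posts for a made-up game called "Example Quest"
posts = [
    (date(2015, 5, 1), "Example Quest hype thread"),
    (date(2015, 5, 10), "My cat sitting on a console"),
    (date(2015, 5, 15), "Can't wait for example quest"),
    (date(2015, 5, 18), "Old school arcade find"),
    (date(2015, 3, 1), "Example Quest delayed"),  # outside the 28-day window
]
release = date(2015, 5, 19)
share = mentions_before(posts, "Example Quest", release)
```

Because the count is divided by the number of posts in the same window, the result always lies between 0 and 1.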
3) Data Exploration
3.1 Cleaning Data and Identifying Outliers for Model Fitting
We disregarded games that were not mentioned on Reddit. A large share of games were not discussed at all, and it would not be informative to model how Reddit metric values of 0 predict widely different Sales values. Thus, games with no mentions either before or after the release month were treated as outliers and removed.
We also removed samples with missing data, represented as NAs, in their numerical variables. First, among the variables we considered in our analysis, we checked which ones contained at least one NA. Next, we removed all samples containing NAs in those variables. Rating does not have NAs but represents some ratings as NaN. We did not remove samples with NaN ratings because some games are officially not rated, in which case NaN is their actual rating rather than missing data; using just the data we gathered, we cannot differentiate between games that are Not Rated and games whose rating is missing.
Finally, based on erroneous outliers discovered in Section 2) of this report, we identified any titles among the 407 remaining samples that contained “&” or “Game”. Only one game, “Back to the Future: The Game”, met this criterion, so that game was removed.
Table 10: The number of samples left in our data after removing outliers and missing data.
3.2 Distributions and Transformations to Reduce Skewness
Table 11: Skewness values for numerical variables before and after log() transforms
The log transform pulls the extreme values on the right tail towards the median, making the distribution closer to normal. The Reddit metrics were all right skewed (skewness > 1), so we will use their log transforms for model fitting. User_Score and Critic_Score were already left skewed and become more negatively skewed under the log transform. The right skewness of Sales is reduced, although Sales remains right skewed. Because the improvement is large, we will use log transformed Sales for model fitting.
Removing outliers after the log transform reduced the positive skewness of Sales further, but its skewness was still around 0.87, which is close to 1, so its distribution was still noticeably right skewed. During model fitting, we compared models fitted with and without these outliers, and found that removing them did not improve the model’s fit by much. Since removing Sales outliers only reduced its skewness from 1.24 to 0.87 while barely improving the model fit, we did not remove outliers just to decrease its positive skewness.
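To make the effect of the log transform on skewness concrete, here is a minimal sketch. It uses the moment-based (population) skewness formula, which may differ slightly from the estimator used for Table 11, and the data values are hypothetical, chosen only to have a long right tail.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed metric between 0 and 1 (not the project's data)
raw = [0.001, 0.002, 0.002, 0.003, 0.004, 0.006, 0.010, 0.050, 0.200]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)     # large positive value: long right tail
skew_log = skewness(logged)  # smaller positive value after log()
```

On this toy data the skewness stays positive after the transform but drops well below its original value, which mirrors what happened to Sales.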
Figure 5: Left Histograms: Reddit Metrics Before log() ; Right Histograms: After log()
Top Left: Mentions Before ; Top Right: Mentions After ;
Bottom Left: Upvotes Before ; Bottom Right: Upvotes After
The Reddit metric QQ-plots against the theoretical normal distribution also showed a heavy right skew before the log() transform, as the points fall below the straight line. The QQ-plots for the log transformed variables fall along a straight line, indicating that the log transformed variables have approximately normal distributions.
Figure 6: Left QQ-Plot: Reddit Metrics Before log() ; Right QQ-Plot: After log()
Top Left: Mentions Before ; Top Right: Mentions After ;
Bottom Left: Upvotes Before ; Bottom Right: Upvotes After
Critic Score and User Score were found to have negative skewness. To reduce it, we transformed them using a function we will call “Transformation UnN”. This function works as follows: for each sample, the new value is the maximum value of the variable minus that sample’s old value. This reflection turns the left skew into a right skew of the same size, which a log() transform can then reduce. Since log(0) is -inf and the sample holding the maximum value becomes 0, we add 1 to every sample before taking log(). The skewness values after the transformations on Critic Score and User Score were:
Critic Score Pre-Transformation: -0.88
Critic Score after Transformation UnN: 0.88
Log( Critic Score after Transformation UnN ): -1.37
User Score Pre-Transformation: -1.57
Log( User Score after Transformation UnN ): 0.24
This transformation reduced the negative skewness of User Score, but not that of Critic Score. Thus, we will only use this transformation on User Score.
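The reflect-then-log procedure described above can be sketched as follows. The function names and the score values are hypothetical; the skewness helper uses the moment-based formula, which may not be the exact estimator behind the numbers listed above.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def transformation_unn(values):
    """Reflect each value around the maximum: new = max - old."""
    top = max(values)
    return [top - v for v in values]

def log_plus_one(values):
    """Add 1 before log() so that the reflected maximum (now 0) is defined."""
    return [math.log(v + 1) for v in values]

# Hypothetical left-skewed scores on a 0 to 10 scale
scores = [3.1, 5.5, 6.8, 7.2, 7.9, 8.0, 8.3, 8.5, 8.8, 9.0]
reflected = transformation_unn(scores)
logged = log_plus_one(reflected)

skew_before = skewness(scores)        # negative
skew_reflected = skewness(reflected)  # same magnitude, opposite sign
skew_logged = skewness(logged)
```

Reflection alone only flips the sign of the skewness, which is why Critic Score goes from -0.88 to 0.88. Whether the subsequent log() helps depends on the shape of the reflected distribution, so it is worth checking each variable separately, as was done above.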
3.3 Numerical Correlations and Categorical Associations
Figure 7: Correlation Matrix Heatmap for the numerical variables after transforming them.
Since the four Reddit metrics are all highly correlated with one another, we decided to only pick one of the four for model fitting. We chose to use the variable with the highest correlation with sales. This variable was Mentions Before. Critic Score is also decently correlated with Sales, so it will also be selected for model fitting.
We cannot judge how much a variable will contribute to a multiple regression model from correlation values alone, as interaction effects may produce significant results. Still, since User Score has a very low correlation with Sales and a decently high negative correlation with Critic Score, we hypothesized that it will likely not be useful in our model. During the regression analysis, we fit first order models with and without User Score, and compared their fit.
In Figure 8, we also evaluated the correlation of variables at different stages of transformation. Since all the Reddit metrics were highly correlated with one another, the heatmaps comparing the correlations of their transformations were all very similar. Thus, we will only evaluate the heatmap for Mentions Before, which has the highest correlation with Sales.
Figure 8: Correlation Heatmap for Mentions Before and Sales at different stages of their transformations
The highest correlation between Mentions Before and Sales is at “normalized and log transformed Mentions Before” and “log transformed Sales”, which has a correlation of 0.56. This shows that the transformation improved correlation between the independent variable and Sales.
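To illustrate why a log transform can raise a Pearson correlation, here is a minimal sketch on synthetic data. The values are hypothetical and follow an exact power law, so the log-log relationship is perfectly linear; the real data is of course much noisier.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Synthetic example: sales grow with the square root of mentions
mentions = [0.001, 0.002, 0.005, 0.01, 0.05, 0.2]
sales = [10 * math.sqrt(m) for m in mentions]

corr_raw = pearson(mentions, sales)
corr_log = pearson([math.log(m) for m in mentions],
                   [math.log(s) for s in sales])
```

Pearson correlation only measures linear association, so a curved relationship is understated on the raw scale and better captured once both variables are logged.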
Associations between the Categorical Variables and the Dependent Variable
Since correlation measures how one numerical variable changes as another increases or decreases, we could not measure the correlation between our numerical dependent variable and the categorical variables. Instead, we measured the association by fitting a first order linear regression model of Sales on each categorical variable and analyzing each model’s adjusted R squared value. One interpretation of the square root of R^2 is that it is the multiple correlation coefficient, which is the correlation between the observed values of the dependent variable and the model’s predictions. The results are shown in Table 12.
Table 12: Correlation coefficients for first order models between each categorical variable and Sales
As these are fitted first order models and not correlations, the earlier caveat about not using correlations alone to assess a variable’s effect in model fitting does not apply here. Thus, we did not use categorical variables whose fitted model correlation coefficients were too low. The only categorical variable with a sufficient association was Publisher, so Publisher was the only categorical variable chosen for model fitting.
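For a single categorical predictor, a first order linear model simply predicts each level’s mean, so its R^2 can be computed without a regression library. The sketch below assumes that setup; the function name, the publisher labels, and the sales values are hypothetical.

```python
from collections import defaultdict
import math

def categorical_association(levels, y):
    """R^2 and adjusted R^2 of a linear model with one categorical predictor."""
    n = len(y)
    groups = defaultdict(list)
    for level, value in zip(levels, y):
        groups[level].append(value)
    grand_mean = sum(y) / n
    # Total variation around the grand mean
    sst = sum((v - grand_mean) ** 2 for v in y)
    # Residual variation around each level's own mean (the model's prediction)
    sse = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    r2 = 1 - sse / sst
    k = len(groups)  # number of fitted parameters, one mean per level
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj_r2

# Hypothetical log sales for two made-up publishers
publisher = ["A", "A", "A", "B", "B", "B"]
log_sales = [1.0, 1.2, 1.1, 2.0, 2.2, 2.1]
r2, adj_r2 = categorical_association(publisher, log_sales)
multiple_r = math.sqrt(max(adj_r2, 0.0))  # adjusted R^2 can be negative
```

The adjustment matters for a variable like Publisher with many levels, since plain R^2 rises with every extra level even when the levels carry no information.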
3.4 Descriptive Statistics and Scatterplots for Selected Variables
To reduce the number of variables to analyze, we will only consider the statistics of numerical variables to be used in the model.
Table 13: Summary Statistics for selected numerical variables
Due to log transforming the Reddit metric, which was previously between 0 and 1, all values in the Reddit metric are negative.
Figure 9: Scatterplots for all pairs of selected numerical variables
For the variables that were correlated with Sales, we fit a first order linear regression model between that variable and Sales, and plotted their best fit line alongside their scatter plot. These best fit lines clearly show positive correlation between the independent and dependent variables.
Figure 10: Scatterplots for correlated Independent Variables vs Sales.
Left: Mentions Before vs Sales ; Right: Critic Score vs Sales
These plots indicate that there may be a second order, polynomial relationship. However, we did not fit a polynomial model in this project.
Distribution Outliers for all 4 selected numerical variables were identified with boxplots.
Figure 11: Boxplots for selected numerical variables.
Table 14: Number of boxplot outliers for each selected numerical variable
As noted in Section 3.2, we found that removing outliers did not result in a better or significantly different model. Thus, we chose not to remove outliers to train the final model.
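Boxplot outliers are commonly defined as points more than 1.5 times the interquartile range beyond the quartiles. The sketch below assumes that rule and computes quartiles by linear interpolation; boxplot implementations differ slightly in how they compute quartiles, so counts may not match Table 14 exactly. The data values are hypothetical.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between the two nearest order statistics."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond whisker * IQR from the quartiles."""
    ordered = sorted(values)
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical values with one extreme point
values = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 2.5]
outliers = boxplot_outliers(values)
```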
4) Regression Results and Interpretation
4.1 Variable Selection using LASSO
Using a percentage split of 95/5, the data was randomly split into a training set of 386 samples and a test set of 19 samples. Because the split is random, the test set may contain samples with levels of Publisher that do not appear in the training set, and the fitted model has no coefficient for such levels. We identified these levels, found 1 sample with this problem, and removed it from the test set. Thus, as the test set had originally contained 21 samples, it was shrunk to 19 samples.
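The split and the unseen-level check can be sketched as follows. This is not the code used in the project: the function name, the seed, the rounding of the test set size, and the synthetic publisher labels are all assumptions for illustration.

```python
import random

def split_and_filter(samples, test_fraction=0.05, seed=0):
    """Random train/test split that drops test samples with unseen publishers."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # A model fitted on train has no coefficient for levels it never saw
    seen = {s["publisher"] for s in train}
    kept = [s for s in test if s["publisher"] in seen]
    return train, kept, len(test) - len(kept)

# 400 hypothetical samples: 7 common publishers plus one that appears once
samples = [{"publisher": "Pub" + str(i % 7)} for i in range(399)]
samples.append({"publisher": "OnlyOnce"})
train, test, dropped = split_and_filter(samples)
```

Publishers that appear only once in the data are the likely culprits: if such a game lands in the test set, its level cannot be present in the training set.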
We used LASSO for variable selection, as its penalty can shrink coefficients all the way to 0, which removes the corresponding variables from the model. The R package we used fits LASSO through a generalized linear model, and specifying a Gaussian dependent variable makes it fit a linear model. We did not use stepwise regression due to numerous controversies surrounding the technique, such as the criticism that its use of multiple hypothesis tests greatly increases the chance of false positives, and thus of accepting bad models.
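To show how LASSO can set a coefficient exactly to 0, here is a minimal coordinate descent sketch in Python. It is not the R package used in this project; it assumes numeric predictors that are standardized inside the function, a centered response, and the objective (1/(2n)) * RSS + lam * sum(|beta|). The data and the value of lam are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards 0 by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def standardize(column):
    """Center a column and scale it to unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]

def lasso(columns, y, lam, sweeps=200):
    """Coordinate descent for LASSO on standardized predictors."""
    n = len(y)
    xs = [standardize(c) for c in columns]
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    beta = [0.0] * len(xs)
    for _ in range(sweeps):
        for j, xj in enumerate(xs):
            # Residual of the model that leaves predictor j out
            partial = [
                yc[i] - sum(beta[k] * xs[k][i] for k in range(len(xs)) if k != j)
                for i in range(n)
            ]
            rho = sum(xj[i] * partial[i] for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam)
    return beta

# Hypothetical data: x1 drives y, x2 is unrelated noise
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, -1, 1, 1, -1, -1, 1]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.1]
beta = lasso([x1, x2], y, lam=0.1)
```

The soft-thresholding step is what distinguishes LASSO from ridge regression: a predictor whose association with the partial residual is weaker than lam gets a coefficient of exactly 0, while the others are shrunk by lam.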
The model we initially fit using LASSO was formalized as:
In Table 15, we observed the coefficients obtained using LASSO, and saw that it had eliminated Publisher. Comparing the coefficients obtained with and without LASSO, we noticed that, except for the coefficient for Publisher, they are quite similar.
Table 15: Beta coefficients of model w/o LASSO vs model w/ LASSO
Thus, we drop Publisher, and end up with the final model of:
Table 16: Coefficients of Final Model
Table 17: LASSO vs FINAL MODEL Test Metrics
Table 18: Summary of Final Model Results
10 out of 19 predicted test points, their prediction intervals, and the actual values are given in Table 19. We used prediction intervals instead of confidence intervals because we are obtaining intervals around individual data points.
Table 19: Actual vs Predicted values, with prediction intervals, for 10 out of 20 test samples
While the prediction intervals are very wide, most of the predicted values, save for the ones whose actual values are in the vicinity of “2”, appear to be close to the actual values.
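The difference between the two kinds of intervals can be seen in the standard ordinary least squares formulas, written here under the usual assumption of normally distributed errors. The notation is introduced for this illustration only: X is the training design matrix including the intercept column, x_0 is the predictor vector of a new game, s is the residual standard error, n is the number of training samples, and p is the number of estimated coefficients.

```latex
\[
\text{Confidence interval for the mean response: }\quad
\hat{y}_0 \pm t_{1-\alpha/2,\,n-p}\; s\,\sqrt{x_0^{\top}(X^{\top}X)^{-1}x_0}
\]
\[
\text{Prediction interval for a single new observation: }\quad
\hat{y}_0 \pm t_{1-\alpha/2,\,n-p}\; s\,\sqrt{1 + x_0^{\top}(X^{\top}X)^{-1}x_0}
\]
```

The extra 1 under the square root accounts for the noise of an individual observation, so a prediction interval is always wider than the matching confidence interval and does not shrink to zero width as the training set grows.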
4.2 Checking Regression Assumptions of the Final Model
Figure 12: Checking Assumptions of Final Model
Let Residuals = (predicted values – actual values). We will now check 4 assumptions of linear regression; if any of these are violated, then the model’s predictions and insights may be erroneous, biased, or misleading.
ASSUMPTION 1: No relationship between Residuals vs Fitted
This assumption is satisfied if the fitted line is horizontal. The top-left plot shows the fitted line in blue is slightly curved, indicating that Assumption 1 has been violated.
ASSUMPTION 2: Residual Errors should be normally distributed
Most residuals are on the straight dashed line on the QQ-plot, so Assumption 2 is satisfied.
ASSUMPTION 3: Check for homoskedasticity in Predicted Values vs Residual Errors
The blue fitted line in the bottom-left plot should be close to the dashed horizontal line to satisfy Assumption 3. But the blue fitted line increases as the predicted Sales values increase, indicating the presence of heteroskedasticity. Thus, Assumption 3 is violated.
ASSUMPTION 4: Independence of errors
Let e(t) denote the residuals. The least squares regression line for e(t) vs e(t-1) is E(e(t)) = β0 + β1*e(t-1). The test hypothesis for checking residual independence is:
H0: β1 = 0 H1: Not H0
The Durbin-Watson test gives a Durbin-Watson statistic of 2.0516 with a p-value of 0.6957, which is much greater than α = 0.05. Thus, we fail to reject the Null Hypothesis H0. Additionally, the statistic is close to 2, indicating that there is very little autocorrelation between the residual errors. There is not enough evidence against residual independence, so Assumption 4 is satisfied.
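For reference, the Durbin-Watson statistic itself is simple to compute from the residuals. The sketch below covers only the statistic, not the p-value, and the residual values are hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals in the order the samples were fitted
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.2]
dw = durbin_watson(residuals)
```

The statistic always lies between 0 and 4. Values near 2 suggest little first order autocorrelation, values towards 0 suggest positive autocorrelation, and values towards 4 suggest negative autocorrelation.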
Due to violations of two important assumptions (linearity and homoskedasticity), the fitted model is susceptible to serious inaccuracies. Based on the scatterplots, we hypothesize that a second order model may overcome these violations. However, we did not fit one for this project.
4.3 Summary of Important Final Model Metrics
We used a boxplot of Sales to remove distribution outliers after transforming variables. Table 20 compares the results of keeping vs removing these outliers. Removing outliers does not change the model’s metrics by much; thus, we did not remove outliers to train the final model.
Table 20: Comparing model results of keeping vs removing Sales Boxplot Outliers
T. stands for “Test”, while Asm stands for “Assumption”. The yes/no in the Asm columns indicates whether that assumption is satisfied.
4.4 Interpretation of Final Model
Despite violating 2 regression assumptions, the model was somewhat adept at predicting Sales values with decently low MSE and decently high R^2. Thus, we chose to interpret the model, whilst acknowledging its potential faults and inaccuracies. Other inaccuracies may include errors in the Kaggle dataset.
To compare variable coefficients, we needed to standardize them, as their units varied. Table 21 shows that the standardized coefficient of “Mentions Before”, which estimated the popularity of a game 28 days before release, had the highest value. Thus, we concluded that it was the most important predictor in our model. The hypothesis that this metric is predictive of a game’s sales in North America appeared to be supported by this model.
User Scores had the least predictive ability, less so than Critic Scores; this may indicate that consumers listen more to professional reviews than audience reviews when deciding if they want to purchase a product or not.
Table 21: Standardized Coefficients for the Final Model’s predictors
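One common way to standardize a regression coefficient is to multiply it by the ratio of the predictor’s standard deviation to the response’s standard deviation. The sketch below assumes that definition and uses sample standard deviations; the function name and data are hypothetical, and the project’s own standardization may have been computed differently.

```python
import statistics

def standardized_coefficient(beta, x, y):
    """Rescale a raw coefficient to standard deviation units: beta * sd(x) / sd(y)."""
    return beta * statistics.stdev(x) / statistics.stdev(y)

# Hypothetical predictor and response with an exact slope of 2
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
std_beta = standardized_coefficient(2.0, x, y)

# Changing the predictor's units changes the raw slope, not the standardized one
x_rescaled = [100 * v for v in x]
std_beta_rescaled = standardized_coefficient(0.02, x_rescaled, y)
```

A standardized coefficient reads as the change in the response, in standard deviations, per one standard deviation change in the predictor, which is what makes predictors measured in different units comparable.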
5) Conclusion
The main objective of the project was to assess how well a game’s popularity on r/gaming predicts that game’s sales revenue in North America. This objective was met: the model suggests that a game’s popularity on r/gaming before release correlates with and helps predict that game’s North American Sales. Moreover, out of the three variables in our final model, the r/gaming popularity metric had the highest predictive ability, indicating that it is a comparatively strong predictor of video game sales revenue in North America.
Using LASSO, we selected a final model with a Residual Standard Error of 0.544 and an adjusted R^2 of 0.345. The final model was:
However, 2 out of 4 regression assumptions were violated. Thus, the fitted model is susceptible to serious inaccuracies, and its predictions should be taken with caution.
The following are potential metrics for further investigation:
Performing a Sentiment Analysis on posts to obtain metrics on post positivity/negativity
Comparing Reddit metrics with Google trend data
If a post’s comments contain at least X mentions of a game, where X is some threshold, use them to identify posts that discuss the game without containing its name in the title. Give more weight to comments with more upvotes.
If a post has a certain number of upvotes, use a binary metric to mark it as hot or not. The number of hot posts differs from the number of posts.
Train a classifier to identify which game a post’s image belongs to. Obtain training images from a Google Images search for the game name, and also look at game-specific subreddits for image data.