Part 1, Regression Analysis on Video Game Sales: Web Scraping and Community Analysis

Michael Lan
Jan 2, 2020
16 min read

Updated: Feb 2, 2020

This post is Part 1 of 2 of “Regression Analysis on Video Game Sales using Unstructured Data from Reddit”. The reader should read through the Introduction first to fully understand the project. Click on a link below to go to a part:

Introduction

Part 2: Data Exploration and Model Fitting

0.1) Problem Description

0.2) Data Description and Assumptions

1) Data Cleaning and Web Scraping

1.1 Improving Web Scraping by Post Title via Keyword Generation
1.2 Web Scraping from Wikipedia using Google Search Results
1.3 Cleaning Data before Scraping from Reddit
1.4 Optimizing Web Scraping with Hash Tables
1.5 Summary of Code Pipeline

2) Categorical Analysis of r/gaming’s Interests vs Sales

0.1) Problem Description

Reddit is a popular Western social media app with a large population of gamers, as one of its most populous forums, r/gaming, has over 20 million subscribers. Thus, assuming that a popular game should sell more, and that the popularity of a game on r/gaming should reflect the game’s popularity in overall Western gaming populations, one can postulate that a game’s popularity on r/gaming is a capable predictor of how many copies the game will sell. But just how accurate is this statement?

To help answer this question, a regression analysis was performed by fitting North American video games sales with metrics measured on data scrapped from r/gaming. The results of this analysis can help determine how much a gaming company should invest in marketing on Reddit. Social media is a powerful tool for enhancing customer engagement, and posting about a game means the product is on the forefront of Reddit user’s minds. These posts also help spread word about the game to other users.

Coding and analysis took took approximately 3 weeks to complete, and the report took a week to draft and revise.

0.2) Data Description and Assumptions

For those outside of the industry, video games sales data is hard to obtain. Luckily, sales data for over 10000 games from the 1980s to 2016 was already compiled into a dataset hosted on Kaggle. The project this dataset was originally compiled for did not perform a linear regression analysis on it, and instead performed classification using logistic regression. A link to that project is given here:

https://www.kaggle.com/ignacioch/predicting-vg-hits-1-million-sales-with-lr-rfc/data

However, this dataset did not contain a lot of information needed for our analysis, so more data had to be scrapped and added utilizing APIs that scoured through Wikipedia infoboxes and Google search results. Afterwards, cleaning was done on this expanded dataset.

As shown by the link below, nearly 50% Reddit’s visits are from the United States, and the country with the second most visits is the United Kingdom at 7.79%. Adding up the visits from UK, Canada, Australia and Germany gives a sum of around 23%. Thus, Reddit and subsequently r/gaming’s demographics are overwhelmingly North American. Thus, we will only consider North American Sales.

https://www.statista.com/statistics/325144/reddit-global-active-user-distribution/

A game’s popularity was measured as the number of posts that mentioned a game during a specified time interval. We looked popularity both around a month before release, and data around a month after release. “Before Release” shows how engaged consumers are in the product before they play the game, and “After Release” shows how engaged consumers are after they hear reviews of the gameplay. The time period used was 28 days instead of exactly one month to make all games have equal time intervals, as the number of days in each month varies. The number of mentions is affected by the growing Reddit population, as shown in the link below:

https://redditmetrics.com/r/gaming

According to redditmetrics.com, in 2013 r/gaming had around 2 million subscribers; in 2016, it had around 9 million subscribers. Figure 1 shows the number of posts made on r/gaming each month in 2010 compared to those made in 2016; the two barplots have very different patterns. Figure 2 illustrates just how the activity on r/gaming radically changed from 2010 to 2016. Thus, the number of mentions in the 28 day interval was normalized by the total number of posts made on r/gaming during the 28 day interval.

This study dealt with unstructured data because several metrics were obtained by identifying r/gaming posts based on keywords in unstructured online text. It also dealt with semi-structured data because Wikipedia infoboxes have no formal schema and are prone to inconsistent labeling issues, such as how they label release dates.

Figure 1: Comparing monthly r/gaming activity in 2010 vs in 2016

Figure 2: r/gaming Post Activity from 2010 - 2016. Each point is # of posts in each (month, year) combination.

- Dependent variables: North American Sales ($ in millions)

- Independent variables:

r/gaming metrics (each metric is between 0 and 1):

1. Mentions_b: Number of post titles mentioning video game name 28 days before release (Normalized by total number of posts in that time period)

2. Mentions_a: Number post titles mentioning video game name 28 days after release (Normalized by total number of posts in that time period)

3. Upvotes_before: Number of Upvotes of ‘before’ posts, similarly normalized by total number of posts

4. Upvotes_after: Number of Upvotes of ‘after posts, similarly normalized by total number of posts

Non-Reddit metrics, Numerical (each metric is between 0 to 10, spaced by intervals of 0.1):

5. Critic Score

6. User Score

Non-Reddit metrics, Categorical:

7. Genre (12 levels)

8. Rating (6 levels)

9. Publisher (94 levels)

10. Platform (12 levels)

11. Release month (12 levels)

We did not consider the number of comments mentioning a game, as comments that refer to the game usually discuss the post topic, which is already about the game, so it is redundant. We assume that threads with lots of upvotes also have lots of comments, so we did not consider the number of comments in each post as a factor.

When interpreting the data, we employed the following assumptions:

Assume that the resurgence of video games is not common; most games sell the most when they’re first released. Some games have downloadable content and expansion packs to draw in new customers, but we assume that a game’s popularity depends on the core fanbase it obtains early on.
Assume that giving a game its ‘legs’ requires it to be talked about during the month after release; if it’s not talked about, it requires an event to trigger resurgence over its popularity, which is not assumed to be common.
Assume sequels factor into consumer engagement; discussion of previous games counts as consumer engagement for sequels.
Assume sequels are not released or re-released at same time.
Assume holidays don’t affect posting, as they do not affect product engagement but instead increase consumer spending on all products in general. Thus, people will not be more engaged in a game simply due to holiday seasons, and will not post more about the game during holidays.

1) Data Cleaning and Web Scraping

Web scraping and data cleaning was done with Python. Reddit posts were scraped using the Pushshift API, which identifies mentions via case-insensitivie post titles. The Kaggle dataset did not contain the month and day of release. Thus, this data needed to be scrapped from Wikipedia using the Wikipedia python API.

One issue with this approach was that game titles obtained from the Kaggle dataset did not always result in a Wikipedia page when queried from the API, and if no page is found, then the API did not return the next best search result. For instance, “Pokemon Sun/Moon” does not return a Wikipedia page via the API. To solve this problem, game names that did not return a Wikipedia page were used as queries in Google using the Serpapi API, searching only on “site:wikipedia.org”. If the first Wikipedia page’s infobox followed recognizable “video game structure”- for instance, if it contained platform names or release dates- then data was scrapped from that infobox.

1.1 Improving Web Scraping by Post Title via Keyword Generation

On r/gaming, games are not always referred to by the name they are given on the Kaggle dataset. Thus, to detect if a game was mentioned on Reddit by an nickname, a column for nicknames called “keywords” was added to the dataset. For instance, the game “Pokemon Sun/Moon” would rarely be referred to on Reddit by its full name, and would usually be referred to by the keyword “Pokemon”. Therefore, as the Pushshift API identifies game mentions using keywords in a post title, our scraping procedure was developed to look not for just a game’s title given by the Kaggle dataset, but for all of the game’s “nicknames”. These nickname keywords were generated by a script before scrapping from Reddit.

Various methods were used to obtain nicknames. If a game name contained “:”, then it likely used “:” as a separator between the title and the subtitle. Since people may refer to a game only by its title or subtitle, the title and subtitle for games with “:” were each added as separate keywords.

Games also have nicknames that are hard to extract from a title. For instance, the popular franchise “Call of Duty” is often referred to by its nickname “CoD”, whereas most games do not go by such abbreviations as they are too general and can cause confusion. Since there were thousands of games, it was not feasible to manually record nicknames for all games. Thus, manually recorded nicknames were only considered for popular franchises. These nicknames were manually written in a csv file called “added_keywords.csv”. Then, a script parsed through the dataset to see if a game contained that popular franchise’s name; if it did, all of that franchise’s nicknames were added to that game’s list of nicknames. To find popular franchises, this list on Wikipedia was consulted:

https://en.Wikipedia.org/wiki/List_of_best-selling_video_game_franchises

Only franchises that sold at least 20 million copies were deemed “popular”. However, the franchise “LEGO” was not included as there are too many LEGO games that were released around the same time and were are not releated to one another (Eg. LEGO Star Wars and LEGO Batman). We assumed most sequels won’t be released around the same time, so when using the game name and nicknames as keyword queries on Reddit posts, the numbers, either Arabic or Roman numerals, were removed. Symbols such as (, ), and others were removed, too. To prevent overcounting a game’s mention, each mention was only counted if its post ID had not been recorded so far for that specific game. For each new game, this post ID tracking list resets.

Some games were referred to by a single word within the title; for instance, “Gears of War” is often called “Gears”. However, splitting a title into single words and using each one as a keyword would be problematic, as without contextual understanding, these keywords would often refer to other things. For instance, “Dark” is too common of a word. Thus, keywords were not obtained by splitting a title into single words.

1.2 Web scraping from Wikipedia using Google Search Results

We only wanted the North American release date. In the Wikipedia infobox, the NA (North America) release date is usually prefaced by NA. If NA is not in the infobox, but EU or JP (Europe or Japan) is, then the NA release date may be prefaced by WW (Worldwide). If JP is there, but NA is not, then we did not use that game in our analysis, as it was likely not released outside of Japan. If none of these symbols prefaced the release date, then we simply used the release date in the infobox, as this meant that the game was released on the same date worldwide. In rare cases, the game release date was not found on Wikipedia, or the game was not even on Wikipedia, so those games were excluded from our analysis.
Many game titles were repeated in the Kaggle dataset because each game released on a separate platform was treated as a separate sample by the original dataset creator. Thus, the release date for that specific platform was obtained from Wikipedia. Several Wikipedia pages which listed the release dates for both region and platform were studied to determine how to parse infoboxes to accurately obtain release dates.
In some cases, both title and release date for multiple observations were identical; this meant the game was released simultaneously on more than one platform. In our case, we treated them as a single observation, as we wanted to find how mentions before and after the release date correlated with sales. Thus, before scrapping Reddit posts, games with the same release date were combined into one row. The sales were summed, and the averages of scores were taken. This also saved time from scrapping redundant games.
Release dates on Wikipedia must match the game’s ‘Year of Release’ stated in the Kaggle dataset.
In a few cases, the wiki API did not always get all the correct infobox data from Wikipedia. For example, for the game Naruto Shippuden: Ultimate Ninja Storm 4, any visitor to the page can see that it contains the Windows release date, but the API fails to retrieve this information. In another case, the release date for Gears of War 4 on Wikipedia differs from the one retrieved by the API by 4 days. The infoboxes also have inconsistent platform and region formatting; for instance, the schema for indicating platform and region on the Overwatch page differs from that on the Diablo III page. We adapted our code to deal with these inconsistencies.

1.3 Cleaning data Before Scraping from Reddit

Only games released after 2010 were considered, as 2010 was when Reddit started to be considered mainstream by the public.
Only games with > 0 sales in North America were considered.
When labeling release dates on different platforms, Wikipedia infoboxes occassionally used different labels than the Kaggle dataset, using variations of the platform name; for instance, the platform PS4 in the Kaggle dataset was labeled as PS4 or PlayStation 4. Thus, all variation names for that platform had to be considered. Most games released on OSX (Mac) are released on Windows, so any game with the platform “PC” in the Kaggle dataset was re-named with the platform “Windows”.
We deleted games released in December 2016 because the data ends on 12/22, which meant these games had less time to sell than other games; their sales data would not be comparable to the other games.

1.4 Optimizing Web Scraping with Hash Tables

After slicing the dataset to obtain a subset suited for our analysis, including combining games with the same name and release date, we had a dataset of 1001 games. Each metric needed to be normalized by the number of r/gaming posts made in a 28 day interval. However, scrapping all the posts made during a day is a costly operation. Getting the total number of posts for 1000 games means scrapping 28*1000 = 28000 days. Alternatively, we can scrap 2555 days from 2010-2016 to store the number of posts made on each day in a lookup table, and to obtain the total number of posts made a month before a game’s release date, we only need to add up 28 values from the lookup table. Below is an example of this lookup table, using artificial data:

Table 1: Lookup table for the number of r/gaming posts made on each day from 2010 – 2016

We discovered that adding up values was far faster than scraping web data. Scraping one day took between 1 to 10 seconds, whereas retrieving and adding up 28 values from a pre-made lookup table implemented as a Hash Table (a dictionary in Python) took approximately 0.25 seconds. The pre-made Hash Table only required scraping 2555 days.

NO HASH TABLE: Scrap 28000 days: 28000 * 3 = 84000 seconds

HASH TABLE: Scrap 2555 days + 84000 operations: 2555 * 3 + 28000/4 = 14665 seconds

It took around 2 hours to scrap 2555 dates. Then, it took approximately 30-40 minutes to normalized the metrics for 1001 games. So web scrapping Reddit took 2.6 hours.

In contrast, if we did not used this method to reduce scrapping time, we would have had to scrap 28000/2555 ~= 11 times longer, which would mean around 20 hours of scrapping Reddit.

1.5 Summary of CODE PIPELINE

To ensure that data scrapped and wrangled would not be lost while the code was running, different stages of the data were outputted to Excel as .csv files.

Figure 3: Flowchart of Python files and CSV inputs/outputs

An explanation of each file is given below:

Video_Games_Sales.csv: Original Kaggle Data
get_keywords.py: Adds keyword col used for querying Reddit posts.
get_release_dates.py: Fills in missing release dates using Wikipedia infobox data.
combine_games.py: Combines games with the same name and release date into one row.
day_to_count.py: Creates Hash Table used for faster normalization of Reddit metrics.
get_reddit_data.py: Scraps Reddit posts to obtain game mention and upvote metrics. Post data was also collected. This was skimmed to ensure that most post titles did discuss the game they were associated with.
split_by_platform.py: Undos combine_games after scrapping Reddit posts. Splits games on mulitple platforms into multiple rows, one platform per row. Its output was only used for categorical analysis, not for model fitting.

Data analysis and regression were performed using R. All code was uploaded to Github.

2) Categorical Analysis of r/gaming Interests vs Sales

We assumed that r/gaming’s interests reflected the interests of the Western gaming community. We aimed to see how r/gamings’ interests compared to the sales in certain categories. Each level in every categorical variable was deemed a “category”. For instance, “Action” in the variable “Genre” was a category. We explored which categories were most popular on r/gaming, where popularity was measured summing up by the normalized number of mentions for 28 days before plus 28 days after release date. It was found that the normalized rankings differed from non-normalized rankings.

Since rankings could be due to certain categories simply having more games instead of being more popular on r/gaming, we found the number of games in each category and compared this to the categories popular on r/gaming. We expect that the more games there are in a category, the more that category will be discussed on Reddit. Thus, we will look for abnormalities in which a category’s frequency rank differs from its popularity rank. This will tell us how Reddit’s discussion of that category is influenced by more than just the number of games in that category.

Note that a category’s popularity could be influenced by popular games that just happened to be that category. For instance, “Skyrim” is a very popular Role Playing game that is discussed frequently due to its replayability; this outlier could greatly inflate the popularity of Role Playing games . The list of games in the Kaggle dataset may also not be comprehensive.

We did not remove outliers with no mentions for this analysis as all games are required to accurately count the number of games released in each category, and there were too many games with no mentions. Additionally, though the original dataset contained the Developer for each game, the Developer variable was not analyzed due to having too many missing values of NaN compared to the number of unique developers.

This analysis was also useful for identifying a few erroneous outliers by finding ranking discrepencies contrary to domain knowledge about gaming communities, as discussed in the section about ranking by Rating.

The following tables all compare the number of games in each category VS the games’ r/gaming popularity in each category VS the total sales in each category:

Table 2: Rankings for “Rating”

We did not remove the NaNs because Not Rated is an actual Rating that some games carry. Other than the rank for NaNs, the ranking of Rating categories by frequency and their ranking by r/gaming popularity was identical. Thus, we will look into the Total Mentions for the NR games.

Table 3: Ranking NR games by their total mentions

“Game & Wario” and “Back to the Future” are ranked highest in popularity, even though knowledge about the gaming community indicates that it should not be more popular than games such as Skyrim. This is because when we split keywords, we split subtitles by ‘&’ and ‘:’. This resulted in “Game” being a keyword in “Game & Wario” and “Back to the Future”, so every time “Game” appeared in a post, it counted as a mention for those games. Thus, those 2 games are erroneous outliers.

We will remove all games that contain ‘&’ or ‘Game’ and redo the analysis. Additionally, when we perform regression and after removing outliers with NA critic or user scores, we will check if any games contain “&” or “Game” in them and judge if they should be removed. Identifying these outliers Requires domain knowledge about what video games should and shouldn’t be popular. After removing these samples, our dataset went from 1001 to 958 games.

Using the Wikipedia API, we could have also filled in the Ratings missing data for games that were clearly rated, such as Skyim. However, we did not consider this until after we finished the regression analysis, and from regression, we found that Rating was not an important variable. Thus, this process was not done.

After dropping the erroneous outliers, we re-did the analysis and found NR’s popularity rank to decrease to the same ranks as Game Frequency and Total Sales. Games rated E were not as popular, despite selling a lot. This implies that r/gaming’s userbase is older, and thus is not a great representation of the younger gaming community.

Table 4: Rankings for “Rating” after removing erroneous outliers.

Table 5: Rankings for “Genre”

Despite many games being sold in the Sports genre, sports games were not often discussed on r/gaming compared to other games. Sports was the 3rd most frequent and 3rd highest sold genre, but was only the 6th most discusssed genre on r/gaming. Instead, Role Playing games, while being the 5th most sold genre, was the 3rd most discussed genre on r/gaming.

Table 6: Rankings for “Publisher”. Since there were 94 levels for the categorical variable “Publisher”, we only show the top 15 levels.

Game popularity via publisher differed greatly from how many games were developed by each publisher. Bethesda Softworks and Square Enix do not make that many games, yet they were the 3rd and 8th most popular publisher on r/gaming, respectively. As these publishers focus on developing Role Playing games, this discrepency further supports the popularity of Role Playing games on r/gaming. But the most surprising result was the popularity of companies Valve that were not in the top 20 highest selling publishers. Still, using domain knowledge that despite not releasing many games during 2010-2016, Valve created high selling games such as Portal 2, which was the second best-selling game in the US in April 2011, we may attribute this discrepency due to other publishers releasing far more games than Valve, meaning that their total Sales would surpass Valve even though Valve had a few best selling games. However, if the discrepency is due to error in the Kaggle dataset, we note that this decreases the reliability of the dataset. Portal 2 sales were found on Wikipedia:

https://en.wikipedia.org/wiki/Portal_2#Sales

Since we had previously combined rows with the same name and release date into one row, we will now de-combine them into separate rows that differ by platform. When we analyze Reddit posts after de-combining, the results for platform categories were:

Table 7: Rankings for “Platform”

PC games appear to be very popular on r/gaming despite PC being only the 7th highest selling platform. This implies that r/gaming favors PC gaming more than the general North American population. Multi-platform games make it hard to discern if a game was bought based on which platform it is on. Due to the high frequency of multi-platform games, we do not use platform as a factor in regression.

Table 8: Rankings for “Release Years”

Due to earlier years having more games, the dataset may have more games from earlier years than later years, and may not be reflective of all games released between 2010 to 2016. Otherwise, there were no other surprising results from this table. Release Year was not used as a variable in model fitting because years do not repeat in the future, and thus a model trained using years would be useless for predicting games released in years after 2016, as it was not trained on any games released in years after 2016.

Table 9: Rankings for “Release Months”

The Fall season appears to be when games are released and sell more, and the Spring season comes in second place.

Now we will compare overall activity on r/gaming with game release times, instead of just measuring r/gaming activity using mentions of games in our dataset:

Figure 4: Number of Games Released in Month vs Sales vs r/gaming Monthly activity

These sales and r/gaming barplots share very little similarities. Sales and the Number of Games seem to be very similar. Despite having June being the 5th ranked month in terms of number of games released, games do not sell as well during that month. On the contrary, r/gaming is highly popular during June. This does not mean game release does not influence activity in that month, but it means it is not the main or only factor in r/gaming activity.

There could be a number of reasons to explain spikes in r/gaming activity; they may be due to more games being released during those times, or may be due to Reddit users being more active during those times, not directly related to game release times. This also could be influenced by popular games that just happened to be released during an arbitrary month.

Click here to continue to Part 2...

Part 1, Regression Analysis on Video Game Sales: Web Scraping and Community Analysis

Table of Contents

Recent Posts

Comments