LAYMAN'S SUMMARY
The main objective of this project was to predict, using known social media network data, whether two people know each other in real life despite not being friends on social media. The models that were constructed had high success rates for prediction.
ABSTRACT
In a social network, each node is a person, and there is an edge between two people if they are friends. In Link Prediction, given a pair of nodes in which it is unknown if there is an edge between them, an algorithm is trained to classify if the pair contains an edge or not. This project studies Link Prediction on networks of social media platforms and is divided into 2 parts:
For part 1, we trained classifiers on social networks, and then tested the models on networks with different topologies represented using statistical measures. The classifiers performed well, with accuracy and AUC scores ranging from around 0.7 to 0.9.
For part 2, we removed a proportion of edges to obtain a network with unknown links, called a ‘corrupted network’. We trained classifiers on the corrupted network, and then tested the models both on node pairs whose edges were removed, and on unaltered node pairs in the testing set. For different proportions of removed edges, we ran 100 samples which each removed a random subset of edges, and took their average scores. For proportions of 0.1 to 0.6, the accuracy and AUC scores were above 0.8.
The code for this project can be found on Github:
INTRODUCTION
0.1) Overview of Project
0.2) Description of Network Data
0.3) Choosing a Machine Learning Algorithm
0.4) Model Selection of C and Gamma Hyperparameters
PART 1- LINK PREDICTION BETWEEN UNCORRUPTED NETWORKS
1.1) Description and Methods
1.2) Results and Discussion
PART 2- LINK PREDICTION ON CORRUPTED NETWORKS
2.1) Description and Methods
2.2) Results and Discussion
CONCLUSION AND REFERENCES
INTRODUCTION
0.1) Overview of Project
The main objective of this work is to study how trained network classifiers perform under different data limitations: in part 1, the limitation is the lack of training data with similar network topology as the testing data. In part 2, the limitation is when there is a lack of data about a network’s edges.
This project utilized two datasets: the first consisted of a YouTube social network from a paper analyzing statistical properties of online social networks [3]. The second consisted of two Facebook social networks that we gathered using the Netvizz app. We believe that prior to this work, Link Prediction had not been applied to either of these datasets.
The predictor variables used in this project are based on the sociological theory of triadic closures, generalized to more than 3 nodes. The idea is that if A and B are friends, and B and C are friends, then B and C are likely to be friends. In other words, two people who share mutual links are more likely to be linked. In this context, one potential application of Link Prediction is to recommend two users, who are not Facebook friends yet, to connect with each other online.
This project was implemented in Python. Machine Learning was performed using Python’s sci-kit learn library. Python was chosen over R because it could handle memory and running time issues that R ran into. The code was run on a Acer Predator G3-571 Laptop with a NVIDIA GeForce GTX 1060 graphics card. The project took approximately two weeks to complete.
0.2) Description of Network Data
Network A and Network B are the Facebook networks of two different non-famous users, who are not included in their networks. These two users share no mutual friends. The networks were collected using the Netvizz app in December 2014. They were obtained with the consent of the users for the purpose of data analysis, as long as all profiles remain anonymized.
The YouTube data gathered by Mislove, et al. [3] was originally used to study network characteristics such as small world properties. It was collected during January 2007. The nodes represent users, and the edges represent “friends”; since YouTube has no formal definition of a user’s friends, it is assumed that the authors defined user P being “friends” with user Q if P subscribed to Q’s channel. This assumption also explains why the edges are directed, as P can subscribe to Q but Q does not have to subscribe to P. Though this was an directed network, 79.1% of links were symmetric, so most of the network was undirected. The average weighted degree and average unweighted degrees were the same for YouTube networks. Each network’s highest degree node also had an indegree that equaled the node’s outdegree. In this article, YouTube will be referred to using the abbreviation ‘YT’.
Python encounters a memory error when dealing with networks with more than 100k edges, so the entire YT network could not be used. Instead, it was split into two sub-networks: YT Big and YT Small. YT Big used nodes with IDs 1 to 3000 and YT Small used nodes with IDs 3001 to 4000. Edges were only kept if they were between two nodes in the sub-network.
Table 1: Network Statistics
HD and Density were not reported in original paper with YouTube data. When the full network was opened in Gephi, it took too long to calculate its statistics, so they were not calculated.
The Facebook networks were denser than the YouTube networks, indicating that there were more connections in ratio to the number of users. This is expected as Facebook reflects communities where users know one another in real life, which facillitates the meeting of mutual friends, whereas YouTube connections are often between users who only recognize each other by their uploaded videos and do not communicate with one another after subscribing.
On the other hand, YouTube has more nodes with higher degrees. This may be because YouTube focuses on connecting users through other famous users, or “hubs”, who create content to draw fans in. Facebook profiles of non-famous users usually do not have fans, and thus are less likely to have hubs.
Figure 1: Network Visuals. The nodes are colored according to degree distribution. The node color scheme can be found on the .gephi files on Github.
(Top Left)- Network A; (Top Right)- Network B;
(Bottom Left)- YT Big; (Bottom Right)- YT Small
0.3) Choosing a Machine Learning Algorithm
To model the classifier, Support Vector Machines (SVMs) were chosen for their ability to perform classification well in similar studies [1]. SVMs use a soft margin to allow some misclassification with the tradeoff of improved fitting. Since only 5 features were used, the number of features was far less than the number of samples, so there was no need to use dimension reduction techniques such as PCA. This also meant that instead of the Linear Kernel, features should be mapped to a different high dimensional space using the RBF kernel.
Accuracy scores were calculated to assess how many samples were correctly classified. AUC scores were calculated to assess, summarizing over each threshold dividing the two classes, how much higher the true positive rate is compared to the false positive rate. For different class proportions in a dataset, different thresholds can beget different accuracies. Thus, AUC assesses how well the model performs for different class distributions on datasets. ROC curves were also outputted by the code, but they were not included in this report. In this scenario, neither false positives nor false negatives were deemed more costly than the other, so precision and recall were not considered.
0.4) Model Selection of C and Gamma Hyperparameters
SVM with the RBF kernel has two adjustable hyperparameters, C and Gamma. SVM aims to find a hyperplane that leaves a wide margin between two classes to reduce overfitting and improve generalization; a wide margin is analogous to decreasing variance but increasing bias. A wider margin increases the chances of wrongly classifying more training points. As C gets smaller, the margin widens and variance decreases, but bias increases and there is a higher chance of underfitting. This also happens when Gamma gets larger. Thus, we want to find values of C and Gamma that optimally trade off bias and variance, making sure neither one is “too high”.
Optimal values for C and Gamma were found by grid search using stratified 5-fold cross validation with an 80/20 training/testing split. It was found that unscaled data yielded better accuracy scores than data scaled by subtracting the mean and scaling to unit variance, so all data was unscaled.
For all networks, the best parameters found were {'C': 1000, 'Gamma': 0.00001}, with accuracy scores around 0.96. These scores show a vast improvement over the accuracy score of 0.780 obtained using the default values of C = 1 and Gamma = 1/(# of features) = 0.2. However, C = 10 and Gamma = 0.001 were found to give better training and testing accuracy and AUC scores when actually running the models; thus, these values were used instead. A heatmap for network A showing accuracy values for various combinations of C and Gamma is given below.
Figure 2: Heatmap for Hyperparameters C and Gamma
PART 1- LINK PREDICTION BETWEEN UNCORRUPTED NETWORKS
1.1) Description and Methods
These features are based on generalized triadic closure. Let Γ(x) be the set of node x’s neighbors. The Jaccard Index is the normalized form of Common Neighbors, so the two features would presumably be very correlated. Still, other papers such as Budar, et al [2] use both features in their models, so we will also use both features in our model.
Table 2: Features description. The first four equations come from [4].
The training sets needed to be balanced with an equal amount of node pairs from both classes; otherwise, the accuracy score would not be a good measure to interpret. For instance, a network like YT Small has many nodes with no neighbors, so most of the node pairs are in class 0. If the training/testing data split were randomly sampled, most observations in each set would have class 0, and the classifier may find a higher accuracy if it predicts class 0 most of the time for the training data. But when this classifier acts on a balanced testing dataset, it would only have an accuracy closer to 50%.
Pipeline from Data Preprocessing to Training to Testing:
1) Converted network from .gdf format (Netvizz data), or from YouTube data’s original format, to an adjacency list, represented using a hash table for fast retrieval.
2) Selected training observations:
To get a balanced training set with an equal amount of class 0s and class 1s, we took all the pairs with a link between them and all the pairs without a link between them. The link pairs were stored in array ‘Linked’. For the pairs with no link, we further calculated which one of these pairs had at least one path between length 2 and Kmax, and we stored all these pairs in an array ‘Deg Sep’. The pairs which were not within Kmax degrees of separation were stored in an array ‘Very Far’.
Then, we found which of these 3 arrays were the smallest; this would be N, the number of training observations. This ensures that all the 3 arrays are greater than N/2. So the training observations consisted of: 1/2 of Linked, 1/4 of Deg Sep, and 1/4 of Very Far.
NOTE: the feature, ‘Shortest Paths’, would be able to distinguish between Deg Sep node pairs and not Deg Sep node pairs. However, node pairs not in Deg Sep can be either in class 0 or 1, so this using this feature by itself is not enough for classification.
3) Transformed graph data into a feature matrix, trained SVMs, and tested the models
1.2) Results and Discussion
Four classifiers, each trained on a different network, competed to be the best in four categories. In each category, they were judged by how they performed in classifier test data from a certain network. In this report, scores are rounded to 3 decimal places.
Table 3: Number of training observations used to train each classifier. Due to how we sampled observations to get balanced datasets, we do not use all the node pairs in a network for training.
Figure 3: Accuracy Scores for Uncorrupted: Rows are trained classifiers, columns are the dataset the classifier is tested on.
Figure 4: AUC Scores for Uncorrupted: Uses same axis labels as Figure 6.
The scores on the diagonal are training scores. Splitting a network into training and testing is the same as corrupting a network, as corruption removes node pairs (observations) from training and later, the trained classifier is tested on the removed node pairs. Therefore, the Facebook networks were not split into training and testing datasets because a corrupted network analysis will be applied to them in the next part. YT Big and YT Small can be considered as training and testing subsets of one another, since they are sampled from the same dataset, which means their observations belong to related statistical distributions and topologies.
For predicting pairs in the Facebook networks, the classifier trained on YT Big performed the best. For predicting pairs in the YouTube networks, the classifier trained on Network A performed the best. One interesting result was that for the undirected YouTube networks and the directed Facebook networks, the classifiers trained on undirected graphs performed slightly better on directed graphs than the classifiers trained on directed graphs, and vice versa. Overall, classifiers appear to be generalizable for predicting networks similar to the network they were trained on.
YT Small performed the worst out of all classifiers, but this was expected as it was trained on the least amount of node pairs. Following the same logic, YT Big performed the best due to having the most amount of node pairs in its training set.
SVM using the RBF kernel does not have feature ranking because it does not assign weights to the original features. It uses a nonlinear kernel transformation to transform the original feature space into a new space, and then constructs the hyperplane in the new space. On the other hand, SVM using a Linear kernel uses a hyperplane to separate the classes by maximum margin. The hyperplane is determined by feature weights, which are the coordinates of the vector orthogonal to the hyperplane. Since it still operates in the original feature space, weights are assigned to the original features, and feature ranking is possible.
Thus, to get a very rough estimate of feature ranking using the RBF kernel, a classifier was trained with the SVM Linear Kernel algorithm using the training set from Network A, and feature ranking was performed:
Preferential Attachment: -8618.096
Adamic-Adar Coefficient: 19.241
Jaccard Index: -10.780
Common Neighbors: -4.186
Number of Shortest Paths of length 2 and 3: 0.025
Shortest path has the least predictive power, whereas Preferential Attachment has several magnitudes greater predictive power than the feature in second place. This implies that, given information about the network but with incomplete information about certain node pairs, node pairs that contain higher degree nodes are more likely to have edges. Interestingly enough, Preferential Attachment has little to do with triadic closure, so we can hypothesize that, in contrast to our initial guess, triadic closure is not as impactful on Link Prediction compared to node degree.
In Budur, et al., the AUC when using only Preferential Attachment is the highest (though by only 0.005).
Figure 5: Feature Ranking from Budur, et al.
Al Hasan, et al. [1] tested a classifier on 2 networks of research authors (connected if they coauthor), each with between 100,000 to over 1.5 million nodes, to predict if two authors will coauthor. They found SVM with RBF kernel performed the best out of all learning algorithms tested- these included SVM with Linear kernel, decision trees, multilayer perceptron, and bagging. SVM with RBF kernel had accuracy scores of 90.56% and 83.18% for the two networks. Though we did not use the same data and methods as Al Hasan, et al., their scores are similar in range to our study’s scores.
PART 2- LINK PREDICTION ON CORRUPTED NETWORKS
2.1) Description and Methods
Consider a scenario in which person P and person Q are not friends on Facebook, but know each other in real life. Can we predict that they know each other using their Facebook social network? Since data for A and B knowing each other offline is hard to obtain, studies have approximated it by removing edges from a network to obtain a ‘corrupted’ network [2]. Researchers know that the node pairs with removed edges were actually friends, but the corrupted data does not show it. A classifier can be trained on the corrupted network, and can then be employed to predict the class of pairs with removed edges.
Uncorrupted networks cannot be used to predict corrupted network because in practice, if one only has a corrupted network then one does not have the uncorrupted network. Additionally, there is no point in using a classifier trained on corrupted network A to predict network B’s classes, because it is the same as using a watered down classifier trained on network A to predict network B. Only one network, network A, was corrupted for analysis.
The training pipeline is a modification of Part 1’s pipeline; models from both parts use the same features. Let PCT_RMVD be the % of edges removed by random selection. To reduce the error introduced by random sampling of node pairs, we took the average of prediction scores for 100 samples, each one removing a different subset of node pairs at random. This was done for PCT_RMVD values from 0.1 to 0.9, in intervals of 0.1. There were 10,886 node pairs to sample from.
For the test data observations, we included the node pairs that had their edges removed whose true class is 1 because our aim is to see how well the classifier does at predicting if those removed edges are of class 1. However, we cannot only populate the test data with observations whose true class is 1, since we want to look for false positives. Additionally, node pairs that have degrees of separation greater than Kmax are usually easy to predict; when we look at their features, most of them are 0. Thus, we also included node pairs, between 2 and Kmax degrees of separation, that were not used to train the model and did not have their edges removed, and so had true class 0. These are usually harder to classify right.
Pipeline from Data Preprocessing to Training to Testing:
For each PCT_RMVD:
For 100 samples:
Corrupted network by proportion PCT_RMVD and got its adjacency list and adjacency matrix
Selected training observations: Though each random sample had a random subset of edges removed, the training observation pairs for each random network were the same. For example, if the pair (0,1) was used in one sample as a training observation, it was used in all 100 samples. This allowed each sample’s accuracy scores to be more comparable with one another. We did not allow any node pairs which had their edges removed to be included in the training observation. This is because they will be used in the test data matrix.
Using the corrupted network, calculated and stored the features for each training observation. We also took all the observations in Deg Sep which were not used for training, calculated their features, and stored them in the test data matrix. Then we took all the pairs that had their edges removed, calculated features for them, and stored them in the test data matrix.
Ran SVM and prediction
2.2) Results and Discussion
Figure 6: The rows show the % removed. (Left)- Accuracy; (Right)- AUC
Figure 7: (Left)- PCT_RMVD vs Avg Test Acc; (Right)- PCT_RMVD vs Avg Test AUC
For all PCT_RMVD values, the classifier’s performance decreased exponentially. The classifier appears to perform well until after PCT_RMVD is around 0.6. This suggests that there is a threshold at which too much corruption prevents the classifier from working well. Overall, this means that for a network with topology statistics similar to network A, then even if the network is 30-40% corrupted, a Link Prediction classifier may be trained on it.
Nodes with higher degrees have a higher chance of having their edges removed. If a network contained a node which had the majority of edges, it would soak up the corruption, preventing the network from becoming disconnected. However, network A did not contain such a node.
Budur, et al. [2] corrupted a network with >1 million nodes, >3 million edges. The classifier was trained via Gradient Boosting Machine. They found prediction score increased as more edges were removed; they claimed it was because edge removal made networks less clustered, and thus training on less clustered networks led to improved prediction. For our project’s datasets, edge removal did not lead to improved prediction, perhaps because the networks did not become less clustered. Instead, the clusters remained intact.
Figure 8: % Removed vs. Testing AUC from Budur, et al. The light-grey function’s shape is consistent the function on the right in Figure 7.
CONCLUSION
For Link Prediction on uncorrupted networks, classifiers were able to generalize to networks that were different than the ones they were trained on. Preferential attachment, which is higher for node pairs with high degree nodes, was an important feature for Link Prediction. For Link Prediction corrupted networks, an exponentially decreasing relationship was found between the % of edges removed and both accuracy and AUC scores.
The following are potential areas for further investigation:
Analyze the types of nodes involved in node pairs where classification was correct, vs nodes where classification was incorrect. For instance, are hubs more likely to connect to any random node? If so, would the classifiers be trained to assume any hub with any other node would likely have an edge between them? This hypothesis is based on how important Preferential Attachment is in Link Prediction.
Use more diverse networks types with varying levels of density, different communities, degree distributions, and more.
For corrupted networks, remove nodes, especially those with high betweenness, which would more likely cut networks into disjoint clusters.
References
[1] Al Hasan, Mohammad, Vineet Chaoji, Saeed Salem, and Mohammed Zaki. "Link Prediction using supervised learning." In SDM06: workshop on link analysis, counter-terrorism and security. 2006.
[2] Budur, Emrah, Seungmin Lee, and Vein S. Kong. "Structural Analysis of Criminal Network and Predicting Hidden Links using Machine Learning." arXiv preprint arXiv:1507.05739 (2015).
[3] Mislove, Alan, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. "Measurement and analysis of online social networks." In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pp. 29-42. ACM, 2007.
[4] Wang, Peng, BaoWen Xu, YuRong Wu, and XiaoYu Zhou. "Link Prediction in social networks: the state-of-the-art." Science China Information Sciences 58, no. 1 (2015): 1-38.
Comments