Welcome to Team Reddit's Webpage!

Project for COSC 4610


Project maintained by Nickg97 Hosted on GitHub Pages — Theme by mattgraham

Project: Reddit User Predictions

Team Name: Team Reddit

Team Members: Nicholas Garcia, Michael Mastalish, Aaron Moriak

Abstract

As a social network, Reddit has many unique features that make it well suited to data mining and machine learning research. The data generated by the relationships between users and content can be used to generate content recommendations, leading to a more personalized experience for the redditor. Our project builds a classification model to predict whether a user will upvote or downvote a given post. We will use a user’s past history of upvotes and downvotes as a training set and evaluate a variety of methods for finding patterns or correlations in it. Data will be obtained by drawing from popular subreddits and examining common users. Results from our classification method have a variety of applications, from content recommendations to targeted advertising for specific users.

Introduction

Social networks have been a frequent subject of data mining and machine learning research for many years. With the rise of the internet and the dawn of the information age, they have become a key part of the broader social fabric in a globalized world. Understanding the relationship between users and content is a key aspect of any social network, as managing these relationships effectively has the potential to be financially lucrative. Our project focuses on Reddit, a large social network in which users interact with content inside smaller communities known as subreddits. While user interaction can be quite complicated, we will examine one particular feature: the “upvotes” and “downvotes” that users cast on content. Using a variety of prediction and classification methods, we will attempt to establish relationships between users, their subreddits, and the content they interact with. In particular, one goal of this project is to predict how a frequent user of a given subreddit will interact with content on another subreddit, using their upvote/downvote history.

Related Works and Prior Research

This project draws inspiration from many prior attempts to utilize the rich data contained within social networks. Common techniques usually include some element of clustering to establish relationships between data points, and applications of graph theory are used to identify patterns between users (link one). Due to the scope and variability of Reddit, no existing study has treated the entire site comprehensively. In particular, user classification has not seen extensive study; one past research project opted for a largely qualitative approach to classifying users (link two), which leaves room for a more quantitative approach. Past models for predicting the success of individual posts (link four) and for recommending individual posts (link three) used the existing upvote and downvote structure within Reddit as a quantitative basis for their predictions. The research carried out by these projects will be particularly beneficial to our classification attempt, as they investigate a wide variety of models and analyze their effectiveness in specific contexts. Past research has also brought important challenges to light; namely, removing subjective bias and identifying external factors have both proven difficult (link one).

Data Set and Features

To achieve our project goals, we need a variety of data on how users interact with posts. This data will include the posts that a given user upvotes or downvotes, the subreddit those posts are located in, and a list of users within those subreddits. The upvote/downvote data will give us a sense of the topics a particular user is interested in. Our model will use a metric we call “support”, which is the ratio of upvotes to downvotes by a given user on a given subreddit. This metric tells us how much support the user gives to that subreddit. Our use of this metric assumes that a subreddit with support greater than one, where the user has upvoted more posts than they have downvoted, discusses a topic the user generally agrees with. Likewise, a subreddit with support less than one, where the user has downvoted more posts than they have upvoted, is assumed to discuss a topic the user generally disagrees with. In conjunction with support, we will also use a metric called “interest”, which is the ratio of posts the user has interacted with to the total number of posts in that subreddit. This metric describes how involved or interested a user is in a given subreddit. A user with a high interest value is assumed to be highly involved in the subreddit in question, responding to and interacting with it frequently.

To get this data we will use a scraper to quickly gather user, subreddit, and post data from Reddit. The scraper will use PRAW, the Python Reddit API Wrapper. A benefit of this package is that the data it returns can be loaded into Pandas with little effort. Preprocessing the data for model building should be fairly simple; we expect that only combining datasets and reorganizing their rows and columns will be necessary.
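The support and interest metrics described above can be illustrated with a minimal sketch in plain Python. The function name, the (user, subreddit, vote) record layout, the +1/-1 vote encoding, and the sample data are all assumptions made for illustration and are not part of the project's actual pipeline. The sketch also assumes one convention the report does not specify: when a user has no downvotes in a subreddit, support is reported as infinity.

```python
from collections import defaultdict


def support_and_interest(votes, posts_per_subreddit):
    """Compute (support, interest) for every (user, subreddit) pair.

    votes: iterable of (user, subreddit, vote), with vote = +1 for an
        upvote and -1 for a downvote (hypothetical encoding).
    posts_per_subreddit: dict mapping subreddit -> total number of posts.
    """
    ups = defaultdict(int)
    downs = defaultdict(int)
    for user, subreddit, vote in votes:
        if vote > 0:
            ups[(user, subreddit)] += 1
        elif vote < 0:
            downs[(user, subreddit)] += 1

    metrics = {}
    for key in set(ups) | set(downs):
        up, down = ups[key], downs[key]
        # Support: ratio of upvotes to downvotes.
        # Assumed convention: infinity when the user never downvoted.
        support = up / down if down else float("inf")
        # Interest: posts the user voted on / total posts in the subreddit.
        interest = (up + down) / posts_per_subreddit[key[1]]
        metrics[key] = (support, interest)
    return metrics


# Made-up sample data for one hypothetical user.
sample_votes = (
    [("alice", "python", 1)] * 3
    + [("alice", "python", -1)]
    + [("alice", "politics", 1)]
    + [("alice", "politics", -1)] * 2
)
sample_totals = {"python": 10, "politics": 20}

metrics = support_and_interest(sample_votes, sample_totals)
for (user, subreddit), (support, interest) in sorted(metrics.items()):
    print(f"{user} / {subreddit}: support={support:.2f}, interest={interest:.2f}")
```

In this made-up sample, the user's support for the first subreddit is above one and for the second is below one, which under the report's assumption would mark the first as a topic the user generally agrees with and the second as one they generally disagree with.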

Methods

Our solution will consist of a binary classification model. Several metrics can be used to evaluate the performance of such a model, all of which are built from four counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The true positive rate (TPR) measures the proportion of actual positives that are correctly identified as such. The TPR is measured using the following equation:

True Positive Rate (TPR) = TP / (TP + FN)

The TPR is calculated by dividing the true positives by the sum of the true positives and false negatives. The false positive rate (FPR) is the ratio between the number of negative events that were wrongly categorized as positive and the number of actual negative events. The FPR is defined as:

False Positive Rate (FPR) = 1 - (TN / (TN + FP))

The FPR is calculated by dividing the true negatives by the sum of the true negatives and false positives, then subtracting the result from one. Accuracy for this model is the ratio of correctly identified events, both positive and negative, to the total number of events. Accuracy is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy takes the sum of the true positives and true negatives and divides it by the sum of the true positives, true negatives, false positives, and false negatives. Error rate is the complement of accuracy; it gives the proportion of events that the model identified incorrectly. Error rate is defined as:

Error Rate = 1 - Accuracy

Precision answers how many of the events the model labeled as positive were actually positive. Precision is defined as:

Precision = TP / (TP + FP)

Precision is calculated by dividing the true positives by the sum of the true positives and false positives. Together, accuracy and precision will be the primary metrics by which the model is evaluated.
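As a rough illustration of how these metrics might be computed, the sketch below counts TP, FP, TN, and FN from a list of actual and predicted labels and then applies the formulas above. The function names, the encoding of an upvote as 1 (the positive class) and a downvote as 0, and the sample labels are assumptions made for this example only; the report does not fix any of them.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for labels encoded as 1 (upvote) or 0 (downvote)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 0:
            tn += 1
        else:  # predicted downvote, actual upvote
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Apply the formulas from this section.

    Assumes every denominator is non-zero, i.e. the data contains at least
    one actual positive, one actual negative, and one positive prediction.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "tpr": tp / (tp + fn),
        "fpr": 1 - tn / (tn + fp),
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp),
    }


# Made-up labels for eight posts: 1 = upvote, 0 = downvote.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
results = classification_metrics(*counts)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

For these sample labels the counts are TP = 3, FP = 1, TN = 3, and FN = 1, so the TPR, accuracy, and precision are each 0.75, while the FPR and error rate are each 0.25.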

Expected Outcomes and Future Work

This is a preliminary report on our findings; our goal of developing a model that predicts how a Reddit user will vote has not yet been achieved. Future work for this project includes solidifying the details of our classification model, such as expressing our metrics mathematically and putting the data in a format that can be used easily with the model. When completed, our project will have a fully functioning classification model that predicts whether a user will upvote or downvote a post in another subreddit based on previously identified features.

Appendix

Nicholas Garcia was responsible for researching the classification methods which will be used in our prediction model. He additionally maintained the team website and assisted with the writing of this report.

Michael Mastalish was responsible for analyzing related research and prior reports on this topic. He additionally helped frame the overall goals and scope of the project, and assisted with the writing of this report.

Aaron Moriak was responsible for the data collection for this stage of the project. He additionally added insight into potential project outcomes, and assisted with the writing of this report.

Cited Works

Adedoyin-Olowe, M., Gaber, M. and Stahl, F. (2014). A Survey of Data Mining Techniques for Social Media Analysis. Journal of Data Mining & Digital Humanities.

Buntain, C. and Golbeck, J. (2014). Identifying social roles in reddit using network structure. In: 23rd International Conference on World Wide Web. [online] New York, NY, USA: ACM, pp. 615-620. Available at: https://dl.acm.org/citation.cfm?id=2579231 [Accessed 26 Mar. 2019].

Pooja, N., Sachin, A. and Vincent, K. (2015). Classification of posts on Reddit. San Diego, University of California.

Poon, D., Wu, Y. and Zhang, D. (2001). Reddit Recommendation System. Stanford University.