Reddit is an expansive site. Anyone who has spent any significant amount of time on it knows what I mean. There is a subreddit for seemingly every topic anyone could ever want to discuss or even think about (and many that most do not want think about).
Reddit is a powerful site; a tool for connecting and sharing information with like- or unlike-minded individuals around the world. When used well, it can be a very useful resource.
On the other hand, the deluge of information that's constantly piling into the pages of can be overwhelming and lead to wasted time. As with any tool, it can be used for good or for not-so-good.
A common problem that Redditors experience, particularly those who are relatively new to the site, is where to post content. Given that there are subreddits for just about everything, with wildly varying degrees of specificity it can be quite overwhelming trying to find the best place for each post.
Just to illustrate the point, some subreddits get _weirdly_ specific. I won't go into the _really_ weird or NSFW, but here are some good examples of what I mean by specific:
...need I go on? (If you're curious and/or want to be entertained indefinitely, here is a thread with these and much, much more.)
Most of the time when a post is deemed irrelevant to a particular subreddit, it will simply be removed by moderators or a bot. However, depending on the subreddit and how welcoming they are to newbies, sometimes it can lead to very unfriendly responses and/or bans.
So how does one go about deciding where to post or pose a question?
Post Here aims to take the guesswork out of this process.
The goal with the Post Here app, as mentioned, is to provide a tool that makes it quick and easy to find the most appropriate subreddits for any given post. A user would simply provide the title and text of the their prospective post and the app would provide the user with a list of subreddit recommendations.
Recommendations are produced by a multinomial naive Bayes model that attempts to predict which subreddit a given post would belong to. The model was built using Scikit-learn, and trained on a large dataset of reddit posts. In order to serve the recommendations to the web app, an API was built using Flask and deployed to Heroku.
The live version of the app is linked below.
I worked on the Post Here app with a remote, interdisciplinary team of data scientists, machine learning engineers, and web developers. I was one of two machine learning engineers on the team, responsible for the entire process of building and training the machine learning models. The two data scientists on the team were primarily responsible for building and deploying the API.
The main challenge we ran into, which directed much of the iterative process, was scope management.
At this point in my machine learning journey, this was one of the larger datasets that I'd taken on. Uncompressed, the dataset we used was over 800mb of mostly natural language text. The dataset and the time constraint—we had less than four full days of work to finish the project—were the primary causes of the challenges we ended up facing.
With such a dataset, one important concept we had to keep in mind was the curse of dimensionality, which is basically a title for the various problems and phenomena that occur when dealing with extremely highly dimensional datasets. When processed, a natural language dataset of this size would likely fall prey to this curse and may prove somewhat unwieldy without large amounts of processing power.
I ended up researching and applying various methods of addressing this problem in order to fit the processing/modeling pipeline on the free Heroku Dyno, with a memory limit of 500mb, while preserving adequate performance. Many of our deployments failed because the pipeline, when loaded into memory on the server, exceeded that limit.
One important tradeoff we had to wrangle with was how much, and in what ways we could limit the dataset—i.e. how many classes to try and predict, and how many observations per class to include when training. The original dataset contains data for 1,000 subreddits. It was not within the scope of a a four-day project to build a classification model of a caliber that could accurately classify 1,000 classes.
In the beginning, we did try to build a basic model trained on all 1,000 classes. But with the time and processing power I had, it proved to be untenable. In the end, we settled for a model that classified text into 305 subreddits with a test precision-at-k of .75, .88, and .92 for 'k' of 1, 3, and 5, respectively.
The dataset we ended up using to train the recommendation system is called the Reddit Self-Post Classification Task dataset, available on Kaggle thanks to Evolution AI. The full dataset clocks in at over 800mb, containing 1,013,000 rows: 1,000 posts each from 1,013 subreddits.
For more details on the dataset, including a nice interactive plot of all the subreddits and their relevance to one another, refer to Evolution AI's blog post.
Wrangling and Exploration
First, I needed to reduce the size of the dataset. I defined a subset of 12 categories which I thought were most relevant to the task at hand, and used that list to do the initial pruning. Those 12 categories left me with 305 unique subreddits and 305,000 rows. The list I used was as follows:
Next, I took a random sample of those 305,000 rows. The result was a dataset with 91,500 rows, now consisting of between 250 and 340 rows per subreddit. If I tried to use all of the features (tokens, or words) that resulted from this corpus, even in its reduced state, it would still result in a serialized vocabulary and/or model too large for our free Heroku Dyno. However, the features used in the final model can be chosen based on how useful they are for the classification.
According to the dataset preview on Kaggle, there are quite a large number of missing values in each of the features—12%, 25%, and 39% of the subreddit, title, and selftext columns, respectively. However, I did not find any sign of those null values in the dataset nor mention of them in the dataset's companion blog post or article. I chocked it up to an error in the Kaggle preview.
Finally, I went about doing some basic preprocessing to get the data ready for vectorization. As described in the description page on Kaggle, newline and tab characters were replaced with their HTML equivalents, '<lb>' and '<tab>'. I removed those and other HTML entities using a simple regular expression. I also concatenated 'title' and 'selftext' into a single text feature in order to process them together.
A vectorizer is used to extract numerical features (information) from a corpus of natural language text. I used a bag-of-words method of vectorization, which for the most part, disregards grammar.
The output of this vectorizer is a document-term matrix, with the documents (observations, or rows) on one axes and the terms (features, which are words and/or n-grams) on the other. This matrix can be thought of as a sort of vocabulary, or text-number translator.
At first, the "vocabulary" derived from the corpus using the vectorizer was the largest object when serialized. Luckily, there are many options and parameters available to reduce its size, most of which are simply different methods for reducing the number of features (terms) it contains.
One option is to put a hard limit of 100,000 on the number of features in the vocabulary. This is a simple, naive limit on the generated features, and thus, the resulting vocabulary size.
I decided to remove stopwords before vectorization in hopes that this would reduce the size of the vector vocabulary. To my initial surprise, removing the stop words (using NLTK's list) actually increased the size of the serialized vocab from 59mb to 76mb.
After some consideration, I found this to be a reasonable result. I figured that many of the stop words are short ("I", "me", "my", etc.), and their removal caused the average length of words (and therefore bigrams as well) in the vocab to increase. While this may not account for the entirety of the difference, this provides some intuition as to why there is a difference.
Although the vocab without stop words was larger, I ended up using it anyways because it provided an extra ~0.01 in the precision-at-k score of the final model.
As mentioned previously, the size of the corpus means the dimensionality of the featureset after vectorization will be very high. I passed in 100,000 as the maximum number of features to the vectorizer, limiting the initial size of the vocab. However, the features would have to be reduced more before training the model, as it is generally not good practice to have a larger number of features (100,000) than observations (91,500).
To reduce it down from that 100,000, I used a process called select k best, which does exactly what it sounds like: selects a certain number of the best features. The key aspect of this process is how to measure the value of the features; how to find which ones are the "best". The scoring function I used in this case is called chi2 (chi-squared).
This function calculates chi-squared statistics between each feature and the target, measuring the dependence, or correlation, between them. The intuition here is that features which are more correlated with the target are more likely to be useful to the model.
I played around with some different values for the maximum number of features to be selected. Ultimately, I was once again limited by the size of the free Heroku Dyno, and settled on 20,000. This allowed the deployment to go smoothly while retaining enough information for the model to have adequate performance.
In this case, the model has a target that it is attempting to predict—a supervised problem. Therefore, the performance can be measured on validation and test sets.
To test out the recommendations I copied some posts and put them through the prediction pipeline to see what kinds of subreddits were getting recommended. For the most part, the predictions were decent.
The recommendations were sometimes a little less than ideal when I pulled example posts from subreddits that were not in the training data. This is somewhat to be expected, and for the few examples I used, the model generally did a good job recommending similar subreddits.
For the baseline model, I decided to go with a basic random forest. This choice was somewhat arbitrary, though I was curious to see how a random forest would do with such a high target cardinality (number of classes/categories).
The baseline precision-at-k metrics for the random forest on the validation set were .54, .63, and .65, for 'k' of 1, 3, and 5, respectively.
Multinomial Naive Bayes
Multinomial naive Bayes is a probabilistic learning method for multinomially distributed data, and one of two classic naive Bayes algorithms used for text classification. I decided to iterate with this algorithm because it is meant for text classification tasks.
The precision-at-k metrics for the final multinomial naive Bayes model on the validation set were .76, .88, and .9188, for 'k' of 1, 3, and 5, respectively. Performance on the test set was nearly identical: .75, .88, and .9159.
As mentioned, the model, vocab, and feature selector were all serialized using Python's pickle module. In the Flask app, the pickled objects are loaded and ready for use, just like that.
I wrote a function to return a list of the subreddits with the highest predicted probabilities, rather than returning only the top prediction.
I'll go over the details of the deployment, such as how the Flask app was set up, in a separate blog post.
For me, the most important and valuable aspects of this project were mainly surrounding the challenge of scope management. I constantly had to ask myself, "What is the best version of this I can create given our limitations?"
At first, I thought it would be feasible to predict all of the 1,000+ subreddits in the data, and used up hours of valuable time attempting to do so. While I had tested various strategies of reducing the complexity of the model, the performance was rather terrible when it was trained on 100 or less examples of each of the complete list of subreddits.
The data scientist who I primarily worked with (we had one data scientist in addition to him and one other machine learning engineer, both of whom did not contribute significantly to the project) kept telling me that I should try reducing the number of classes first, allowing for more examples of each class and fewer classes for the model to predict.
Ultimately, this is the strategy that worked best, and I wasted valuable time by not listening to him the first few times he recommended that strategy. Good teamwork requires the members being humble and listening, something that I have taken to heart since the conclusion of this project.
Scope Management, Revisited
As time was very short while building this initial recommendation API, there are many things that we wished we could have done but simply did not have the time. Here are a few of the more obvious improvements that could be made.
The first, and most obvious one, is to simply deploy to a more powerful server, such as one hosted on AWS Elastic Beanstalk or EC2. This way, we could use the entire dataset to train an optimal model without worrying (as much) about memory limits.
Second, I could use a Scikit-learn pipeline to validate and tune hyperparameters using cross-validation, instead of a separate validation set. Also, this pipeline could be serialized as a single large object, rather than as separate pieces (encoder, vectorizer, feature selector, and classifier). As a final note for this particular train of thought, Joblib could potentially provide more efficient serialization than the Pickle module, allowing a more complex pipeline to be deployed on the same server.
Third, a model could've been trained to classify the input post first into a broad category. Then, some sort of model could be used to to classify into a specific subreddit within that broad category. I'm not sure about the feasibility of the second part of this idea, but thought it could be an interesting one to explore.
Lastly, different classes and calibers of models could have been tested for use in the various steps in the pipeline. In this case, I'm referring primarily to using deep learning/neural networks. For example, word vectors could be generated with word embedding models such as Word2Vec. Or the process could be recreated with a library like PyTorch, and a framework like FastText.
I plan to explore at least some of these in separate blog posts.
As always, thank you for reading! I'll see you in the next one.