Introduction

The Problem

Reddit is an expansive site. Anyone who has spent any significant amount of time on it knows what I mean. There is a subreddit for seemingly every topic anyone could ever want to discuss or even think about (and many that most do not want to think about).

Reddit is a powerful site: a tool for connecting and sharing information with like- or unlike-minded individuals around the world. When used well, it can be a very useful resource.

On the other hand, the deluge of information that's constantly piling into the pages of Reddit can be overwhelming and lead to wasted time. As with any tool, it can be used for good or for not-so-good.

A common problem that Redditors experience, particularly those who are relatively new to the site, is where to post content. Given that there are subreddits for just about everything, with wildly varying degrees of specificity, it can be quite overwhelming trying to find the best place for each post.

Just to illustrate the point, some subreddits get weirdly specific. I won't go into the really weird or NSFW, but here are some good examples of what I mean by specific:

...need I go on? (If you're curious and/or want to be entertained indefinitely, here is a thread with these and much, much more.)

Most of the time when a post is deemed irrelevant to a particular subreddit, it will simply be removed by moderators or a bot. However, depending on the subreddit and how welcoming they are to newbies, sometimes it can lead to very unfriendly responses and/or bans.

So how does one go about deciding where to post or pose a question?

Post Here aims to take the guesswork out of this process.

Post Here: Subreddit Suggester

The Solution

The goal with the Post Here app, as mentioned, is to provide a tool that makes it quick and easy to find the most appropriate subreddits for any given post. A user would simply provide the title and text of their prospective post and the app would provide the user with a list of subreddit recommendations.

Recommendations are produced by a model that attempts to predict which subreddit a given post would belong to. The model was built using Scikit-learn, and was trained on a large dataset of Reddit posts. In order to serve the recommendations to the web app, an API was built using Flask and deployed to Heroku.

The live version of the app is linked below.

My Role

I worked on the Post Here app with a remote, interdisciplinary team of data scientists, machine learning engineers, and web developers. I was one of two machine learning engineers on the team, responsible for the entire process of building and training the machine learning models. The two data scientists on the team were primarily responsible for building and deploying the API.

The main challenge we ran into, which directed much of the iterative process, was scope management.

At this point in my machine learning journey, this was one of the larger datasets that I'd taken on. Uncompressed, the dataset we used was over 800 MB of mostly natural language text. The dataset and the time constraint (we had less than four full days of work to finish the project) were the primary causes of the challenges we ended up facing.

With such a dataset, one important concept we had to keep in mind was the curse of dimensionality, an umbrella term for the various problems and phenomena that occur when dealing with very high-dimensional datasets. Once vectorized, a natural language dataset of this size would likely fall prey to this curse and could prove unwieldy without large amounts of processing power.

I ended up researching and applying various methods of addressing this problem in order to fit the processing/modeling pipeline on the free Heroku Dyno, with a memory limit of 500 MB, while preserving adequate performance. Many of our deployments failed because the pipeline, when loaded into memory on the server, exceeded that limit.

One important tradeoff we had to wrangle with was how much, and in what ways, we could limit the dataset, i.e., how many classes to try and predict, and how many observations per class to include when training. The original dataset contains data for over 1,000 subreddits. It was not within the scope of a four-day project to build a classification model of a caliber that could accurately classify that many classes.

In the beginning, we did try to build a basic model trained on all 1,000 classes. But with the time and processing power I had, it proved to be untenable. In the end, we settled for a model that classified text into 305 subreddits with a test precision-at-k of .75, .88, and .92 for 'k' of 1, 3, and 5, respectively.

Imports and Configuration

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import os
import janitor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import chi2, SelectKBest

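# === NLP Imports === #
from nltk.corpus import stopwords  # Needs a one-time nltk.download("stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer

# NLTK's English stop word list, used later as `stop_words`
# (assumed definition; the cell that created it is not shown in this post)
stop_words = stopwords.words("english")
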
# Configure pandas display settings
pd.options.display.max_colwidth = 100

# Set random seed
seed = 92

The Data

The dataset we ended up using to train the recommendation system is called the Reddit Self-Post Classification Task dataset, available on Kaggle thanks to Evolution AI. The full dataset clocks in at over 800 MB, containing 1,013,000 rows: 1,000 posts each from 1,013 subreddits.

For more details on the dataset, including a nice interactive plot of all of the subreddits, refer to Evolution AI's blog post.

Wrangling and Exploration

First, I needed to reduce the size of the dataset. I defined a subset of 12 categories which I thought were most relevant to the task at hand, and used that list to do the initial pruning. Those 12 categories left me with 305 unique subreddits and 305,000 rows. The list I used was as follows:

  • health
  • profession
  • electronics
  • hobby
  • writing/stories
  • advice/question
  • social_group
  • stem
  • parenting
  • books
  • finance/money
  • travel

Next, I took a random sample of those 305,000 rows. The result was a dataset with 91,500 rows, now consisting of between 250 and 340 rows per subreddit. If I tried to use all of the features (tokens, or words) that resulted from this corpus, even in its reduced state, it would still result in a serialized vocabulary and/or model too large for our free Heroku Dyno. However, the features used in the final model can be chosen based on how useful they are for the classification.

According to the dataset preview on Kaggle, there are quite a large number of missing values in each of the features: 12%, 25%, and 39% of the subreddit, title, and selftext columns, respectively. However, I did not find any sign of those null values in the dataset nor mention of them in the dataset's companion blog post or article. I chalked it up to an error in the Kaggle preview.

Finally, I went about doing some basic preprocessing to get the data ready for vectorization. As described in the description page on Kaggle, newline and tab characters were replaced with their HTML equivalents, <lb> and <tab>. I removed those and other HTML entities using a simple regular expression. I also concatenated title and selftext into a single text feature in order to process them together.

rspct = pd.read_csv("assets/data/rspct.tsv", sep="\t")
print(rspct.shape)
rspct.head(3)
(1013000, 4)
id subreddit title selftext
0 6d8knd talesfromtechsupport Remember your command line switches... Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow...
1 58mbft teenmom So what was Matt "addicted" to? Did he ever say what his addiction was or is he still chugging beers while talking about how sober he is?<lb><lb>Edited to add: As an addict myself, anyone I know whose been an addict doesn't drin...
2 8f73s7 Harley No Club Colors Funny story. I went to college in Las Vegas. This was before I knew anything about motorcycling whatsoever. Me and some college buddies would always go out on the strip to the dance clubs. We alwa...

Nulls

Kaggle says that 12%, 25%, and 39% of the subreddit, title, and selftext columns are null, respectively. If that is indeed the case, they did not get read into the dataframe correctly. However, it could be an error on Kaggle's part, seeing as there is no mention of these anywhere else in the description or blog post or article, nor sign of them during my explorations.

rspct.isnull().sum()
id           0
subreddit    0
title        0
selftext     0
dtype: int64

Preprocessing

To prune the list of subreddits, I'll load in the subreddit_info.csv file, join, then choose a certain number of categories (category_1) to filter on.

info = pd.read_csv("assets/data/subreddit_info.csv", usecols=["subreddit", "category_1", "category_2"])
print(info.shape)
info.head()
(3394, 3)
subreddit category_1 category_2
0 whatsthatbook advice/question book
1 CasualConversation advice/question broad
2 Clairvoyantreadings advice/question broad
3 DecidingToBeBetter advice/question broad
4 HelpMeFind advice/question broad
rspct = pd.merge(rspct, info, on="subreddit").drop(columns=["id"])
print(rspct.shape)
rspct.head()
(1013000, 5)
subreddit title selftext category_1 category_2
0 talesfromtechsupport Remember your command line switches... Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow... writing/stories tech support
1 talesfromtechsupport I work IT for a certain clothing company and they use iPod Touchs for scanning some items [ME]- Thank you fro calling Store support, this is David. How may I help you?<lb><lb>[Store]- Yeah, my iPod is frozen<lb><lb>[ME]- Okay, can I have you hold down the power and the home button at t... writing/stories tech support
2 talesfromtechsupport It... It says right there on the screen...? Hi guys! <lb><lb>&amp;nbsp;<lb><lb>LTL, FTP - all that jazz. Starting you off with a short one.<lb><lb>&amp;nbsp;<lb><lb>I'm the senior supporter at a smaller tech company with clients all over t... writing/stories tech support
3 talesfromtechsupport The computers not working. FIX IT NOW! Hey there TFTS! This is my second time posting. I don't work for any tech support company, but I do have friends, family and teachers at school that have no idea how stuff works.<lb><lb>This tale ... writing/stories tech support
4 talesfromtechsupport A Storm of Unreasonableness Usual LTL, FTP. I have shared this story on a different site, but after reading TFTS for sometime I figured it'd belong here as well. <lb><lb>This is from when I worked at a 3rd party call center ... writing/stories tech support
rspct.isnull().sum()  # That's a good sign
subreddit     0
title         0
selftext      0
category_1    0
category_2    0
dtype: int64
rspct["category_1"].value_counts()
video_game               100000
tv_show                   68000
health                    58000
profession                56000
software                  52000
electronics               51000
music                     43000
sports                    40000
sex/relationships         31000
hobby                     30000
geo                       29000
crypto                    29000
company/website           28000
other                     27000
anime/manga               26000
drugs                     23000
writing/stories           22000
programming               21000
arts                      21000
autos                     20000
advice/question           18000
education                 17000
animals                   17000
politics/viewpoint        16000
social_group              16000
card_game                 15000
food/drink                15000
stem                      14000
hardware/tools            14000
parenting                 13000
religion/supernatural     13000
books                     12000
appearance                11000
finance/money             10000
board_game                 9000
meta                       9000
movies                     7000
rpg                        7000
travel                     5000
Name: category_1, dtype: int64
keep_cats = [
    "health",
    "profession",
    "electronics",
    "hobby",
    "writing/stories",
    "advice/question",
    "social_group",
    "stem",
    "parenting",
    "books",
    "finance/money",
    "travel",
]

# Prune dataset to above categories
# Overwriting to save memory
rspct = rspct[rspct["category_1"].isin(keep_cats)]
print(rspct.shape)
print("Unique subreddits:", len(rspct["subreddit"].unique()))
rspct.head(2)
(305000, 5)
Unique subreddits: 305
subreddit title selftext category_1 category_2
0 talesfromtechsupport Remember your command line switches... Hi there, <lb>The usual. Long time lerker, first time poster, be kind etc. Sorry if this isn't the right place...<lb><lb>Alright. Here's the story. I'm an independent developer who produces my ow... writing/stories tech support
1 talesfromtechsupport I work IT for a certain clothing company and they use iPod Touchs for scanning some items [ME]- Thank you fro calling Store support, this is David. How may I help you?<lb><lb>[Store]- Yeah, my iPod is frozen<lb><lb>[ME]- Okay, can I have you hold down the power and the home button at t... writing/stories tech support
rspct = rspct.sample(frac=.3, random_state=seed)
print(rspct.shape)
rspct.head()
(91500, 5)
subreddit title selftext category_1 category_2
594781 stepparents Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious relationship her AP (23M) whom she met and cheated on me with 6 mont... parenting step parenting
617757 bigseo Do we raise our pricing? I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a struggle to get clients to... profession seo
642368 chemistry Mac vs. PC? Hello, all! I am currently a senior in high school and in the fall I will be going to SUNY Geneseo, majoring in chemistry and minoring in mathematics. <lb><lb>Geneseo requires it’s students to get... stem chemistry
325221 migraine Beer as an aural abortive? Hiya folks,<lb><lb>I've been a migraine sufferer pretty much my whole life. For me intense auras, numbness, confusion, the inability to speak or see is BY FAR the worst aspect of the ordeal. When ... health migraine
524939 MouseReview Recommend office mouse I was hoping you folks could help me out. Here's my situation and requirements:<lb><lb>* I don't play games at all<lb>* Budget $30.00 or less<lb>* Shape as close to old Microsoft Intellimouse Opti... electronics computer mouse
# Concatenate title and selftext
rspct["text"] = rspct["title"] + " " + rspct["selftext"]

# Drop categories
rspct = rspct.drop(columns=["category_1", "category_2", "title", "selftext"])
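# NOTE: can take a while to run
# Strip the <lb>/<tab> placeholders and leftover HTML entity fragments.
# regex=True is explicit because newer pandas versions default to literal matching.
rspct["text"] = rspct["text"].str.replace(r"<lb>|<tab>|&amp;|nbsp;", "", regex=True)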
rspct.head()
subreddit text
594781 stepparents Ex Wants Toddler Son (2M) to Meet Her AP/SO - x-post from /r/divorce Quick background: My soon-to-be ex-wife (26F) and I (27M) have been separated for about 5 months now. She has been in a serious...
617757 bigseo Do we raise our pricing? I took a management role at an agency. We're way, way under $500/mo for SEO pricing - and I'm embarrassed to say that we're hurting for business. Seems to me that it's a s...
642368 chemistry Mac vs. PC? Hello, all! I am currently a senior in high school and in the fall I will be going to SUNY Geneseo, majoring in chemistry and minoring in mathematics. Geneseo requires it’s students to...
325221 migraine Beer as an aural abortive? Hiya folks,I've been a migraine sufferer pretty much my whole life. For me intense auras, numbness, confusion, the inability to speak or see is BY FAR the worst aspect o...
524939 MouseReview Recommend office mouse I was hoping you folks could help me out. Here's my situation and requirements:* I don't play games at all* Budget $30.00 or less* Shape as close to old Microsoft Intellimou...
subreddits = rspct["subreddit"].unique()
print(len(subreddits))
subreddits[:50]
305
array(['stepparents', 'bigseo', 'chemistry', 'migraine', 'MouseReview',
       'Malazan', 'Standup', 'preppers', 'Invisalign', 'whatsthisplant',
       'CrohnsDisease', 'KingkillerChronicle', 'OccupationalTherapy',
       'churning', 'Libraries', 'acting', 'eczema', 'Allergies',
       'bigboobproblems', 'AskAnthropology', 'psychotherapy',
       'WayfarersPub', 'synthesizers', 'StopGaming', 'stopsmoking',
       'eroticauthors', 'amazonecho', 'TalesFromThePizzaGuy',
       'rheumatoid', 'homestead', 'VoiceActing', 'FinancialCareers',
       'Sleepparalysis', 'ProtectAndServe', 'short', 'Fibromyalgia',
       'teaching', 'PlasticSurgery', 'insomnia', 'PLC', 'rapecounseling',
       'peacecorps', 'paintball', 'autism', 'Nanny', 'Plumbing',
       'Epilepsy', 'asmr', 'fatpeoplestories', 'Magic'], dtype=object)
rspct["subreddit"].value_counts()
Dreams             340
Gifts              337
HFY                333
Cubers             333
cassetteculture    333
                  ... 
foreignservice     265
WritingPrompts     263
immigration        263
TryingForABaby     262
Physics            250
Name: subreddit, Length: 305, dtype: int64

Modeling
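
The code in this section relies on X_train, X_val, and X_test (and the matching y_train, y_val, and y_test), whose creation is not shown. Below is a minimal sketch of one way to produce them, assuming a 20% test split followed by a 10% validation split of the remainder. Those fractions are inferred from the matrix shapes reported later (65,880 / 7,320 / 18,300 rows out of 91,500); the stratification and the stand-in data are assumptions for illustration, not the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 92  # Same seed as in the configuration cell

# Stand-in data so the sketch runs on its own; in the post these would be
# rspct["text"] and rspct["subreddit"].
X = np.array([f"post number {i}" for i in range(1000)])
y = np.array([i % 10 for i in range(1000)])

# Hold out 20% of the rows for the final test set (assumed fraction)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=seed
)

# Carve 10% of the remaining rows out as a validation set (assumed fraction)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, stratify=y_train, random_state=seed
)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying on the label keeps the per-subreddit proportions roughly equal across the three sets, which matters when each class has only a few hundred rows.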

Label Encoding

# This process naively transforms each class of the target into a number
le = LabelEncoder() # Instantiate a new encoder instance
le.fit(y_train)  # Fit it on training label data

# Transform both using the trained instance
y_train = le.transform(y_train)
y_val = le.transform(y_val)
y_test  = le.transform(y_test)

y_train[:8]
array([ 92, 140,  65,  90, 278,  65, 272, 212])

Vectorization

A vectorizer is used to extract numerical features (information) from a corpus of natural language text. I used a bag-of-words method of vectorization, which, for the most part, disregards grammar.

The output of this vectorizer is a document-term matrix, with the documents (observations, or rows) on one axis and the terms (words, bigrams) on the other. The fitted vectorizer itself can be thought of as a sort of vocabulary, or text-to-number translator.
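
To make the document-term matrix concrete, here is a small sketch on a made-up three-document corpus. The documents are invented for illustration; only the unigram-plus-bigram setting mirrors the configuration used in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical posts, not from the dataset)
docs = [
    "my keyboard stopped working",
    "my plant stopped growing",
    "which keyboard should I buy",
]

# Unigrams and bigrams, as in the vectorizer configured for the real data
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# Rows are documents, columns are terms; values are TF-IDF weights
dtm = vectorizer.fit_transform(docs)

print(dtm.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

Each row of dtm is one post and each column one term or bigram; vectorizer.vocabulary_ maps every term to its column index. Note that the default tokenizer drops single-character tokens such as "I".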

At first, the "vocabulary" derived from the corpus using the vectorizer was the largest object when serialized. Luckily, there are many options and parameters available to reduce its size, most of which are simply different methods for reducing the number of features (terms) it contains.

One option is to put a hard limit of 100,000 on the number of features in the vocabulary. This is a simple, naive limit on the generated features, and thus, the resulting vocabulary size.

I decided to remove stop words before vectorization in hopes that this would reduce the size of the vector vocabulary. To my initial surprise, removing the stop words (using NLTK's list) actually increased the size of the serialized vocab from 59 MB to 76 MB.

After some consideration, I found this to be a reasonable result. I figured that many of the stop words are short ("I", "me", "my", etc.), and their removal caused the average length of words (and therefore bigrams as well) in the vocab to increase. While this may not account for the entirety of the difference, this provides some intuition as to why there is a difference.

Although the vocab without stop words was larger, I ended up using it anyways because it provided an extra ~0.01 in the precision-at-k score of the final model.
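
One way to check the serialized size of a fitted vectorizer is to pickle it in memory and count the bytes. The sketch below does that on a tiny stand-in corpus; it uses scikit-learn's built-in "english" stop word list in place of NLTK's so that it runs without a download, and the helper name pickled_size is hypothetical. The numbers it prints say nothing about the real dataset.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer


def pickled_size(obj) -> int:
    """Return the size in bytes of an object's in-memory pickle."""
    return len(pickle.dumps(obj))


# Tiny stand-in corpus; the real comparison was made on the Reddit training text
corpus = [
    "I think my keyboard stopped working after the update",
    "what is the best way to water my plant",
    "she told me that the interview went well",
]

# Same n-gram range, with and without stop word removal
with_stops = TfidfVectorizer(ngram_range=(1, 2)).fit(corpus)
without_stops = TfidfVectorizer(ngram_range=(1, 2), stop_words="english").fit(corpus)

print("with stop words:   ", pickled_size(with_stops), "bytes")
print("without stop words:", pickled_size(without_stops), "bytes")
```

Running the same comparison on the actual training text is what would reveal whether a given setting helps or hurts the deployed footprint.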

lengths = []
four_or_below = []
for word in stop_words:
    lengths.append(len(word))
    # Track the stop words that are 4 characters long or shorter
    if len(word) <= 4:
        four_or_below.append(len(word))

print(f"There are {len(stop_words)} stop words in the list.")
print(f"{len(four_or_below)} are 4 chars long or shorter.")
print(f"Average length is: {np.mean(lengths)}.")
There are 179 stop words in the list.
109 are 4 chars long or shorter.
Average length is: 4.229050279329609.
tfidf = TfidfVectorizer(
    max_features=100000,
    min_df=10,
    ngram_range=(1,2),
    stop_words=stop_words,  # Use nltk's stop words
)

# Fit the vectorizer on the training text to learn the vocabulary
# (fit returns the fitted vectorizer itself, not a matrix)
vocab = tfidf.fit(X_train)

# Get sparse document-term matrices
X_train_sparse = vocab.transform(X_train)
X_val_sparse = vocab.transform(X_val)
X_test_sparse = vocab.transform(X_test)

X_train_sparse.shape, X_val_sparse.shape, X_test_sparse.shape
((65880, 63588), (7320, 63588), (18300, 63588))
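
For a sense of why these matrices are kept in sparse form, the sketch below estimates what the training matrix above would occupy as a dense float64 array, then compares sparse and dense storage on a small random matrix. The random matrix and its density are made up for illustration and are not derived from the Reddit data.

```python
import numpy as np
from scipy import sparse

# Training matrix shape taken from the output above
n_rows, n_cols = 65_880, 63_588
dense_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize
print(f"Dense float64 size: {dense_bytes / 1e9:.1f} GB")

# Small random CSR matrix (hypothetical density) to show how storage
# scales with the number of non-zero entries rather than the full shape
small = sparse.random(1_000, 5_000, density=0.001, format="csr", random_state=0)
csr_bytes = small.data.nbytes + small.indices.nbytes + small.indptr.nbytes
dense_small_bytes = small.shape[0] * small.shape[1] * small.dtype.itemsize

print(f"CSR: {csr_bytes / 1e3:.0f} kB vs dense: {dense_small_bytes / 1e6:.0f} MB")
```

The dense estimate is far beyond what a 500 MB server could hold, which is why every step of the pipeline operates on scipy sparse matrices.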

Feature Selection

As mentioned previously, the size of the corpus means the dimensionality of the featureset after vectorization will be very high. I passed in 100,000 as the maximum number of features to the vectorizer; combined with min_df=10, the fitted vocabulary came out at 63,588 terms. However, the features would have to be reduced further before training the model, as it is generally not good practice to have about as many features as observations (the training set has 65,880 rows).

To reduce the feature count further, I used a process called select k best (scikit-learn's SelectKBest), which does exactly what it sounds like: it selects a certain number of the best features. The key aspect of this process is how to measure the value of the features, that is, how to find which ones are the "best". The scoring function I used in this case is called chi2 (chi-squared).

This function calculates chi-squared statistics between each feature and the target, measuring the dependence, or correlation, between them. The intuition here is that features which are more correlated with the target are more likely to be useful to the model.

I played around with some different values for the maximum number of features to be selected. Ultimately, I was once again limited by the size of the free Heroku Dyno, and settled on 20,000. This allowed the deployment to go smoothly while retaining enough information for the model to have adequate performance.

selector = SelectKBest(chi2, k=20000)

selector.fit(X_train_sparse, y_train)

X_train_select = selector.transform(X_train_sparse)
X_val_select = selector.transform(X_val_sparse)
X_test_select  = selector.transform(X_test_sparse)

X_train_select.shape, X_val_select.shape, X_test_select.shape
((65880, 20000), (7320, 20000), (18300, 20000))
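
To see what chi-squared scoring does on a small scale, here is a sketch with a hand-made count matrix. The numbers are invented: the first two terms each appear in only one class, while the third appears equally in both.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy term counts: rows are documents, columns are terms
X = np.array([
    [3, 0, 1],
    [2, 0, 1],
    [0, 3, 1],
    [0, 2, 1],
])
y = np.array([0, 0, 1, 1])  # Class label of each document

# Score every term against the target
scores, p_values = chi2(X, y)
print("chi2 scores:", scores)

# Keep the two highest-scoring terms
selector = SelectKBest(chi2, k=2).fit(X, y)
kept = selector.get_support(indices=True)
print("kept columns:", kept)
```

The term that is spread evenly across both classes scores zero and is dropped, while the two class-specific terms are kept. The same logic, applied to tens of thousands of terms and 305 classes, is what trims the vocabulary here.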

Model validation

In this case, the model has a target that it is attempting to predict—a supervised problem. Therefore, the performance can be measured on validation and test sets.

To test out the recommendations I copied some posts and put them through the prediction pipeline to see what kinds of subreddits were getting recommended. For the most part, the predictions were decent.

The cases where the recommendations were a little less than ideal happened when I pulled example posts from subreddits that were not in the training data. Even then, the model generally did a good job recommending similar subreddits.

Baseline

For the baseline model, I decided to go with a basic random forest. This choice was somewhat arbitrary, though I was curious to see how a random forest would do with such a high target cardinality (number of classes/categories).

The baseline precision-at-k metrics for the random forest on the validation set were .54, .63, and .65, for k of 1, 3, and 5, respectively.

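def precision_at_k(y_true, y_pred, k=5):
    """Share of rows whose true label is among the k most probable classes."""
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    # Sort class indices by predicted probability (ascending)
    y_pred = np.argsort(y_pred, axis=1)
    # Reverse to descending order and keep the top k class indices per row
    y_pred = y_pred[:, ::-1][:, :k]
    # A row counts as a hit if its true label is among its top k
    arr = [y in s for y, s in zip(y_true, y_pred)]
    return np.mean(arr)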
rfc = RandomForestClassifier(max_depth=32, n_jobs=-1, n_estimators=200)
rfc.fit(X_train_select, y_train)
RandomForestClassifier(max_depth=32, n_estimators=200, n_jobs=-1)
y_pred_proba_rfc = rfc.predict_proba(X_val_select)

# For each prediction, find the index with the highest probability
y_pred_rfc = np.argmax(y_pred_proba_rfc, axis=1)
y_pred_rfc[:10]
array([296, 139, 177,  78,  12, 177, 161, 216,  40,  31])
print("Validation scores:")
print("  precision@1 =", np.mean(y_val == y_pred_rfc))
print("  precision@3 =", precision_at_k(y_val, y_pred_proba_rfc, 3))
print("  precision@5 =", precision_at_k(y_val, y_pred_proba_rfc, 5))
Validation scores:
  precision@1 = 0.5368852459016393
  precision@3 = 0.6282786885245901
  precision@5 = 0.6502732240437158
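
To make the metric concrete: what this post calls precision-at-k is the share of posts whose true subreddit appears among the k highest-probability classes (often also called top-k accuracy). The sketch below reimplements that idea on a made-up probability matrix; the function name top_k_hit_rate and the numbers are hypothetical.

```python
import numpy as np


def top_k_hit_rate(y_true, proba, k):
    """Fraction of rows whose true class index is among the k largest probabilities."""
    # Class indices sorted by descending probability, truncated to the top k
    top_k = np.argsort(proba, axis=1)[:, ::-1][:, :k]
    hits = [label in row for label, row in zip(y_true, top_k)]
    return float(np.mean(hits))


# Four toy posts, three classes; each row sums to 1
proba = np.array([
    [0.6, 0.3, 0.1],  # true class 0: ranked first
    [0.5, 0.4, 0.1],  # true class 1: ranked second
    [0.5, 0.3, 0.2],  # true class 2: ranked third
    [0.1, 0.2, 0.7],  # true class 2: ranked first
])
y_true = np.array([0, 1, 2, 2])

for k in (1, 2, 3):
    print(f"k={k}:", top_k_hit_rate(y_true, proba, k))
```

This only works when the labels are integer-encoded in the same order as the probability columns, which is what LabelEncoder and predict_proba provide together.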

Multinomial Naive Bayes

Multinomial naive Bayes is a probabilistic learning method for multinomially distributed data, and one of two classic naive Bayes algorithms used for text classification. I decided to iterate with this algorithm because it is meant for text classification tasks.

The precision-at-k metrics for the final multinomial naive Bayes model on the validation set were .76, .88, and .92, for k of 1, 3, and 5, respectively. Performance on the test set was nearly identical: .75, .88, and .92.

nb = MultinomialNB(alpha=0.1)
nb.fit(X_train_select, y_train)
MultinomialNB(alpha=0.1)

Evaluate on validation set

y_pred_proba_val = nb.predict_proba(X_val_select)

# For each prediction, find index with highest probability
y_pred_val = np.argmax(y_pred_proba_val, axis=1)
y_pred_val[:10]
array([274, 139,  57,  78,  12,  17, 151, 216,  40, 171])
print("Validation scores:")
print("  precision@1 =", np.mean(y_val == y_pred_val))
print("  precision@3 =", precision_at_k(y_val, y_pred_proba_val, 3))
print("  precision@5 =", precision_at_k(y_val, y_pred_proba_val, 5))
Validation scores:
  precision@1 = 0.7599726775956284
  precision@3 = 0.8834699453551913
  precision@5 = 0.9188524590163935

Evaluate on test set

y_pred_proba_test = nb.predict_proba(X_test_select)

# For each prediction, find index with highest probability
y_pred_test = np.argmax(y_pred_proba_test, axis=1)
y_pred_test[:10]
array([ 97, 199, 116, 249,  43, 203, 263, 275,  96,  27])
print("Test scores:")
print("  precision@1 =", np.mean(y_test == y_pred_test))
print("  precision@3 =", precision_at_k(y_test, y_pred_proba_test, 3))
print("  precision@5 =", precision_at_k(y_test, y_pred_proba_test, 5))
Test scores:
  precision@1 = 0.7498360655737705
  precision@3 = 0.8834426229508197
  precision@5 = 0.9159562841530055

Recommendations

The API should return a list of recommendations, not a single prediction. To accomplish this, I wrote a function that returns the top 5 most likely subreddits and their respective probabilities.

# The main functionality of the predict API endpoint
def predict(title: str, submission_text: str, return_count: int = 5):
    """Serve subreddit predictions.
    
    Parameters
    ----------
    title : string
        Title of post.
    submission_text : string
        Selftext that needs a home.
    return_count    : integer
        The desired number of recommendations.

    Returns
    -------
    Python dictionary formatted as follows:
        {'predictions': [{'name': 'PLC', 'proba': 0.014454},
                         ...
                         {'name': 'Rowing', 'proba': 0.005206}]}
    """
    # Concatenate title and post text, separated by a space as in training
    fulltext = str(title) + " " + str(submission_text)
    # Vectorize the post -> sparse doc-term matrix
    post_sparse = vocab.transform([fulltext])
    # Feature selection
    post_select = selector.transform(post_sparse)
    # Generate predicted probabilities from trained model
    proba = nb.predict_proba(post_select)
    # Wrangle into correct format
    proba_dict = (pd
                .DataFrame(proba, columns=[le.classes_])  # Classes as column names
                .T  # Transpose so column names become index
                .reset_index()  # Pull out index into a column
                .rename(columns={"level_0": "name", 0: "proba"})  # Rename for aesthetics
                .sort_values(by="proba", ascending=False)  # Sort by probability
                .iloc[:return_count]  # n-top predictions to serve
                .to_dict(orient="records")
               )
    proba_json = {"predictions": proba_dict}
    
    return proba_json
title_science = """Is there an evolutionary benefit to eating spicy food that lead to consumption across numerous cultures throughout history? Or do humans just like the sensation?"""

post_science = """I love spicy food and have done ever since I tried it. By spicy I mean HOT, like chilli peppers (we say spicy in England, I don't mean to state the obvious I'm just not sure if that's a global term and I've assumed too much before). I love a vast array of spicy foods from all around the world. I was just wondering if there was some evolutionary basis as to why spicy food managed to become some widely consumed historically. Though there seem to

It way well be that we just like a tingly mouth, the simple things in life."""

science_recs = predict(title_science, post_science)
science_recs
{'predictions': [{'name': 'GERD', 'proba': 0.009900622287634142},
  {'name': 'Allergies', 'proba': 0.009287774623361566},
  {'name': 'ibs', 'proba': 0.009150308633162811},
  {'name': 'AskAnthropology', 'proba': 0.009028660140513678},
  {'name': 'fatpeoplestories', 'proba': 0.00851982441049019}]}
title_pc = """Looking for help with a build"""

post_pc = """I posted my wants for my build about 2 months ago. Ordered them and when I went to build it I was soooooo lost. It took 3 days to put things together because I was afraid I would break something when I finally got the parts together it wouldn’t start, I was so defeated. With virtually replacing everything yesterday it finally booted and I couldn’t be more excited!"""

post_pc_recs = predict(title_pc, post_pc, 10)
post_pc_recs
{'predictions': [{'name': 'lego', 'proba': 0.008418484170536294},
  {'name': 'rccars', 'proba': 0.008112076951648648},
  {'name': 'MechanicalKeyboards', 'proba': 0.0078335440606017},
  {'name': 'fightsticks', 'proba': 0.007633958584830632},
  {'name': 'Luthier', 'proba': 0.00716176615193545},
  {'name': 'modeltrains', 'proba': 0.007088134228361153},
  {'name': 'cade', 'proba': 0.007058109839673285},
  {'name': 'vandwellers', 'proba': 0.006700262151491209},
  {'name': 'cosplay', 'proba': 0.006536648725434882},
  {'name': 'homestead', 'proba': 0.006166832450007183}]}
post_title = """What to do about java vs javascript"""

post = """I am a new grad looking for a job and currently in the process with a company for a junior backend engineer role. I was under the impression that the position was Javascript but instead it is actually Java. My general programming and "leet code" skills are pretty good, but my understanding of Java is pretty shallow. How can I use the next three days to best improve my general Java knowledge? Most resources on the web seem to be targeting complete beginners. Maybe a book I can skim through in the next few days?

Edit:

A lot of people are saying "the company is a sinking ship don't even go to the interview". I just want to add that the position was always for a "junior backend engineer". This company uses multiple languages and the recruiter just told me the incorrect language for the specific team I'm interviewing for. I'm sure they're mainly interested in seeing my understanding of good backend principles and software design, it's not a senior lead Java position."""

# === Test out the function === #
post_pred = predict(post_title, post)  # Default is 5 results
post_pred
{'predictions': [{'name': 'cscareerquestions', 'proba': 0.516989539243874},
  {'name': 'devops', 'proba': 0.031462691062989795},
  {'name': 'interviews', 'proba': 0.02846504725703069},
  {'name': 'datascience', 'proba': 0.024227300545057697},
  {'name': 'bioinformatics', 'proba': 0.017516176338177075}]}
title_book = "Looking for books with great plot twists"

# This one comes from r/suggestmeabook
post2 = """I've been dreaming about writing my own stort story for a while but I want to give it an unexpected ending. I've read lots of books, but none of them had the plot twist I want. I want to read books with the best plot twists, so that I can analyze what makes a good plot twist and write my own story based on that points. I don't like romance novels and I mostly enjoy sci-fi or historical books but anything beside romance novels would work for me, it doesn't have to be my type of novel. I'm open to experience after all. I need your help guys. Thanks in advance."""

# === This time with 10 results === #
post2_pred = predict(title_book, post2, 10)
post2_pred
{'predictions': [{'name': 'suggestmeabook', 'proba': 0.4070015062748489},
  {'name': 'writing', 'proba': 0.14985778378113648},
  {'name': 'eroticauthors', 'proba': 0.07159411817054702},
  {'name': 'whatsthatbook', 'proba': 0.06062653422250441},
  {'name': 'ComicBookCollabs', 'proba': 0.027277418056905547},
  {'name': 'Malazan', 'proba': 0.019514923212723943},
  {'name': 'TheDarkTower', 'proba': 0.017162701613834493},
  {'name': 'DestructiveReaders', 'proba': 0.0151031907793204},
  {'name': 'WoT', 'proba': 0.011165890302931272},
  {'name': 'readyplayerone', 'proba': 0.007566597361383115}]}

Model deployment

As mentioned, the model, vocab, and feature selector were all serialized using Python's pickle module. In the Flask app, the pickled objects are loaded and ready for use, just like that.
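As a rough illustration of that round trip, here is a minimal sketch using only the standard library. The `save_artifact` and `load_artifact` helpers, the file name, and the stand-in `vocab` dictionary are all hypothetical; the real project pickles the fitted model, vocab, and feature selector instead.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifact(obj, path: Path) -> None:
    """Serialize a Python object to disk (done once, after training)."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_artifact(path: Path):
    """Load a pickled object (done once, when the app starts)."""
    # Only unpickle files you created yourself: loading untrusted
    # pickle data can execute arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for one of the trained objects.
vocab = {"java": 0, "interview": 1, "book": 2}

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "vocab.pkl"
    save_artifact(vocab, vocab_path)
    loaded_vocab = load_artifact(vocab_path)
```

Loading the objects at startup, rather than inside each request handler, avoids paying the deserialization cost on every prediction.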

I will go over the details of how the Flask app was set up in a separate blog post.


Final Thoughts

For me, the most valuable lessons from this project centered on scope management. I constantly had to ask myself, "What is the best version of this I can create given our limitations?"

At first, I thought it would be feasible to predict all of the 1,000+ subreddits in the data, and wasted hours of valuable time attempting to do so. Although I tested various strategies for reducing the complexity of the model, performance was rather terrible when it was trained on 100 or fewer examples of each subreddit in the complete list.

The data scientist with whom I primarily worked (the team also included one other data scientist and one other machine learning engineer, neither of whom contributed significantly to the project) kept telling me that I should try reducing the number of classes first, allowing for more examples of each class and fewer classes for the model to predict.

Ultimately, this is the strategy that worked best, and I wasted valuable time by not listening to him the first few times he recommended it. Good teamwork requires members to be humble and to listen, something that I have taken to heart since the conclusion of this project.

Scope Management, Revisited

As time was very short while building this initial recommendation API, there are many things that we wish we could have done but simply did not have time for. Here are a few of the more obvious improvements that could be made.

The first, and most obvious one, is to simply deploy to a more powerful server, such as one hosted on AWS Elastic Beanstalk or EC2. This way, we could use the entire dataset to train an optimal model without worrying (as much) about memory limits.

Second, I could use a scikit-learn pipeline to tune hyperparameters with cross-validation, instead of relying on a separate validation set. This pipeline could also be serialized as a single object, rather than as separate pieces (encoder, vectorizer, feature selector, and classifier). As a final note on this train of thought, Joblib could potentially provide more efficient serialization than the pickle module, allowing a more complex pipeline to be deployed on the same server.
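To make the idea concrete, here is a minimal sketch of such a pipeline. Everything in it is an assumption for illustration: the tiny corpus is made up, and `TfidfVectorizer`, `SelectKBest`, and `MultinomialNB` are stand-ins rather than the components actually used in the project. It tunes two hyperparameters with `GridSearchCV`, then round-trips the best pipeline through Joblib as one object.

```python
import io

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus, purely for illustration (not the project's data).
texts = [
    "how do I prepare for a java interview",
    "junior backend engineer salary negotiation",
    "resume tips for new grad software jobs",
    "should I accept this developer job offer",
    "books with great plot twists",
    "recommend a sci-fi novel with a surprise ending",
    "looking for historical fiction suggestions",
    "what fantasy book should I read next",
]
labels = ["cscareerquestions"] * 4 + ["suggestmeabook"] * 4

# Vectorizer, feature selector, and classifier chained into one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=10)),
    ("clf", MultinomialNB()),
])

# Cross-validated search over step parameters, addressed as "<step>__<param>".
param_grid = {"select__k": [5, 10], "clf__alpha": [0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

# Serialize the whole tuned pipeline as a single object, then restore it.
buffer = io.BytesIO()
joblib.dump(search.best_estimator_, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

sample = ["any good mystery novels to read"]
restored_prediction = restored.predict(sample)
```

Because the vectorizer and feature selector are refit inside each cross-validation fold, this setup also avoids leaking information from the held-out fold into the preprocessing steps.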

Third, a model could have been trained to classify the input post into a broad category first. Then, a second model could be used to classify it into a specific subreddit within that broad category. I'm not sure about the feasibility of the second part of this idea, but thought it could be an interesting one to explore.
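A minimal sketch of that two-stage routing, using toy keyword counters in place of trained models, might look like the following. The category groupings, keyword sets, and the `keyword_classifier` and `suggest` helpers are all hypothetical; the sketch only shows the structure (one broad classifier, then one classifier per category) and says nothing about how well real models would perform.

```python
from collections import Counter
from typing import Callable, Dict

Classifier = Callable[[str], str]


def keyword_classifier(keywords: Dict[str, set]) -> Classifier:
    """Build a toy classifier that picks the label with the most keyword hits."""
    def classify(text: str) -> str:
        tokens = Counter(text.lower().split())
        scores = {
            label: sum(tokens[word] for word in words)
            for label, words in keywords.items()
        }
        return max(scores, key=scores.get)
    return classify


# Stage 1: broad category (hypothetical groupings, not from the project).
broad = keyword_classifier({
    "tech": {"java", "engineer", "backend", "devops"},
    "books": {"book", "books", "novel", "plot"},
})

# Stage 2: one classifier per broad category.
specific = {
    "tech": keyword_classifier({
        "cscareerquestions": {"job", "interview", "junior"},
        "devops": {"deploy", "pipeline", "docker"},
    }),
    "books": keyword_classifier({
        "suggestmeabook": {"looking", "recommend", "suggest"},
        "whatsthatbook": {"forgot", "title", "remember"},
    }),
}


def suggest(text: str) -> Dict[str, str]:
    """Route the post to a broad category, then to a subreddit within it."""
    category = broad(text)
    subreddit = specific[category](text)
    return {"category": category, "subreddit": subreddit}


result = suggest("Looking for books with great plot twists")
```

One trade-off of this design is that a mistake in the first stage cannot be recovered in the second, so the broad classifier would need to be quite reliable.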

Lastly, different types of models could have been tested in the various steps of the pipeline. Here, I'm referring primarily to deep learning and neural networks. For example, word vectors could be generated with a word embedding model such as Word2Vec. Alternatively, the whole process could be rebuilt with a deep learning framework such as PyTorch or a text classification library such as fastText.

I plan to explore at least some of these in separate blog posts.

As always, thank you for reading! I'll see you in the next one.