The Problem

Although recreational cannabis has been legalized in many states, it is important to remember that for many, cannabis is a medication used to treat specific conditions. As with any medication, there are many factors that contribute to its effectiveness—not only dosage and intake method, but the patient's weight, age, fitness, etc.

I have not found solid data to back up this assertion, but I'd bet there are many people out there who could benefit greatly from medical cannabis but don't want to try it because of one or more bad experiences in the past.

For example, take one of our user personas, Kate, who would like to find strains to help with fatigue and depression. She (really our UX designer, who created the persona based on user research) said, "I don't want to feel like I'm high; I want to feel like I'm just having a really great day."

I'd bet many people who think the same don't realize that this experience is just as feasible with cannabis as it is with other forms of medication, if not more so. But it may take some effort to find the right combination of strain, dosage, and timing.

The MediZen app aims to help with this effort.

The App

MediZen is an app for cannabis patients to get strain recommendations based on their desired effects and characteristics. The goal was to provide a platform for users to both find new strains and document their experiences with each one they try by keeping track of dose amounts and times, as well as a list of favorites.

I worked with a remote interdisciplinary team of data scientists, web developers, and a UX designer. The other data scientists and I were tasked with developing and deploying the strain recommendation system. Our API would receive a set of effects, characteristics, and descriptions chosen by the user, and return a list of strains.

The recommendation API was built using Flask and deployed to Heroku, with the recommendations themselves coming from two serialized models: a TF-IDF vectorizer to convert text inputs into vectors, and a k-nearest neighbors model to find similar strains. Both of these were built using Scikit-learn.
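To make that architecture concrete, here is a minimal sketch of how a Flask endpoint could serve recommendations from two serialized Scikit-learn models. It is not the project's actual code: the `/recommend` route, the `preferences` payload field, and the toy strains are all hypothetical, and the pickle round trip only stands in for loading model files from disk at startup. The real API is documented in the repository's README.

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy data (made up); each document describes the strain at the same index.
STRAINS = ["Strain A", "Strain B", "Strain C", "Strain D"]
DOCS = [
    "hybrid happy relaxed earthy sweet",
    "indica sleepy relaxed earthy pine",
    "sativa energetic uplifted citrus sweet",
    "sativa creative uplifted citrus lemon",
]

# Fit both models, then round-trip them through pickle to mimic
# deserializing pre-trained models when the app starts.
vectorizer = pickle.loads(pickle.dumps(TfidfVectorizer().fit(DOCS)))
knn = pickle.loads(
    pickle.dumps(NearestNeighbors(n_neighbors=2).fit(vectorizer.transform(DOCS)))
)

app = Flask(__name__)


@app.route("/recommend", methods=["POST"])  # hypothetical route name
def recommend():
    payload = request.get_json(force=True)
    # Join the user's chosen words into one query "document".
    query = " ".join(payload.get("preferences", []))
    _, indices = knn.kneighbors(vectorizer.transform([query]))
    return jsonify(strains=[STRAINS[i] for i in indices[0]])
```

Flask's built-in test client can exercise the endpoint without starting a server, for example `app.test_client().post("/recommend", json={"preferences": ["sativa", "citrus"]})`.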

The team had one week-long sprint (really 4 days, as Friday was the demo day) of full-time work to complete the project. The MVP recommendation system was up and running within two days, with multiple iterations being deployed by the end of the project.

Unfortunately, the UX designer never got to see her designs come all the way to life in the live app, as the web development team was not quite finished by the end of the week. Therefore, I cannot point you to a link for the live app, however much I'd like to do so. All I have are some screenshots of the designs, which are interspersed throughout the text in the following sections.

The recommendation API is live, however, in case you want to play around with it. Instructions for doing so can be found in the README in the repository linked below, alongside the code and notebooks for the data science/machine learning side of the project.

MediZen desktop landing page design.

The dataset we used is the Cannabis Strains dataset, which consists of the following information on each of 2,350 strains:

  • Strain
  • Type (indica, sativa, or hybrid)
  • Rating (Leafly users' average rating)
  • Effects (creative, relaxed, happy, uplifted, etc.)
  • Flavor (earthy, sweet, citrus, etc.)
  • Description (background and summary of strain)


The dataset is relatively clean, with only 46 null values in the 'flavor' column and 33 in 'description'. However, upon further inspection, it was found that 77 rows had null values for 'flavor' and 'effects' indicated by the string "None". After converting those to proper null values and dropping them, the resulting dataset was 2,163 rows.
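The "None"-string cleanup described above might look something like the following in pandas. This is a minimal sketch on a made-up four-row frame; the lowercase column names and the choice to drop only on 'effects' and 'flavor' are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; every value here is made up.
df = pd.DataFrame(
    {
        "strain": ["Strain A", "Strain B", "Strain C", "Strain D"],
        "type": ["hybrid", "indica", "sativa", "hybrid"],
        "effects": ["Happy,Relaxed", "None", "Creative,Uplifted", "Sleepy"],
        "flavor": ["Earthy,Sweet", "Citrus", "None", None],
    }
)

# Treat the literal string "None" as a proper missing value.
text_cols = ["effects", "flavor"]
df[text_cols] = df[text_cols].replace("None", np.nan)

# Drop every row that is missing an effect or a flavor.
clean = df.dropna(subset=text_cols).reset_index(drop=True)
```

Checking `df.isna().sum()` before and after the replacement is a quick way to see how many "hidden" nulls the string placeholder was masking.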

We decided to drop 'rating' right off the bat, as it would be somewhat silly for a user to enter a rating preference; we doubted anyone would search for a 4.2 strain over a 4.4 strain. Unless we had actual user data to complement the rating, that data was useless to us.

We decided to use natural language processing (NLP) techniques to represent the text data as numbers. NLP models trained on large amounts of text tend to be very large when serialized, and Heroku has a limit of 500 MB for non-Docker deployments. With that in mind, we wanted to keep things relatively simple, at least to start, to minimize issues while deploying. Once we hit MVP, which was an API that successfully served reasonable strain recommendations, we could iterate to our hearts' content.

To that end, we decided to start without using the 'description' column, as it made up about 90% of the text in the dataset and would have made the serialized models much larger and more unwieldy. Better to start small and add complexity as needed. Furthermore, it had a much wider variety of words, potentially adding more noise than useful data.

The last bit of wrangling was concatenating 'type', 'effects', and 'flavor' into a single column. This way, each row was a short document (NLP jargon for a row/observation) with only the most important words describing each strain.
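As a rough illustration of that concatenation step, the sketch below joins the three columns into one space-separated document per strain. The `document` column name, the lowercasing, and the toy rows are assumptions, not details from the project.

```python
import pandas as pd

# Two made-up rows using the three columns that were kept.
df = pd.DataFrame(
    {
        "type": ["hybrid", "indica"],
        "effects": ["Happy,Relaxed", "Sleepy,Relaxed"],
        "flavor": ["Earthy,Sweet", "Citrus"],
    }
)

# Join the columns with commas, then normalize to lowercase,
# space-separated words so each row reads as one short document.
df["document"] = (
    (df["type"] + "," + df["effects"] + "," + df["flavor"])
    .str.lower()
    .str.replace(",", " ", regex=False)
)
```

Each value in `document` is then just a handful of words, which keeps the vocabulary small.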


The data, as far as we were working with it, was all text features. The 'type' feature has three categories: hybrid, indica, and sativa. Hybrid is the most common by far with 1,105 strains, followed by indica (652), and finally sativa (406).

Histogram showing the counts for each strain type.

The 'effects' and 'flavor' features contained comma-separated lists of characteristics, consisting of 15 unique effects and 49 flavors. They had a mean character count of 38.71 and 19.87, and a mean word count of 4.95 and 2.97, respectively.
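Summary numbers like these can be computed with a few pandas string methods. The sketch below uses three made-up values and assumes that a "word" is one comma-separated item, which may not be exactly how the original counts were produced.

```python
import pandas as pd

# Made-up values standing in for the real 'effects' column.
effects = pd.Series(
    ["Happy,Relaxed", "Creative,Uplifted,Energetic", "Relaxed,Sleepy"]
)

# Characters per row, and comma-separated items per row.
char_counts = effects.str.len()
word_counts = effects.str.split(",").str.len()
mean_chars = char_counts.mean()
mean_words = word_counts.mean()

# One row per individual effect, then count how often each appears.
effect_counts = effects.str.split(",").explode().value_counts()
n_unique = effect_counts.size
```

A series like `effect_counts` is also the kind of input a frequency bar chart would be drawn from.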

The frequencies of the effects and flavors are plotted in the bar charts below.

Most frequent effects and their counts.
Most frequent flavors and their counts.

The 'description' feature was exactly as the name suggests, a short-answer type of text summary and background of the strain.

It had a mean character count of 454.35 and mean word count of 74.08, and was much more dispersed around these two numbers, with the character and word counts topping out at 1,180 and 188, respectively. Furthermore, 'description' contains the vast majority of the total characters and words in the dataset, clocking in at 88.58% and 90.34%, respectively.

Unlike 'effects' and 'flavor', which are lists of adjectives rather than the way someone would "naturally" speak or write, the 'description' text is what would be considered natural language. And considering the free-form nature of such text, it's reasonable to assume that the words it consists of are much more varied. The following heat map shows the top 20 words, after removing stop words, that make up the highest percentage of the total number of words.

Heatmap showing top 20 most common words in 'description', excluding stop words.

Albert Einstein once said...

"Everything should be made as simple as possible, but not simpler."

We figured the adjectives describing the effects and flavors of each strain would be sufficient to provide decent recommendations. And this had the benefit of allowing us to start with small yet effective models that were easily deployable to a free Heroku server.

So with that, let's move on to the modeling!

I say models, plural, because there were technically two models that were trained: a vectorizer and a nearest neighbors model.

Due to the simplicity of the dataset and the initial MVP, we had another option available to us that would potentially achieve a similar goal as a recommendation system: a database filtering tool. Because the user would check off their preferences for types, flavors, and effects (see screenshots below), we could set up the database such that it could be queried directly.

We chose not to use this method because the returned list of recommendations would be limited to the strains that exactly matched the options the user chose. In other words, it would not be a recommendation system so much as a filtering tool.

For example, if a user chose to filter by sativa, they would only see sativa strains. So far so good. But what if the actual best strain for them, when taking all of the other characteristics into account, is a hybrid?

Sure, they might still be able to find a good match. However, the "recommendations" (actually just a filtered list) would fail to adequately gauge their needs and miss out on providing value to the user.
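The difference between filtering and recommending can be shown with a few lines of plain Python. In this sketch, the strains and traits are made up, and Jaccard overlap is used only as a simple stand-in for a similarity score; it is not the TF-IDF and nearest neighbors approach the project used.

```python
# Made-up strains, each described by a set of traits.
strains = {
    "Strain A": {"hybrid", "happy", "uplifted", "citrus"},
    "Strain B": {"sativa", "sleepy", "earthy"},
    "Strain C": {"sativa", "happy", "pine"},
}

# What the user checked off.
wanted = {"sativa", "happy", "uplifted", "citrus"}

# Filtering: keep only strains that have every requested trait.
exact_matches = [name for name, traits in strains.items() if wanted <= traits]


def jaccard(a, b):
    """Share of traits two sets have in common (intersection over union)."""
    return len(a & b) / len(a | b)


# Ranking: order every strain by how much it overlaps with the request.
ranked = sorted(strains, key=lambda name: jaccard(wanted, strains[name]), reverse=True)
```

Here the strict filter returns nothing at all, while the ranking puts the hybrid first because it matches three of the four requested traits.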

Plus, it wouldn't be as flexible when taking user data into account (though we never got that far). Indeed, this method would've been simple enough that, given an extra day or two, we likely could have included it in addition to the recommendation system.

Design of page (desktop layout) where users can pick their preferences.
Design of page (desktop layout) where users can check off their desired effects.


The actual recommendations are powered by the nearest neighbors model. However, computational algorithms like this can only use numerical data. And, as explored in the previous section, the data we were working with was textual, calling for some natural language processing techniques. One of the fundamental concepts in NLP, and the first step in many NLP projects, is vectorization—converting text into numbers.

Before the nearest neighbors model could be trained, the vectorizer had to be trained on all of the text features in the dataset. In this context, "trained" means running the text data through the vectorizer in order for it to "learn" the numerical representations of the corpus; to generate its vocabulary for translating from text to numbers. Once it is trained, it can be used to transform arbitrary text, such as a user's strain preferences, in a consistent manner.

The vectorizer we used is called term frequency-inverse document frequency (TF-IDF), a weighting scheme that highlights the words that set each document apart from the rest of the corpus. Term frequency is the number of times a word appears in a given document. Inverse document frequency penalizes a word for appearing in many documents across the corpus, so words that show up everywhere carry less weight.

The TF-IDF method is useful because it can represent how important each word is to a document relative to the rest of the corpus.
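Here is a small sketch of fitting Scikit-learn's `TfidfVectorizer` on a toy corpus and then transforming a new query with the learned vocabulary. The three documents are made up and the vectorizer uses default settings, which may differ from the project's configuration. With those defaults, the inverse document frequency of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents in the same style as the concatenated strain text.
corpus = [
    "hybrid happy relaxed earthy sweet",
    "indica sleepy relaxed earthy pine",
    "sativa energetic uplifted citrus sweet",
]

# "Training" the vectorizer: learn the vocabulary and IDF weights,
# then turn each document into a row of TF-IDF values.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

vocab = vectorizer.vocabulary_  # word -> column index
idf = vectorizer.idf_  # one IDF weight per column

# A word found in one document gets a higher weight than a word found in two.
rare_weight = idf[vocab["hybrid"]]
common_weight = idf[vocab["relaxed"]]

# Arbitrary new text is mapped onto the same columns; unknown words are ignored.
query_vector = vectorizer.transform(["relaxed and unfamiliar"])
```

Note that `fit_transform` is called once on the corpus, while `transform` alone is used for anything that arrives later, such as a user's preferences.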

Nearest Neighbors

Now that the documents have all been vectorized, they can be used to generate recommendations. The nearest neighbors algorithm works by calculating the n-dimensional distance between documents, 'n' being the number of features. With those distances, the model can be queried for the documents most similar to any given one.

In this case, each document exists in a sort of multi-dimensional neighborhood with other documents in the corpus, with varying distances between each one and its neighbors. The closer two documents are to one another, the more similar they are.

This is how we set up our recommendation system: the input string is vectorized using the same vocabulary that was fitted to the original corpus, then the most similar documents are retrieved using the nearest neighbors algorithm. Those k-nearest neighbors to the input vector become the recommendations that are served back to the user.
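The whole flow can be sketched in a few lines with Scikit-learn. The strains, documents, and `recommend` helper below are hypothetical, and the model uses the default distance metric because the metric the project used isn't stated here. Since `TfidfVectorizer` normalizes each row to unit length by default, Euclidean distance ranks neighbors the same way cosine similarity would.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up strains; each document describes the strain at the same index.
strains = ["Strain A", "Strain B", "Strain C", "Strain D"]
documents = [
    "hybrid happy relaxed earthy sweet",
    "indica sleepy relaxed earthy pine",
    "sativa energetic uplifted citrus sweet",
    "sativa creative uplifted citrus lemon",
]

# Step 1: fit the vectorizer on the corpus and vectorize every document.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

# Step 2: fit the nearest neighbors model on the document vectors.
knn = NearestNeighbors(n_neighbors=2)
knn.fit(doc_vectors)


def recommend(query, k=2):
    """Vectorize the query with the fitted vocabulary and return the k closest strains."""
    query_vector = vectorizer.transform([query])
    _, indices = knn.kneighbors(query_vector, n_neighbors=k)
    return [strains[i] for i in indices[0]]


recommendations = recommend("sativa uplifted citrus")
```

In this toy setup the two strains that share all three query words come back as the recommendations, even though neither document matches the query exactly.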

Model Evaluation

As this was an unsupervised problem (no predetermined target to predict) and because we did not have user preference data against which we could compare the recommendations, the process of validating that the system was providing good recommendations was mostly manual and done by feel.

The majority of it was simply to play around with the input by entering various combinations of characteristics and considering whether the resulting recommendations were relevant to the query or not.

As we had surmised, using only 'type', 'effects', and 'flavor' was quite sufficient to get decent recommendations, at least as far as we could see. However, given more time, we could (and would) have done much more to evaluate and improve the model.

This was my first time working on any kind of software project as a team. As such, it provided me with a great many learning opportunities.

One of the most valuable skills I began learning during this project was how to manage Git as part of a team. I had been using Git to manage my own code and writing repositories for over a year by this point, but had never done so as part of a team. I had a lot of fun researching, discussing, and applying best practices with my teammates, one in particular: Vera Mendes. The link to her blog can be found below.

Another obvious, though still very important, skill that this experience allowed me to practice was effective and adaptive teamwork.

I won't go into detail here. But suffice it to say that one of the more important members of the data science team (the one primarily responsible for the parts of the project requiring knowledge that the rest of us had not learned yet) ended up not being much of a team player. As a result, the two other data scientists and I had to quickly adapt, learning and applying these concepts on the fly.

As tends to be the case in these types of situations, this complication led to an even greater learning experience for the rest of us. In fact, as someone who learns best by doing, this is the type of situation in which I excel.

Overall, it was a great experience (even though the app never went fully live). I hope you enjoyed reading about it.

As always, thank you for reading, and I'll see you in the next one!