print(fiction) is a solo project I worked on to explore the data on and around fictional stories.

I used Scrapy to scrape metadata for over 20,000 books from GoodReads and used it to train a series of machine learning classifiers. The final version of the model classified books as either fiction or nonfiction with ~80% accuracy.

The dataset is freely available for download from GitHub.

I built an interactive dashboard using Plotly Dash that can be used to tinker with the model parameters and view the resulting prediction in real time.

You can find the current live version of the app here:

This project is part of an ongoing series of exploratory articles and projects called Sci-Fi IRL, through which I am exploring the relationship between science-fiction and the real world. It is my belief that the stories we read, write, and believe in, particularly about our future, have an effect on how that future ultimately turns out.

Our human minds are geared toward thinking about what could go wrong. It follows that the majority of stories in popular and niche culture are about how things could go wrong, usually catastrophically so, because that "tells a good story".

In the case of science-fiction, stories tend to be dystopian in nature, showing what could go wrong if our technology advances along certain trajectories.

But does this affect our outlook on what these technologies can do for us?

While it is always good to consider the possible ramifications of technological advances, I believe that too many dystopian stories are causing humans, as a civilization, to fall short of our potential. If instead of describing dystopia, the majority of science-fiction was utopian—exploring the possible ways that things could go _right_ for us—I believe it would, in a very real sense, point us a little bit more in that direction.

If that's a bit too lofty for you, another way to think about this is to imagine what your life could be like 100 years from now (i.e. if you'd been born 60 years from now). Depending on how things go, you could be scraping by with a group of other radiation-poisoned humans as the world recovers from nuclear holocaust. Or, you could be out exploring the galaxy in a luxury space yacht, with a potential lifespan in the centuries or millennia.

Which is more interesting to you? Which future would you rather live in?

This is the area I'm exploring with this series. I want to find the data and conduct the analyses that begin to show how our collective narrative (as represented by popular science-fiction) can bring about changes in our technological progress.

Of course this area is too large to explore in a single project, which is why I am doing it as a series. The first article in the series explored, at a very basic level, how technical terminology disperses through popular culture. You can find that article here: Tech Term Velocity.

In this project, print(fiction), the broad question I wanted to explore was this:

What separates fact from fiction?

...which is really just a cliché way of saying I wanted to explore the differences between nonfiction and fiction stories. My favorite method of consuming science-fiction is through books. Therefore, I chose to look at the differences between fiction and nonfiction books.

Without diving into the actual content of books (that's a project for a later time when I have more experience with natural language processing), my goal was to look for patterns in the book metadata that could distinguish the two genres.

One last quick note before getting into it.

I'm not going to walk through every single step (though I'll do my best to paint the whole picture) or show the actual code in this article. If you're curious about the nitty-gritty details and code I wrote for any of the steps, the Jupyter notebooks found at the link below walk through all of it.

I searched around a bit for data that could help me answer this question. There were a few datasets on Kaggle that contained metadata scraped from GoodReads. I liked the idea of using metadata, but none of these datasets were quite what I wanted. Therefore, I decided to gather my own dataset and gain some more experience with web scraping along the way.

Scraping

I did not build the Scrapy scraper completely from scratch, though I probably would have if I hadn't run across a small project on GitHub that was built for this exact purpose. I figured the scraper itself wasn't as important to this particular project as the data it produced, so I decided to use it instead of building my own.

However, I did get my feet pretty wet with Scrapy in order to understand what it was doing and why it didn't work when I first tested it out. I forked it and made some minor modifications to the code to fix a bug or two and make it better fit my use-case. Overall, finding the project was a definite time-saver, and I still got to learn about how to build a robust web spider with Scrapy.

My fork of the repository can be found here: tobias-fyi/GoodReads.

Of course it's not realistic to simply send off the spider to crawl every single book on GoodReads. I decided to look at the popular reading lists that are curated on the site, as they do a good job aggregating books chosen by many tens of thousands of users.

The list I ended up choosing to use as the basis for my scrape is called Books That Everyone Should Read At Least Once, which has a little over 20,000 titles, voted on by almost 100,000 users.

This worked out nicely, and after about 16 hours of crawling and scraping, I had my dataset containing the metadata for about 20,000 books.

The raw output from the scraper took the form of a series of JSON files. I wrote a few helper functions that concatenated these files into a single Pandas DataFrame, with some basic initial preprocessing steps built in, such as dropping duplicate records and cleaning up column names.
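
For reference, here is a minimal sketch of what that concatenation step might look like (the "data/" directory is a placeholder, and whether the scraper wrote plain JSON or JSON Lines is an assumption; pass lines=True to read_json for the latter):

```python
# Gather the scraper's JSON output into a single DataFrame.
from pathlib import Path
import pandas as pd

frames = [pd.read_json(path) for path in Path("data").glob("*.json")]
df = pd.concat(frames, ignore_index=True)

df = df.drop_duplicates(subset="url")  # one row per book
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")  # tidy column names
```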

Data wrangling and exploration

Dataset in hand, it was time to explore and wrangle!

As always, the first step was to take a look at what I had: basic info about the features, such as the types of data and null values.

Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   url                    21514 non-null  object 
 1   title                  21513 non-null  object 
 2   author                 21513 non-null  object 
 3   num_ratings            21513 non-null  float64
 4   num_reviews            21513 non-null  float64
 5   avg_rating             21513 non-null  float64
 6   num_pages              20339 non-null  float64
 7   language               19366 non-null  object 
 8   publish_date           21078 non-null  object 
 9   original_publish_year  12839 non-null  float64
 10  genres                 18573 non-null  object 
 11  characters             5823 non-null   object 
 12  series                 7048 non-null   object 
 13  places                 5238 non-null   object 
 14  asin                   3953 non-null   object 
 15  0_rating_count         0 non-null      float64
 16  1_rating_count         20219 non-null  float64
 17  2_rating_count         20219 non-null  float64
 18  3_rating_count         20219 non-null  float64
 19  4_rating_count         20219 non-null  float64
 20  5_rating_count         20219 non-null  float64
dtypes: float64(11), object(10)
memory usage: 3.4+ MB

Feature management

Before digging deep into the data, there was some initial processing and pruning to be done to the features to make them more manageable later on.

Right off the bat I removed some features that likely wouldn't prove useful in the model:

  • url
  • asin
  • 0_rating_count
  • characters
  • places

'url' and 'asin' are obviously not going to be useful, just extra noise. As can be seen in the info table above, '0_rating_count' was completely null because GoodReads doesn't allow books to get zero stars.

I based my decision to remove 'characters' and 'places' on my domain knowledge and on the quality of their data. In my experience, nonfiction books are much less likely to be set in a particular location or have characters.

On one hand, this could be valuable information for the model to know: if the book has a list of characters, it is more likely to be fiction. On the other hand, that information could be too useful—i.e. it could leak information about the target.

Both have a high proportion of null values, over 70%, as can be seen in the table above, and I cannot be sure whether the null values represent the fact that users simply haven't taken the time to add that data to those books, or if those books really do not have any significant characters or places.

Null values

A feature with a large proportion of null values is far less likely to be useful, as imputing (filling in) the missing data can add noise—or even artificial patterns—that would adversely affect the model. Therefore, a feature such as that can usually be considered extraneous and removed from the dataset. With a small enough proportion of null values, imputing or otherwise filling in the missing data is more likely to preserve the preexisting patterns in those features.

Depending on the distribution of null values throughout the dataset and the characteristics of the features, it might be better to remove rows that have a high proportion of nulls.

There was one specific feature for which I wanted to drop all rows with a null value: 'genres'. Because I would be using it to engineer my target, I didn't want to risk biasing the model by imputing the missing values.

After removing those, I was left with 18,573 rows.

The above table doesn't do a great job describing the null value situation. Below is a visualization (thanks to missingno) showing the distribution and proportion of null values per feature (after removing rows with null 'genres').

Null values by feature visualization
Distribution of null values.

I used Scikit-learn's IterativeImputer to impute the missing values for many of the features with null values. Basically, it models each feature as a function of the other features, using that model to "predict" what the missing value would have been if it wasn't null.
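
A minimal sketch of that imputation step, assuming the numeric columns listed below (the exact subset I imputed may differ):

```python
# IterativeImputer is still flagged as experimental, so it needs the enabling import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

numeric_cols = ["num_ratings", "num_reviews", "avg_rating", "num_pages"]  # assumed subset

imputer = IterativeImputer(random_state=42)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```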

There were a couple of features that had to be handled differently, 'original_publish_year' and 'series', as their null values actually held information.

For 'original_publish_year', I assumed that a null value indicated that the book had not been previously published. And similarly for 'series', I assumed null indicated a standalone book. I transformed both of these features into binary: 0 indicating the value was null, 1 indicating it was not.
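
In pandas, that transformation is a one-liner per feature; a sketch (the 'republish' name matches the feature that shows up later in the importance and SHAP plots):

```python
# 1 if a value was present, 0 if it was null.
df["republish"] = df["original_publish_year"].notnull().astype(int)
df["series"] = df["series"].notnull().astype(int)
```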

Distributions and outliers

As can be expected with media such as books, there were some large outliers with regard to popularity. While the vast majority of books had relatively few ratings and even fewer reviews, there was a relatively small group of authors who struck gold, with them and their books becoming household names.

I found it somewhat surprising that my dataset reflected this reality so closely, as I thought a list of "must reads" would be biased toward more popular books. That bias is likely present—the dataset is large enough to include many books that never became household names—though not to the degree I initially thought.

distplot num_ratings
Distribution of the number of ratings.

The distributions of the number of ratings and reviews are both heavily skewed to the right. In the boxplot below, the box spans the interquartile range (IQR), and the whisker that sits just barely to its right marks the upper fence, roughly 1.5 times the IQR above the third quartile. Basically, anything beyond that line can be considered an outlier.

Another way to look at the dispersion of the data is to consider the mean versus the median, along with the standard deviation. In the case of 'num_ratings', the relative size of the mean (46,958, with a standard deviation of 212,398!) compared to the median (4,135) indicates that the mean is significantly influenced by outliers.
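
That check is quick to run with pandas; a sketch:

```python
# Compare the mean, median, and standard deviation of num_ratings.
stats = df["num_ratings"].describe()
print(stats[["mean", "50%", "std"]])
# A mean an order of magnitude above the median, with an even larger
# standard deviation, points to heavy right skew from a few blockbuster titles.
```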

boxtplot of num_ratings
Boxplot showing distribution of the number of ratings.

Curious what those books are?

most-rated books histogram
The books with the most ratings.

I didn't show the plots for 'num_reviews' as they look almost identical to those for 'num_ratings', only the values are in the range of hundreds of thousands instead of millions.

Dealing with the outliers

Notice anything about the books with the most ratings—i.e. the most popular?

Every one of the top 12 most popular books is fiction.

Based on that observation, I figured the model would find these particular outliers to be useful. However, given the huge range of these features, they would have to be scaled for use in any linear models. To this end, I included an instance of Scikit-learn's StandardScaler in my pipeline, which standardizes numeric features, transforming the data such that each feature has a mean of zero and a standard deviation of one. Basically, that brings all of the numeric features onto a comparable scale.
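
For illustration, here is roughly what that scaling step does (in the actual pipeline the scaler is fit only on the training split):

```python
from sklearn.preprocessing import StandardScaler

numeric_cols = ["num_ratings", "num_reviews", "num_pages"]  # assumed subset
scaler = StandardScaler()
scaled = scaler.fit_transform(df[numeric_cols])
# Each column of `scaled` now has mean ~0 and standard deviation ~1.
```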

Scaling the data is very important for linear models or neural networks—not so much for decision tree-based models. Therefore, I only used the StandardScaler as needed.

num_pages_boxplot.png
Boxplot showing the distribution of number of pages.

Another feature that had stark outliers was the number of pages. There were several books with over 5,000 pages, while the majority had fewer than 500. As with the outliers discussed above, I figured that the number of pages was relevant to the model.

Upon further inspection, however, I found that most of the outliers in this case were not actually single books, but entire book series. Therefore, I decided to remove the farthest outliers—those over 2,000 pages in length.

At this point, I was left with 18,344 rows in my dataset.

Feature relationships

Another line of inquiry I wanted to explore before diving into the model was relationships between features and between the target and the features. The obvious first step was to create a matrix of scatterplots between what I believed to be the most important features and the target. As the target is binary, I thought it best to represent it as a third dimension in each of the scatterplots: color.
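
One way to produce such a matrix is with seaborn's pairplot, using the binary target as the hue; a sketch (the feature subset and the 'fiction' target column name are assumptions):

```python
import seaborn as sns

features = ["num_ratings", "avg_rating", "num_pages"]
sns.pairplot(df, vars=features, hue="fiction", plot_kws={"alpha": 0.3})
```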

fiction-pairplot-1
Matrix of scatterplots showing relationships between certain features.

Although it's difficult to see much in the way of detail, this matrix is great for getting an overall idea of what's going on between the features in question.

One characteristic that immediately stuck out to me is that the majority of outliers in 'num_ratings' are fiction books. This helps to confirm my earlier hypothesis that these outliers would be valuable to the model.

Also, I noticed that 'avg_rating' seems to be different between fiction and nonfiction books. The comparison of distributions (middle-middle) and the scatter plots comparing 'avg_rating' to 'num_pages' (middle-bottom and middle-right) seem to indicate that nonfiction books are rated slightly higher than fiction.

Seeing this difference in densities when grouped by the target made me want to explore other features in this manner. An interesting one I found was 'publish_year'. As can be seen in the plot below, it seems that the "golden years" of fiction were in the mid-2000s (at least according to this reading list), whereas the mid-2010s had more good nonfiction.

publish_year_x_fiction
Distribution of books by fictionality.

Although the exploration above is relatively basic, I can see that there seem to be some meaningful relationships in the data. Most important to my purposes is confirming that there are patterns that indicate some differentiation in the features when grouping them by the target.

With that in mind, I felt good enough to start playing with some models.

Baseline Models

As I discussed above, the most complex model is not necessarily the best model. It's generally a good idea to start with a simple baseline model and progressively add complexity in subsequent iterations. The chosen evaluation metric will (ideally) indicate when the added complexity is beneficial, and when it is not.

Majority class

The simplest possible model is to predict the majority class every time, regardless of input. Assuming an evenly distributed binary target, that model should be right about half the time, i.e. have 50% accuracy. In my case, the majority-class baseline is just north of that, around .52, or 52%.
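
Scikit-learn's DummyClassifier gives this baseline directly; a sketch, with the usual train/validation split variables as placeholders:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_train, y_train)
print("Majority-class accuracy:", accuracy_score(y_val, majority.predict(X_val)))

# Equivalent shortcut: the normalized value counts of the target.
print(y_train.value_counts(normalize=True))
```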

Now I know that no matter what model I end up choosing and what features I end up using, it must have an accuracy of more than .52. If I can't beat that...

Limited logistic regression

Taking it a step further, I find it useful to get a second baseline using an algorithm with a bit more complexity than that, but far less complexity than what is possible given the problem parameters. In this case I chose to train a limited logistic regression model using only a few of the features: 'num_reviews', 'avg_rating', and 'num_pages'. This simple baseline model had an accuracy of ~.63 and an F1 score of ~.65.
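
A sketch of what that baseline might look like (the split variables are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

baseline_features = ["num_reviews", "avg_rating", "num_pages"]

# Scale the features, then fit a plain logistic regression.
logreg = make_pipeline(StandardScaler(), LogisticRegression())
logreg.fit(X_train[baseline_features], y_train)

preds = logreg.predict(X_val[baseline_features])
print("Accuracy:", accuracy_score(y_val, preds))
print("F1:", f1_score(y_val, preds))
```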

I will go into some detail on what the F1 score is in the Model Validation section. For now, suffice to say that the discrepancy between it and the accuracy is due to the slight imbalance of the classes in the target.

Default random forest

I actually trained one more baseline model to set the bar higher and get a more precise idea of how high that bar could go given the problem, the available features, and the amount of data in the dataset.

This last baseline was a random forest that I left with default hyperparameters, trained on the full set of features as they were at this point in the process. This more complex—maybe even overly complex—model did quite a bit better, with an accuracy of ~.73 and an F1 of ~.75.

That's the score to beat as I iterate on the model and add complexity.

Model validation

As always, before jumping into building and training, I needed to make some decisions about how I would measure success (or, more accurately, how the model measures success). Choosing appropriate evaluation metrics is crucial to get an accurate understanding of how well a model is fitting (or overfitting) the data. An overly complex model will likely overfit the training data, leading to a model that performs poorly on new inputs on which it was not trained. Decision trees are particularly prone to overfitting.

Target distribution

The target I'm building a model to predict is binary: 1 or 0, fiction or nonfiction. Therefore, the metrics I choose must be appropriate for a binary classification model. The most common way to measure the performance of a binary classifier is the accuracy score, i.e. the proportion of predictions that the model gets right.

A very important characteristic to look at when choosing a metric is the distribution of the target variable. In the case of binary classification, if the target is skewed one way or the other, accuracy is not a good method of evaluating performance.

Although the target in my dataset is fairly evenly distributed between fiction (~54%) and nonfiction (~46%), it is still slightly skewed. This means that if I use accuracy, I can expect it to be a decent indicator of performance, but slightly less than ideal. The reason for this is that accuracy only tracks the number of the model's mistakes, irrespective of how the model made those mistakes.

Types of errors

In the case of binary classification, there are two ways a prediction can be wrong: a false positive (type I error) or a false negative (type II error). As the name suggests, a false positive is when the model (falsely) predicts positive, or 1, when the actual label is 0. A false negative is the opposite: the model predicts negative, or 0, when the label is 1.

A better method of evaluating performance will take into account the number of each type of error in a model's predictions. Depending on the target and what the model will be used for, it may be better to use a metric that looks at one or the other.

A common example is a spam detection model. In this case, it is generally better to let some spam emails through the filter into the users' inboxes (false negative) than to put potentially important emails into the spam folder (false positive).

Confusion matrix

A common and effective way to get deeper insight into a classification model's predictions is with a confusion matrix, which is a type of contingency table with the target's classes laid out along both axes, the rows representing the actual classes and the columns showing the predicted classes.

conf_matrix_def_rf.png
Confusion matrix for default random forest model.

Where each row and column intersect, it shows the number of predictions that fall into that category. In the case of binary classification, this results in a table containing four numbers: true positives, true negatives, false positives, and false negatives. A classifier that made no mistakes would have non-zero numbers only on the main diagonal (top left to bottom right).

Evaluation metrics

There are three main metrics that can be derived from the confusion matrix. The first is precision: the proportion of predicted positives (true positives + false positives) that actually are positive (true positives). The second is recall: the proportion of actual positives (true positives + false negatives) that the model correctly predicted as positive (true positives). The last metric I'll be covering now is the F1 score: the harmonic mean of precision and recall.
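
To make those definitions concrete, here is how the numbers fall out of scikit-learn's confusion matrix (a sketch; `model`, `X_val`, and `y_val` are placeholders):

```python
from sklearn.metrics import confusion_matrix, f1_score

preds = model.predict(X_val)
tn, fp, fn, tp = confusion_matrix(y_val, preds).ravel()

precision = tp / (tp + fp)  # of everything predicted fiction, how much really is
recall = tp / (tp + fn)     # of all actual fiction, how much the model caught
f1 = 2 * precision * recall / (precision + recall)

# sklearn's helper gives the same number directly:
print(f1, f1_score(y_val, preds))
```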

I chose to use the F1 score as the primary evaluation metric for my model because I'm more interested in reducing the number of mistakes in general, as opposed to preferring to optimize for one or the other.

Feature engineering

When I initially started this project, I went through the process of validating and training a model or two that tried to predict the average rating of books. This was by far the most common target chosen by those who started Kaggle kernels using other GoodReads datasets. Although this may have the most obvious business value if I were a data scientist working for a book publisher, it wasn't a particularly interesting target for me to try to predict.

Target practice

I realized this when I hit a wall with my progress in improving the rating-predictor model. One reason was that I did not see any obvious useful features that could be engineered. However, once I found my way to the idea of predicting the fictionality of the books, the target drove the direction I took with my feature engineering. It was a great learning experience for me in engineering features toward the specific target that the model is trying to predict. I called this process "target practice".

Here are the feature ideas I came up with and engineered (all in short succession once the new target was chosen); a rough sketch of how a few of them might be computed follows the list:

  • Title begins with "The"
  • Has subtitle: contains ":"
  • Title character count
  • Title word count
  • Title longest word
  • Author number of names
  • Author middle initial
  • Ratings (stars) ratio: (1-star + 2-star counts) / (4-star + 5-star counts)
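
Here is a sketch of how a few of these could be computed with pandas. The new column names are illustrative, except those like 'has_subtitle' and 'title_char_count' that appear in the importance tables later:

```python
titles = df["title"].fillna("")
authors = df["author"].fillna("")

df["the_title"] = titles.str.startswith("The ").astype(int)
df["has_subtitle"] = titles.str.contains(":").astype(int)
df["title_char_count"] = titles.str.len()
df["title_word_count"] = titles.str.split().str.len()
df["title_longest_word"] = titles.str.split().apply(
    lambda words: max((len(w) for w in words), default=0)
)
df["author_name_count"] = authors.str.split().str.len()
df["author_middle_initial"] = authors.str.contains(r"\b[A-Z]\.", regex=True).astype(int)
df["rating_ratio"] = (df["1_rating_count"] + df["2_rating_count"]) / (
    df["4_rating_count"] + df["5_rating_count"]
)
```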

Forest with new features

To get an initial idea of the effect that these new features had on the model, I retrained a new random forest with all the same hyperparameters (defaults). The result was a significant boost in the model's F1 score, from ~.75 to ~.79. That made me happy.

However, because I engineered all of the new features at once—i.e. I didn't retrain the model after every one—this does not give me insight into which ones were useful. In fact, at this point I hadn't looked at how useful any of the specific features were.

Permutation importance

It is likely that some of the features do not help the model make correct predictions. Indeed, some may even be worse than that: they could add noise that makes the model perform worse.

To address this potential problem, I'm going to find each feature's importance to the model using a method called permutation importance. Basically, this method goes through each of the features, one by one, randomly shuffling (permuting) its values, which preserves the feature's distribution but breaks its relationship with the target. The model's performance on the shuffled data is then compared with its score on the original data to find the effect that each feature has on the performance of the model.
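
Scikit-learn ships an implementation of this (eli5's PermutationImportance is another common option); a sketch, with the fitted model and validation split as placeholders:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_val, y_val, scoring="f1", n_repeats=10, random_state=42
)
importances = pd.Series(result.importances_mean, index=X_val.columns)
print(importances.sort_values(ascending=False))
```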

The following table is the result of running permutation importance on both the new and old features.

permutation_importance_3.png
Permutation importance table showing the effect that each feature has on the model.

Along with being useful for feature selection, I find it interesting to see what features have the largest positive effect on the model's predictive power. From this table, I can see that the majority of the benefit I got from engineering the new features came from 'has_subtitle'. This feature, according to the permutation importance table, is the most important predictor of the lot, and simply indicates whether the title of the book has a colon in it. My intuition was that having a subtitle is very common for nonfiction books, not so much for fiction. It seems that my intuition was generally good.

Feature pruning

Based on the above table, I should see a small increase in the model's performance by removing 'publish_month', 'author', 'title_word_count', 'title', and 'publish_day'. Once I removed those features, I saw another bump of ~.01 in the model's F1 score.

Sweet! That put the model up to an F1 score just south of .80 on the validation set.

Now that I'd engineered some new features and pruned them to only the useful ones, it was time to iterate and tune.

Iteration

As I mentioned above, a good general approach to finding and training the best model for a particular problem and dataset is to start simple and iterate. I already iterated to select the best features. Next up was to iterate on algorithms and their various hyperparameters.

Cross-validation

Up until this point, I had been using a consistent train-test-validation set split. However, now that I was going to be tuning hyperparameters, it made sense to start using cross-validation. I won't get into the details of what that is in this particular article. If you're curious, search for it on Duck Duck Go, or read about it in The Python Data Science Handbook.

Hyperparameter tuning

Hyperparameters are the parameters of the model that are not directly learned during the training process—they must be adjusted manually. My preferred method of tuning them is with a randomized search.

Essentially, I provide a list of algorithm hyperparameters and a search window or distribution for each one. The algorithm runs through a specified number of iterations, each time randomly selecting values for each hyperparameter from their respective distribution and using those to train a new model. Each model is evaluated after each iteration using cross-validation. Then, once the search is over, the pipeline is refitted using the values that resulted in the model with the best cross-validation score.

As the search is random and limited to a reasonable number of iterations, some strategy is involved to find optimal search windows/distributions for each hyperparameter. The method I used for this is to start out searching a wide range of values and go through the process a few times, progressively narrowing the range around the values that come up as consistently optimal.
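
A sketch of what one of those searches might look like for the random forest (the specific hyperparameters and ranges here are illustrative, not the exact ones I used):

```python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(5, 40),
    "min_samples_leaf": randint(1, 10),
    "max_features": uniform(0.1, 0.9),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,      # random combinations to try
    cv=4,           # 4-fold cross-validation -> 40 total fits
    scoring="f1",
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```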

Validation

Before getting too deep into tuning the hyperparameters of a specific algorithm (it is computationally expensive to run exhaustive searches), I thought it best to test out a few different algorithms. I started out with the assumption that random forest or gradient-boosting would provide the best results, but it's generally good to test assumptions.

Though I did not test a very wide range of algorithms, I figured it would be worth it anyway to see if any of the more commonly used ones showed any promise.

Somewhat to my surprise, the best models from tuning and cross-validation were basically on par with the default, non-cross-validated random forest.

The fact that the baseline was trained without using cross-validation could be one reason. In other words, the best model from the search might generalize better to new data; it could outperform the default one on the test dataset.

Or, it could be a result of the parameters and ranges I searched, i.e. not giving a wide enough initial distribution. However, the wider the distribution, the more "noise" will be present in the results of the search if the number of iterations is relatively low. So I wanted to wait until I had chosen an algorithm before significantly increasing the number of search iterations.

Here is the full list of algorithms I trained during the validation and iteration process, and their best F1 score:

  • Logistic regression: ~.76
  • K-nearest neighbors: ~.69
  • Random forest: ~.80
  • Gradient-boosted decision tree: ~.80

Final algorithm

Even with 40 total fits (4 cross-validation folds, 10 iterations) the gradient-boosted classifier did not really outperform the random forest by any significant margin. Given the additional complexity and computation required for an XGBoost (gradient-boosting) model, I decided to move forward with the random forest classifier.

To continue testing the hypothesis that my initial number of iterations was too low for the search to converge on a good combination of hyperparameters, I trained more random forests with higher numbers of search iterations.

To my continued surprise, even after many, many more iterations, the absolute best F1 score I got still hovered right around ~.80.

Results and Interpretation

At this point, I did not want to waste any more time or compute power trying to find that last little bit of optimization. So, it was time to evaluate the final model, interpret its predictions, then deploy it.

Evaluating the final model

Finally it was time to unlock the test data and see how the model does. The result was an F1 score slightly north of .78.

Here is the confusion matrix derived from the model's predictions on the test set:

conf_matrix_final_rf_test_data.png
Confusion matrix showing performance of final random forest model on the test data.

Predicted probabilities

One way to interpret a trained model is to inspect the predictions that the model made on the test set, either individually or through various descriptive statistics and visualizations. A key component to this is the probabilities underlying each of the predictions.

By looking at predicted probabilities, I can see the instances where the model was very sure or not sure of its predictions, and whether those predictions were correct. By looking at the cases where the model made mistakes, I can hopefully pick out patterns and glean some insight into why the model made them.
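
A sketch of how those probabilities can be pulled out and sorted to surface the most confident mistakes (variable names are placeholders):

```python
results = X_test.copy()
results["actual"] = y_test
results["predicted"] = model.predict(X_test)
results["p_fiction"] = model.predict_proba(X_test)[:, 1]  # probability of class 1 (fiction)

# False negatives the model was most confident about:
false_negatives = results[(results["actual"] == 1) & (results["predicted"] == 0)]
print(false_negatives.sort_values("p_fiction").head(10))
```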

If I find that the model consistently makes mistakes that don't make much sense, that could mean there is some more optimization to be done or weakness to be addressed.

False negatives

The first one that jumped out at me was subtitles. The model finds that feature very useful, yet of course there are some fiction books with subtitles. This mistake makes sense to me, and for the benefit that the feature adds to the model, it is worth incorrectly classifying some fiction books as nonfiction. One way to get around this would be to engineer another feature or three that attempt to be the complement to that feature, catching the instances when fiction books do have subtitles.

Another feature that causes a similar type of error is the title character count. According to the feature importances table (and plot generated from the table's data, shown below) of the final random forest model, 'title_char_count' is also a very important feature. I can see that many of the false negatives (predicted nonfiction, actually fiction) have a high title character count.

The SHAP plots below formalize these observations and provide more insight by showing the effect that the value of each feature had on the model's prediction. This first group are all false negative predictions, meaning the model predicted nonfiction when the book was actually fiction.

I also wanted to see what features caused false negatives when the book had neither a high character count nor a subtitle. The third plot below shows the top three as 'republish', 'avg_rating', and 'publish_year'. I was surprised to see 'republish', as it does not seem to be all that important to the model overall. The other two seem to be relatively good predictors, particularly 'avg_rating'.
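
For reference, plots like these can be generated with shap's TreeExplainer on the fitted random forest; a sketch (the shape of shap_values varies between shap versions, so treat the indexing as illustrative):

```python
import shap

explainer = shap.TreeExplainer(rf_model)   # the fitted random forest
shap_values = explainer.shap_values(X_test)

i = 0  # index of a false-negative row to inspect
shap.force_plot(
    explainer.expected_value[1],   # baseline for the "fiction" class
    shap_values[1][i],             # per-feature contributions for this prediction
    X_test.iloc[i],
    matplotlib=True,
)
```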

shap_fn_1.png
Example false negative prediction.
shap_fn_2.png
Another example of a false negative prediction.
shap_fn_3.png
Another example of a false negative prediction.

False positives

The other side of the confusion matrix again corroborates my initial observation. Many of the same features that provided some additional predictive power also mean some additional mistakes. However, the overall effect was net positive. If it wasn't these features causing the incorrect predictions, it would be others, and there would be more incorrect predictions overall.

shap_fp_1.png
Example of a false positive prediction.
shap_fp_2.png
Example of a false positive prediction.
shap_fp_3.png
Example of a false positive prediction.

So there you have it, the process of training a machine learning model, from start to finish. The only thing left to do is to deploy it.

I thought it would be interesting to be able to set up a dashboard that can be used to give dynamic inputs to the model and see the resulting prediction. I decided to build the dashboard using Plotly Dash and deploy it to Heroku.
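
The skeleton of such an app is fairly small. Here is a hedged sketch of the general shape: one input wired to the model via a callback. The file name, component ids, and single-feature input are placeholders; the real app feeds the pipeline its full feature set.

```python
import joblib
import pandas as pd
from dash import Dash, dcc, html
from dash.dependencies import Input, Output

model = joblib.load("model.joblib")  # placeholder path to the pickled pipeline
app = Dash(__name__)

app.layout = html.Div([
    dcc.Slider(id="num-pages", min=0, max=2000, step=10, value=300),
    html.Div(id="prediction"),
])

@app.callback(Output("prediction", "children"), Input("num-pages", "value"))
def predict(num_pages):
    # The real app fills in every feature the pipeline expects.
    row = pd.DataFrame([{"num_pages": num_pages}])
    return "Fiction" if model.predict(row)[0] == 1 else "Nonfiction"

if __name__ == "__main__":
    app.run_server(debug=True)
```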

So you don't have to scroll all the way up, here's another link to the live app: print(fiction).

Shoutout to HTML5UP for the template on which I based the design of the app. I made some minor modifications to it, but most of the design was straight from one of his templates.

If you made it this far, I salute you.

Thank you for reading and I'll see you in the next one!