print(fiction)
Trained a machine learning classifier using a dataset scraped from GoodReads; built and deployed an interactive dashboard using Plotly Dash.
- Introduction
- The Data
- The Model
- Results and Interpretation
- Deployment
Introduction
print(fiction) is a solo project I worked on to explore the data on and around fictional stories.
I used Scrapy to scrape metadata for over 20,000 books from GoodReads and used it to train a series of machine learning classifiers. The final version of the model classified books as either fiction or nonfiction with ~80% accuracy.
The dataset is freely available for download on GitHub.
I built an interactive dashboard using Plotly Dash that can be used to tinker with the model parameters and view the resulting prediction in real time.
This project is part of an on-going series of exploratory articles and projects called Sci-Fi IRL, through which I am exploring the relationship between science-fiction and the real world. It is my belief that the stories we read, write, and believe in, particularly about our future, have an effect on how that future ultimately turns out.
Our human minds are geared toward thinking about what could go wrong. It follows that the majority of stories in popular and niche culture are written about how things could go wrong, usually catastrophically so—it "tells a good story".
In the case of science-fiction, stories tend to be dystopian in nature, showing what could go wrong if our technology advances along certain trajectories.
But does this affect our outlook on what these technologies can do for us?
While it is always good to consider the possible ramifications of technological advances, I believe that too many dystopian stories are causing humans, as a civilization, to fall short of our potential. If instead of describing dystopia, the majority of science-fiction was utopian—exploring the possible ways that things could go right for us—I believe it would, in a very real sense, point us a little bit more in that direction.
If that's a bit too lofty for you, another way to think about this is to imagine what your life could be like 100 years from now (i.e. if you'd been born 60 years from now). Depending on how things go, you could be scraping by with a group of other radiation-poisoned humans as the world recovers from nuclear holocaust. Or, you could be out exploring the galaxy in a luxury space yacht, with a potential lifespan in the centuries or millennia.
Which is more interesting to you? Which future would you rather live in?
This is the area I'm exploring with this series. I want to find the data and conduct the analyses that begin to show how our collective narrative (aliased by popular science-fiction) can bring about changes in our technological progress.
Of course this area is too large to explore in a single project, which is why I am doing it as a series. The first article in the series explored, at a very basic level, how technical terminology disperses through popular culture. You can find that article here: Tech Term Velocity.
In this project, print(fiction), the broad question I wanted to explore was this:
What separates fact from fiction?
...which is really just a cliché way of saying I wanted to explore the differences between nonfiction and fiction stories. My favorite method of consuming science-fiction is through books. Therefore, I chose to look at the differences between fiction and nonfiction books.
Without diving into the actual content of books, my goal was to look for patterns in the book metadata that could distinguish the two classes of books.
Imports and Configuration
# === General Imports === #
import pandas as pd
import numpy as np
from scipy.stats import randint, uniform
import matplotlib.pyplot as plt
import seaborn as sns
import janitor  # pyjanitor: registers the .clean_names() DataFrame method used below
# === Configure === #
%matplotlib inline
pd.options.display.max_rows = 200
pd.options.display.max_columns = 200
# === ML Imports === #
# Preprocessing
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
# Model validation
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report, roc_auc_score
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
# Interpretations
import eli5
from eli5.sklearn import PermutationImportance
I searched around a bit for data that could help me answer this question. There were a few datasets on Kaggle that contained metadata scraped from GoodReads. I liked the idea of using metadata, but none of these datasets were quite what I wanted. Therefore, I decided to gather my own dataset and gain some more experience with web scraping along the way.
Scraping
I did not build the Scrapy scraper completely from scratch, though I probably would have if I hadn't run across a small project on GitHub that was built for this exact purpose. I figured the scraper itself wasn't as important to this particular project as the data it produced, so I decided to use it instead of building my own.
However, I did get my feet pretty wet with Scrapy in order to understand what it was doing and why it didn't work when I first tested it out. I forked it and made some minor modifications to the code to fix a bug or two and make it better fit my use-case. Overall, finding the project was a definite time-saver, and I still got to learn about how to build a robust web spider with Scrapy.
My fork of the repository can be found here: tobias-fyi/GoodReads.
Of course it's not realistic to simply send off the spider to crawl every single book on GoodReads. I decided to look at the popular reading lists that are curated on the site, as they do a good job aggregating books chosen by many tens of thousands of users.
The list I ended up choosing to use as the basis for my scrape is called Books That Everyone Should Read At Least Once, which has a little over 20,000 titles, voted on by almost 100,000 users.
This worked out nicely, and after about 16 hours of crawling and scraping, I had my dataset containing the metadata for about 20,000 books.
The raw output from the scraper took the form of a series of JSON files. I wrote a few helper functions that concatenated these files into a single Pandas DataFrame, with some basic initial preprocessing steps built in, such as dropping duplicate records and cleaning up column names.
In case you're curious, here are the helper functions I wrote to read and combine the JSON files into a single DataFrame and do some initial preprocessing.
def json_cat(json_files):
"""Reads and concatenates a list of .jl (json lines)
files into a single dataframe.
"""
# Read the books json files into a list of dataframes
dfs = [pd.read_json(filepath, lines=True) for filepath in json_files]
# Concatenate the list of dataframes
df = pd.concat(dfs, sort=False)
return df
def encode_book_genres(df):
"""Deconcatenates top 30 book genres into separate
features, OneHotEncoding style.
"""
# Set comprehension - creates a set of all genres listed in dataset
all_genres = {genre for row_genres in df["genres"] for genre in row_genres}
# Create a new feature for every genre
for genre in all_genres:
has_genre = lambda g: genre in g
df[genre] = df.genres.apply(has_genre)
# Create list of top 30 most common genres
most_common_genres = df[list(all_genres)].sum().sort_values(ascending=False).head(30)
most_common_genres = most_common_genres.index.tolist()
# Drop all but the top 30 genres from the dataframe
unwanted_genres = list(all_genres - set(most_common_genres))
df = df.drop(columns=unwanted_genres)
# Drop the original "genres" feature
df = df.drop(columns=["genres"])
return df
def book_cat(paths_list, output_filename):
"""Reads and concatenates a list of book_*.jl (json lines) files."""
# === Concatenate the list of dataframes === #
df = json_cat(paths_list)
# === Initial wrangling === #
df = df.drop_duplicates(subset=["url"]) # Drop duplicate records
# Format column names with pyjanitor
df = (df.clean_names())
# Break ratings_histogram (array) into component features
df_hist = df["rating_histogram"].apply(pd.Series)
rating_cols = {} # Dict for mapping column names
for col in df_hist.columns.tolist():
rating_cols[col] = f"{col}_rating_count"
# Rename according to mapper created above
df_hist = df_hist.rename(columns=rating_cols)
# Concat new columns onto main dataframe
df = pd.concat([df, df_hist], axis=1, join="outer")
# Drop extra column
df = df.drop(columns=["rating_histogram"])
df.to_csv(output_filename, index=False)
print(f"Created dataframe and saved to current directory as '{output_filename}'")
return df
# bookpaths is the list of paths to the scraped .jl files (assembled earlier, outside this excerpt)
books = book_cat(bookpaths, "must_read_books-01.csv")
print(books.shape)
books.head()
books.isnull().sum()
books.info()
Feature management
Before digging deep into the data, there was some initial processing and pruning to be done to the features to make them more manageable later on.
Right off the bat I removed some features that likely wouldn't prove useful in the model:
- url
- asin
- 0_rating_count
- characters
- places
url and asin are obviously not going to be useful, just extra noise. As can be seen in the info table above, 0_rating_count was completely null because GoodReads doesn't allow books to be given zero stars.
I based my decision to remove characters and places on my domain knowledge and on the quality of their data. In my experience, nonfiction books are much less likely to be set in a particular location or have characters.
On one hand, this could be valuable information for the model to know: if a book has a list of characters, it is more likely to be fiction. On the other hand, that information could be too useful, i.e. it could leak information about the target.
Both features have a high proportion of null values (over 50%, as can be seen in the table above), and I cannot be sure whether the nulls mean that users simply haven't taken the time to add that data to those books, or that those books really do not have any significant characters or places.
drop_cols = [
"url",
"asin",
"0_rating_count",
"characters",
"places",
]
books = books.drop(columns=drop_cols)
Dropping the rows with null genres values has to happen here in order to complete the data type conversion directly below.
books = books.dropna(subset=["genres"])
print(books.shape)
books.head(2)
Data types
I did some initial feature engineering early on to make the features more manageable for me and usable for the model. More feature engineering will take place later on.
First, the publish_date feature needed to be split up into its component parts (year, month, day), because Scikit-learn models can't directly use the datetime format. By splitting it up into integers, its meaning can be preserved in a form the model can process.
Second, I had to deal with nested data. Most notably, the genres column was organized as an array of genres for each book. Because Pandas doesn't parse this kind of data by default, the column imported as the object (text) datatype. The best way I found to deal with it, without delving into natural language processing, was to break out each genre into its own column, one-hot encoding style.
This step was very important, as I used the genres to engineer the fiction target. It is also why the rows with null genres values were removed above.
def book_pub_date(df: pd.DataFrame):
"""Deconcatenates book publish_date to three separate features
for year, month, and day. Drops the original publish_date feature.
"""
# Convert the "publish_date" column to datetime
df["publish_date"] = pd.to_datetime(df["publish_date"], errors="coerce", infer_datetime_format=True)
# Break out "publish_date" into dt components
df["publish_year"] = df["publish_date"].dt.year
df["publish_month"] = df["publish_date"].dt.month
df["publish_day"] = df["publish_date"].dt.day
df = df.drop(columns=["publish_date"]) # Drop the OG publish_date
return df
books = book_pub_date(books)
books.head(2)
Null values
A feature with a large proportion of null values is far less likely to be useful, as imputing (filling in) the missing data can add noise—or even artificial patterns—that would adversely affect the model. Therefore, a feature such as that can usually be considered extraneous and removed from the dataset. With a small enough proportion of null values, imputing or otherwise filling in the missing data is more likely to preserve the preexisting patterns in those features.
Depending on the distribution of null values throughout the dataset and the characteristics of the features, it might be better to remove rows that have a high proportion of nulls.
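To make that trade-off concrete, here is a small sketch (not a step in my actual pipeline, and the 50% threshold is hypothetical) of what a threshold-based prune of sparse features could look like:
# Sketch only: drop any feature whose share of null values exceeds a hypothetical 50% threshold
null_share = books.isnull().mean()                # proportion of nulls per column
too_sparse = null_share[null_share > 0.5].index   # columns above the threshold
print("Candidates to drop:", too_sparse.tolist())
books_pruned = books.drop(columns=too_sparse)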
There is one specific feature for which I want to drop all rows that have a null: 'genres'. Because I am going to be using this to engineer my target, I don't want to risk biasing the model by imputing the missing values.
After removing those, I was left with 18,573 rows.
The above table doesn't do a great job describing the null value situation. Below is a visualization (thanks to missingno) showing the distribution and proportion of null values per feature (after removing rows with null genres).
books.isnull().sum()
import missingno as msno
msno.matrix(books)
plt.title("Null values by feature")
# Settings for saving as image
# plt.tight_layout()
# plt.savefig("null_values_by_feature.png", dpi=80)
I used Scikit-learn's IterativeImputer to impute the missing values for many of the features with null values. Basically, it models each feature as a function of the other features, using that model to "predict" what the missing value would have been if it wasn't null.
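For reference, here is a minimal sketch of IterativeImputer used on its own, outside the pipelines where it actually appears later; the column subset is just an example.
# Standalone example of IterativeImputer (the real pipelines apply it as a pipeline step)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

example_cols = ["num_pages", "num_ratings", "num_reviews", "avg_rating"]
example_imputer = IterativeImputer(random_state=92, max_iter=10)
imputed_values = example_imputer.fit_transform(books[example_cols])
books_imputed_example = pd.DataFrame(imputed_values, columns=example_cols, index=books.index)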
There are a couple of features that had to be dealt with differently, original_publish_year and series, as their null values actually held information.
For original_publish_year, I assumed that a null value indicated that the book had not been previously published. And similarly for series, I assumed null indicated a standalone book. I transformed both of these features into binary: 0 indicating the value was null, 1 indicating it was not.
# === Encode the 'genres' feature === #
def encode_book_genres(df: pd.DataFrame):
"""Deconcatenates top 30 book genres into
separate features, OneHotEncoding style.
"""
from ast import literal_eval
# === Convert 'genres' to python list === #
df["genres"] = df["genres"].apply(literal_eval)
# Create a set of all distinct genres listed in dataset
all_genres = {genre for row_genres in df["genres"] for genre in row_genres}
# Create a new feature for every genre
for genre in all_genres:
has_genre = lambda g: genre in g
df[genre] = df.genres.apply(has_genre)
# Create list of top 30 most common genres
# NOTE: I ended up only using 'fiction', the top result
most_common_genres = df[list(all_genres)].sum().sort_values(ascending=False).head(1)
most_common_genres = most_common_genres.index.tolist()
# Drop all but the top 30 genres from the dataframe
unwanted_genres = list(all_genres - set(most_common_genres))
df = df.drop(columns=unwanted_genres)
# Drop the original "genres" feature
df = df.drop(columns=["genres"])
# Convert from Boolean to binary
df = df.replace(to_replace={True: 1, False:0})
# Format column names with pyjanitor
df = (df.clean_names())
return df
books = encode_book_genres(books)
books.head(2)
books["series"] = books["series"].notnull().replace(to_replace={True: 1, False:0})
books["series"].value_counts()
books["republish"] = books["original_publish_year"].notnull().replace(to_replace={True: 1, False:0})
books = books.drop(columns=["original_publish_year"])
books["republish"].value_counts()
books.dtypes
books.describe(exclude="number").T.sort_values(by="unique")
books[books["title"] == "When Breath Becomes Air"]
Looks like there are some duplicates. I'm going to use title, author, and publish_year as the subset this time.
books = books.drop_duplicates(subset=["title", "author", "publish_year"])
books.shape
It looks like there were around 200 duplicates that slipped through the cracks. Not this time!
Distributions and outliers
As can be expected with media such as books, there were some large outliers with regard to popularity. While the vast majority of books had relatively few ratings and even fewer reviews, there is a relatively small group of authors who struck gold, with them and their books becoming household names.
I found it somewhat surprising that my dataset reflected this reality so closely, as I thought a list of "must reads" would be biased toward more popular books. That bias is likely present—the dataset is large enough to include many books that never became household names—though not to the degree I initially thought.
# Caption: Distribution of the number of reviews.
plt.figure(figsize=(16, 8))
sns.distplot(books["num_reviews"])
plt.title("Distribution of 'num_reviews'");
# plt.savefig("num_reviews_distplot.png", dpi=160)
The distributions of the number of ratings and reviews are both heavily skewed to the right. The box in the boxplot below spans the interquartile range, and the whisker sitting just barely to its right extends 1.5 times that range past the third quartile. Basically, anything beyond that whisker can be considered an outlier.
Another way to look at the dispersion of the data is to consider the mean versus the median, along with the standard deviation. In the case of num_ratings, the relative size of the mean (46,958, with a standard deviation of 212,398!) compared to the median (4,135) indicates that the mean is significantly influenced by outliers.
# Caption: Boxplot showing distribution of the number of ratings.
plt.figure(figsize=(16, 8))
sns.boxplot(x=books["num_ratings"])
plt.title("Distribution of 'num_ratings'");
# plt.savefig("num_ratings_boxplot.png", dpi=160)
Curious what those books are?
# Caption: Books with the most ratings.
hiratings = books.nlargest(20, ["num_ratings"]).set_index("title")["num_ratings"]
plt.figure(figsize=(16, 8))
sns.barplot(hiratings, hiratings.index, palette="deep")
plt.title("Books with the most ratings");
# plt.tight_layout()
# plt.savefig("books_most_ratings.png", dpi=160)
Dealing with the outliers
Notice anything about the books with the most ratings—i.e. the most popular?
Every one of the top 12 most popular books is fiction.
Based on that observation, I figured the model would find these particular outliers to be useful. However, given the huge range of these features, they would have to be scaled for use in any linear models. To this end, I included an instance of Scikit-learn's StandardScaler in my pipeline, which standardizes numeric features, transforming the data such that each feature has a mean of zero and a standard deviation of 1. Basically, that just puts all of the numeric features on the same scale.
Scaling the data is very important for linear models or neural networks—not so much for decision tree-based models. Therefore, I only used the StandardScaler as needed.
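As a quick illustration (this is not part of the modeling pipeline itself, where the scaler is applied as a pipeline step), standardizing one of the skewed features looks like this:
# Standardize 'num_ratings' on its own to show the effect of StandardScaler
scaler = StandardScaler()
scaled_ratings = scaler.fit_transform(books[["num_ratings"]].dropna())
print(scaled_ratings.mean(), scaled_ratings.std())  # approximately 0 and 1 after scaling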
# Caption: Boxplot showing the distribution of number of pages.
plt.figure(figsize=(16, 8))
sns.boxplot(x=books["num_pages"])
plt.title("Distribution of 'num_pages'");
# plt.savefig("num_pages_boxplot.png", dpi=160)
Another feature that had stark outliers was the number of pages. There were several books with over 5,000 pages when the majority had less than 500. As with the outliers discussed above, I figured that the number of pages was relevant to the model.
Upon further inspection, however, I found that most of the outliers in this case were not actually single books, but entire book series. Therefore, I decided to remove the farthest outliers—those over 2,000 pages in length.
At this point, I was left with 18,344 rows in my dataset.
# This could be one of the sliders on the dashboard
cutoff = 2000
books_over_pages_cutoff = books[books["num_pages"] > cutoff]
print(books_over_pages_cutoff.shape)
books_over_pages_cutoff.head(2)
books2 = books.drop(books_over_pages_cutoff.index, axis=0)
print(books2.shape)
books2.head(2)
books2["num_pages"].describe()
Feature relationships
Another line of inquiry I wanted to explore before diving into the model was relationships between features and between the target and the features. The obvious first step was to create a matrix of scatterplots between what I believed to be the most important features and the target. As the target is binary, I thought it best to represent it as a third dimension in each of the scatterplots: color.
# Caption: Matrix of scatterplots showing relationships between certain features.
sns.pairplot(
books2,
hue="fiction",
vars=["num_reviews", "avg_rating", "num_pages"],
palette="deep",
plot_kws=dict(alpha=0.8, s=20),
height=4,
);
Although it's difficult to see much in the way of detail, this matrix is great for getting an overall idea of what's going on between the features in question.
One characteristic that immediately stuck out to me is that the majority of outliers in num_ratings are fiction books. This helps to confirm my earlier hypothesis that these outliers will be valuable to the model.
Also, I noticed that avg_rating seems to be different between fiction and nonfiction books. The comparison of distributions (middle-middle) and the scatter plots comparing avg_rating to num_pages (middle-bottom and middle-right) seem to indicate that nonfiction books are rated slightly higher than fiction.
Seeing this difference in densities when grouped by the target made me want to explore other features in this manner. An interesting one I found was publish_year. As can be seen in the plot below, it seems that the "golden years" of fiction were in the mid-2000s (at least according to this reading list), whereas the mid-2010s had more good nonfiction.
# Caption: Distribution of books by fictionality.
sns.pairplot(
books2[books2["publish_year"] > 1960],
hue="fiction",
vars=["publish_year"],
palette="deep",
plot_kws=dict(alpha=0.8, s=20),
height=8,
);
Although the exploration above is relatively basic, I can see that there seem to be some meaningful relationships in the data. Most important to my purposes is confirming that there are patterns that indicate some differentiation in the features when grouping them by the target.
With that in mind, I felt good enough to get into playing with some models.
Baseline Models
As I discussed above, the most complex model is not necessarily the best model. It's generally a good idea to start with a simple baseline model and progressively add complexity in subsequent iterations. The chosen evaluation metric will (ideally) indicate when the added complexity is beneficial, and when it is not.
train, test = train_test_split(books, stratify=books["fiction"], test_size=0.2, random_state=92)
train, val = train_test_split(train, stratify=train["fiction"], test_size=0.2, random_state=92)
train.shape, val.shape, test.shape
target = "fiction"
# Arrange y vector
y_train = train[target]
y_val = val[target]
y_test = test[target]
print(y_train.shape, y_val.shape, y_test.shape)
# Arrange X matrices
X_train = train.drop(columns=[target])
X_val = val.drop(columns=[target])
X_test = test.drop(columns=[target])
print(X_train.shape, X_val.shape, X_test.shape)
Majority class
The simplest possible model is to simply predict the majority class every time, regardless of input. Assuming an evenly distributed binary target, that model should be right about half the time, or have a 50% accuracy. In my case, the majority class baseline is just north of that, around ~.52, or ~52%.
Now I know that no matter what model I end up choosing and what features I end up using, it must have an accuracy of more than .52. If I can't beat that...
print(y_train.value_counts())
sns.distplot(y_train);
maj = y_train.mode()[0] # Mode is 1 (fiction)
# Simply predict 1 for every training example
y_pred_maj = [maj] * len(y_train)
# Baseline accuracy
accuracy_score(y_train, y_pred_maj)
Limited logistic regression
Taking it a step further, I find it useful to get a second baseline using an algorithm with a bit more complexity than that, but far less complexity than what is possible given the problem parameters. In this case I chose to train a limited logistic regression model using only a few of the features: num_reviews, avg_rating, and num_pages. This simple baseline model had an accuracy of ~.63 and an F1 score of ~.65.
I will go into some detail on what the F1 score is in the Model Validation section. For now, suffice to say that the discrepancy between it and the accuracy is due to the slight imbalance of the classes in the target.
base_features = [
"num_reviews",
"avg_rating",
"num_pages",
]
# Arrange X matrices
X1_train = train[base_features]
X1_val = val[base_features]
X1_test = test[base_features]
X1_train.shape, X1_val.shape, X1_test.shape
pipe1 = Pipeline([
("scaler", StandardScaler()),
("imputer", SimpleImputer(strategy="median")),
("logreg", LogisticRegression(random_state=92)),
])
# Train base pipeline
pipe1.fit(X1_train, y_train)
y_pred1 = pipe1.predict(X1_val)
# Compute accuracy
print("Baseline accuracy:", accuracy_score(y_val, y_pred1))
confusion_matrix(y_val, y_pred1)
Default random forest
I actually trained one more baseline model to set the bar higher and get a more precise idea of how well a model can do given the problem, the available features, and the amount of data in the dataset.
This last baseline was a random forest that I left with default hyperparameters, trained on the full set of features as they were at this point in the process. This more complex—maybe even overly complex—model did quite a bit better, with an accuracy of ~.73 and an F1 of ~.75.
That's the score to beat as I iterate on the model and add complexity.
def_drop_columns = [
"title",
"author",
"language",
]
X2_train = X_train.drop(columns=def_drop_columns)
X2_val = X_val.drop(columns=def_drop_columns)
X2_test = X_test.drop(columns=def_drop_columns)
rf1_pipe = Pipeline([
("encoder", ce.OrdinalEncoder()),
("imputer", SimpleImputer(strategy="median")),
("rfc", RandomForestClassifier(random_state=92)),
])
# Train default random forest
rf1_pipe.fit(X2_train, y_train)
# Make predictions to get validation accuracy
y_pred_rf1 = rf1_pipe.predict(X2_val)
# Compute evaluation metrics
print("Default random forest eval metrics:")
print(" Accuracy:", accuracy_score(y_val, y_pred_rf1))
print(" F1 score:", f1_score(y_val, y_pred_rf1))
Model validation
As always, before jumping into building and training, I needed to make some decisions about how I will measure success, or, more accurately, how the model measures success. Choosing appropriate evaluation metrics is crucial to getting an accurate understanding of how well a model is fitting (or overfitting) the data. An overly complex model will likely overfit the training data, leading to a model that performs poorly on new inputs on which it was not trained. Decision trees are particularly prone to overfitting.
Target distribution
The target I'm building a model to predict is binary, 1 or 0; fiction or nonfiction. Therefore, the metrics I choose must be appropriate for a binary classification model. The most common way to measure the performance of a binary classifier is the accuracy score—i.e. the proportion of predictions correctly classified by the model.
A very important characteristic to look at when choosing a metric is the distribution of the target variable. In the case of binary classification, if the target is skewed one way or the other, accuracy is not a good method of evaluating performance.
Although the target in my dataset is very evenly distributed between fiction (~54%) and nonfiction (~46%), it still is skewed a little bit. This means that if I use accuracy, I can expect it to be a decent indicator of performance, but slightly less than ideal. The reason for this is that accuracy only tracks the number of the model's mistakes, irrespective of how the model made those mistakes.
Types of errors
In the case of binary classification, there are two ways a prediction can be wrong: false positive (type I error) or false negative (type II error). As the name suggests, false positive is when the model (falsely) predicts positive or 1 when the actual label is 0. False negative is the alternative to that, when the model predicts negative or 0 when the label is 1.
A better method of evaluating performance will take into account the number of each type of error in a model's predictions. Depending on the target and what the model will be used for, it may be better to use a metric that looks at one or the other.
A common example is a spam detection model. In this case, it is generally better to let some spam emails through the filter into the users' inboxes (false negative) than to put potentially important emails into the spam folder (false positive).
Confusion matrix
A common and effective way to get deeper insight into a classification model's predictions is with a confusion matrix, which is a type of contingency table with the target's classes laid out along both axes, the rows representing the actual classes and the columns showing the predicted classes.
# Caption: Confusion matrix for default random forest model.
from sklearn.utils.multiclass import unique_labels
unique_labels(y_val) # Create unique labels
def plot_confusion_matrix(y_true, y_pred):
labels = unique_labels(y_true)
columns = [f'Predicted {label}' for label in labels]
index = [f'Actual {label}' for label in labels]
table = pd.DataFrame(confusion_matrix(y_true, y_pred),
columns=columns, index=index)
return sns.heatmap(table, annot=True, fmt='d', cmap='viridis')
# Plot the confusion matrix
plt.figure(figsize=(10, 8))
plt.title("Confusion matrix: default random forest")
font = {'family' : 'normal',
'weight' : 'bold',
'size' : 14}
plt.rc('font', **font)
plot_confusion_matrix(y_val, y_pred_rf1);
Where each row and column intersect, it shows the number of predictions that fall into that category. In the case of binary classification, this results in a table containing four numbers: true positives, true negatives, false positives, and false negatives. A classifier that made no mistakes would have non-zero numbers only on the main diagonal (top left to bottom right).
Evaluation metrics
There are three main metrics that can be derived from the confusion matrix. The first is precision: of all the examples the model predicted to be positive (true positives + false positives), the proportion that actually are positive (true positives). The second is recall: of all the examples that actually are positive (true positives + false negatives), the proportion the model correctly predicted to be positive (true positives). The last metric I'll be covering now is the F1 score: the harmonic mean of precision and recall.
I chose to use the F1 score as the primary evaluation metric for my model because I'm more interested in reducing the number of mistakes in general, as opposed to preferring to optimize for one or the other.
true_pos = 1156
false_pos = 400
precision = true_pos / (true_pos + false_pos)
print(f"Precision: {precision}")
true_pos = 1156
false_neg = 380
recall = true_pos / (true_pos + false_neg)
print(f"Recall: {recall}")
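Combining the two gives the F1 score, the harmonic mean of precision and recall, which should line up with the positive-class row of the classification report below.
# F1 is the harmonic mean of the precision and recall computed above
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 score: {f1}")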
from sklearn.metrics import classification_report
print(classification_report(y_val, y_pred_rf1))
Let's see what we can do to increase that score.
Feature Engineering
When I initially started this project, I went through the process of validating and training a model or two that tried to predict the average rating of books. This was by far the most common target chosen by those who started Kaggle kernels using other GoodReads datasets. Although this may have the most obvious business value if I was a data scientist working for a book publisher, to me this wasn't a particularly interesting target to try to predict.
Target practice
I realized this when I hit a wall with my progress in improving the rating-predictor model. One reason was that I did not see any obvious useful features that could be engineered. However, once I found my way to the idea of predicting the fictionality of the books, the target drove the direction I took with my feature engineering. It was a great learning experience for me in engineering features toward the specific target that the model is trying to predict. I called this process "target practice".
Here are the feature ideas I came up with and engineered (all in short succession once the new target was chosen):
- Title begins with "The"
- Has subtitle: contains ":"
- Title character count
- Title word count
- Title longest word
- Author number of names
- Author middle initial
- Ratings (stars) ratio: (1-star + 2-star counts) / (4-star + 5-star counts)
def engineer_features(data):
"""Engineer a handful of new features."""
# Create new feature that is if the title begins with "The"
data["the_title"] = data["title"].str.startswith("The")
# New feature - has_subtitle
data["has_subtitle"] = data["title"].str.contains(":")
# New feature - title character length
data["title_char_count"] = data["title"].apply(lambda x: len(x))
# New feature - title word count
data["title_word_count"] = data["title"].apply(lambda x: len(x.split()))
# New feature - title longest word
data["title_longest_word"] = data["title"].apply(lambda x: len(max(x.split(), key=len)))
# New feature - author number of names
data["author_name_count"] = data["author"].apply(lambda x: len(x.split()))
# New feature - author middle initial
pat = r"\w* (\w. )+ \w*"
data["author_middle_initial"] = data["author"].str.contains(pat, regex=True)
# New feature - low/high rating ratio
data["rating_ratio"] = (data["1_rating_count"] + data["2_rating_count"]) / (data["4_rating_count"] + data["5_rating_count"])
# Replace Boolean with binary
data = data.replace(to_replace={True: 1, False:0})
return data
Forest with new features
To get an initial idea of the effect that these new features had on the model, I retrained a new random forest with all the same hyperparameters (defaults). The result was a significant boost in the model's F1 score, from ~.75 to ~.79. That made me happy.
However, because I engineered all of the new features at once—i.e. I didn't retrain the model after every one—this does not give me insight into which ones were useful. In fact, at this point I hadn't looked at how useful any of the specific features were.
X3_train = engineer_features(X_train)
X3_val = engineer_features(X_val)
X3_test = engineer_features(X_test)
rf2_pipe = Pipeline([
("encoder", ce.OrdinalEncoder()),
("imputer", SimpleImputer(strategy="median")),
("rfc", RandomForestClassifier(random_state=92)),
])
# Train default random forest
rf2_pipe.fit(X3_train, y_train)
# Make predictions to get validation accuracy
y_pred_rf2 = rf2_pipe.predict(X3_val)
# Compute evaluation metrics
print("Default random forest eval metrics:")
print(" Accuracy:", accuracy_score(y_val, y_pred_rf2))
print(" F1 score:", f1_score(y_val, y_pred_rf2))
Permutation Importances
It is likely that some of the features do not help the model make correct predictions. Indeed, some may even be worse than that: they could add noise that makes the model perform worse.
To address this potential problem, I'm going to find each feature's importance to the model using a method called permutation importance. Basically, this method goes through the features one by one, randomly shuffling (permuting) that feature's values so they no longer line up with the rest of each row. The model's performance is then evaluated and compared with its score on the original data to estimate the effect that each feature has on the performance of the model.
The following table is the result of running permutation importance on both the new and old features.
# Use the same (fitted) steps from main pipeline
transformers = Pipeline([
("encoder", rf2_pipe.named_steps["encoder"]),
("imputer", rf2_pipe.named_steps["imputer"]),
])
# Encode and impute
X3_train_transformed = transformers.transform(X3_train)
X3_val_transformed = transformers.transform(X3_val)
permuter = PermutationImportance(
rf2_pipe.named_steps["rfc"],
scoring='f1',
n_iter=5,
random_state=42
)
permuter.fit(X3_val_transformed, y_val)
feature_names = X3_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values(ascending=False)
eli5.show_weights(
permuter,
top=None, # Show permutation importances for all features
feature_names=feature_names
)
Along with being useful for feature selection, I find it interesting to see what features have the largest positive effect on the model's predictive power. From this table, I can see that the majority of the benefit I got from engineering the new features came from has_subtitle. This feature, according to the permutation importance table, is the most important predictor of the lot, and simply indicates whether the title of the book has a colon in it. My intuition was that having a subtitle is very common for nonfiction books, not so much for fiction. It seems that my intuition was generally good.
more_drop_cols = [
"publish_month",
"author",
"title_word_count",
"title",
"publish_day",
]
# New features are already engineered
X4_train = X3_train.drop(columns=more_drop_cols)
X4_val = X3_val.drop(columns=more_drop_cols)
X4_test = X3_test.drop(columns=more_drop_cols)
rf3_pipe = Pipeline([
("encoder", ce.OrdinalEncoder()),
("imputer", SimpleImputer(strategy="median")),
("rfc", RandomForestClassifier(random_state=92)),
])
# Train default random forest
rf3_pipe.fit(X4_train, y_train)
# Make predictions to get validation accuracy
y_pred_rf3 = rf3_pipe.predict(X4_val)
# Compute evaluation metrics
print("Default random forest:")
print(" Accuracy:", accuracy_score(y_val, y_pred_rf3))
print(" F1 score:", f1_score(y_val, y_pred_rf3))
Once I removed those features, I saw another bump of ~.01 in the model's F1 score. That put the model up to an F1 score just south of .80 on the validation set.
Now that I'd engineered some new features and pruned them to only the useful ones, it was time to iterate and tune.
Cross-validation
Up until this point, I had been using a consistent train-test-validation set split. However, now that I was going to be tuning hyperparameters, it made sense to start using cross-validation. I won't get into the details of what that is in this particular article. If you're curious, search for it on Duck Duck Go, or read about it in The Python Data Science Handbook.
# Start from original dataset, because data
# will only be split into train and test
books2 = engineer_features(books)
print(books2.shape)
books2.head(2)
train, test = train_test_split(books2, stratify=books2["fiction"], test_size=0.2, random_state=92)
train.shape, test.shape
# No val this time bc using cross-validation
target = "fiction"
drop_cols = [ # Columns not useful to model
"title",
"author",
"language",
"publish_month",
"publish_day",
]
# Arrange y vector
y_train = train[target]
y_test = test[target]
print(y_train.shape, y_test.shape)
# Arrange X matrices
X_train = train.drop(columns=[target] + drop_cols)
X_test = test.drop(columns=[target] + drop_cols)
print(X_train.shape, X_test.shape)
Hyperparameter tuning
Hyperparameters are the parameters of the model that are not directly learned during the training process—they must be adjusted manually. My preferred method of tuning them is with a randomized search.
Essentially, I provide a list of algorithm hyperparameters and a search window or distribution for each one. The algorithm runs through a specified number of iterations, each time randomly selecting values for each hyperparameter from their respective distribution and using those to train a new model. Each model is evaluated after each iteration using cross-validation. Then, once the search is over, the pipeline is refitted using the values that resulted in the model with the best cross-validation score.
As the search is random and limited to a reasonable number of iterations, some strategy is involved to find optimal search windows/distributions for each hyperparameter. The method I used for this is to start out searching a wide range of values and go through the process a few times, progressively narrowing the range around the values that come up as consistently optimal.
Validation
Before getting too deep into tuning the hyperparameters of a specific algorithm (it is computationally expensive to run exhaustive searches), I thought it best to test out a few different algorithms. I started out with the assumption that random forest or gradient-boosting would provide the best results, but it's generally good to test assumptions.
Though I did not test a very wide range of algorithms, I figured it would be worth it anyways to see if any of the more commonly-used ones showed any promise.
Somewhat to my surprise, the best models from tuning and cross-validation were basically on par with the default, non-cross-validated random forest.
The fact that the baseline was trained without using cross-validation could be one reason. In other words, the best model from the search might generalize better to new data; it could outperform the default one on the test dataset.
Or, it could be a result of the parameters and the ranges I searched, i.e. not giving wide enough initial distributions. However, the wider the distributions are relative to the number of iterations, the more "noise" will be present in the results of the search. I wanted to wait until I had chosen an algorithm before trying to increase the number of search iterations significantly.
Here is the full list of algorithms I trained during the validation and iteration process, and their best F1 score:
- Logistic regression: ~.76
- K-nearest neighbors: ~.69
- Random forest: ~.80
- Gradient-boosted decision tree: ~.80
# Tune hyperparameters using cross-validation
rf3_pipe = Pipeline([
("encoder", ce.OrdinalEncoder()),
("imputer", IterativeImputer(random_state=92)),
("rfc", RandomForestClassifier(random_state=92)),
])
rf3_params = {
"imputer__initial_strategy": ["median", "most_frequent"],
"imputer__max_iter": randint(16, 40),
"imputer__tol": uniform(0.001, 0.05),
"imputer__n_nearest_features": randint(2, 10),
"imputer__imputation_order": ["ascending", "roman", "random"],
"rfc__n_estimators": randint(80, 300),
"rfc__max_depth": randint(6, 32),
"rfc__min_samples_split": uniform(0, 1),
}
# Define the search using parameter distros above
rf3_search = RandomizedSearchCV(
rf3_pipe,
param_distributions=rf3_params,
n_iter=5,
cv=5,
scoring='f1',
verbose=10,
return_train_score=True,
n_jobs=-1,
random_state=92,
)
# Run the randomized search for the random forest pipeline
rf3_search.fit(X_train, y_train)
# Best combination of hyperparameters and their resulting f1 score
print("Best hyperparameters", rf3_search.best_params_)
print("F1 score", rf3_search.best_score_)
nn_pipe = Pipeline([
("encoder", ce.OrdinalEncoder()),
("imputer", IterativeImputer(random_state=92)),
("nn", KNeighborsClassifier()),
])
nn_params = {
"imputer__initial_strategy": ["median", "most_frequent"],
"imputer__max_iter": randint(16, 40),
"imputer__tol": uniform(0.001, 0.05),
"imputer__n_nearest_features": randint(2, 10),
"imputer__imputation_order": ["ascending", "roman", "random"],
"nn__n_neighbors": randint(2, 20),
"nn__weights": ["uniform", "distance"],
"nn__algorithm": ["ball_tree", "kd_tree"],
"nn__leaf_size": randint(20, 50),
}
# Define the search using parameter distros above
nn_search = RandomizedSearchCV(
nn_pipe,
param_distributions=nn_params,
n_iter=5,
cv=5,
scoring="f1",
verbose=10,
return_train_score=True,
n_jobs=-1,
random_state=92,
)
# Run the randomized search for the k-nearest neighbors pipeline
nn_search.fit(X_train, y_train)
# Best combination of hyperparameters and their resulting f1 score
print("Best hyperparameters", nn_search.best_params_)
print("F1 score", nn_search.best_score_)
It seems that the random forest is quite a bit better suited to this problem than k-nearest neighbors. Therefore, I won't be moving forward with nearest neighbors.
The last algorithm I'll try is a gradient-boosted decision tree classifier from XGBoost: XGBClassifier.
Gradient Boosting
Training a gradient-boosted decision tree using XGBoost.
Though I don't have a record of every single iteration of the classifier search below, the method I used to tune it was basically to look at the best values found for each hyperparameter and move the search range to fit more closely around those values.
I was surprised to find that my initial attempts at training the XGBClassifier had about the same performance as the default random forest with the newly-engineered features.
As I mentioned above, one hypothesis for the lack of improvement (I assumed gradient boosting would increase the performance, which maybe wasn't a sound assumption) is the simple fact that the randomized search doesn't cover every possibility. To test this, I increased the number of iterations and let 'er rip!
from xgboost import XGBClassifier
xgb1_pipe = Pipeline([
("encoder", ce.OrdinalEncoder()),
("imputer", IterativeImputer(random_state=92)),
("xgb", XGBClassifier(random_state=92)),
])
xgb1_params = {
"imputer__initial_strategy": ["median", "most_frequent"],
"imputer__max_iter": randint(16, 45),
"imputer__tol": uniform(0.02, 0.04),
"imputer__n_nearest_features": randint(2, 10),
"imputer__imputation_order": ["ascending", "roman", "random"],
"xgb__n_estimators": randint(80, 160),
"xgb__max_depth": randint(18, 48),
"xgb__learning_rate": uniform(0.05, .5),
}
# Define the search using parameter distros above
xgb1_search = RandomizedSearchCV(
xgb1_pipe,
param_distributions=xgb1_params,
n_iter=10,
cv=4,
scoring="f1",
verbose=10,
return_train_score=True,
n_jobs=-1,
random_state=92,
)
# Run the randomized search for the XGBoost pipeline
xgb1_search.fit(X_train, y_train)
# Best combination of hyperparameters and their resulting f1 score
print("F1 score", xgb1_search.best_score_)
print("Best hyperparameters:")
for param, val in xgb1_search.best_params_.items():
print( f"{param}: {val}")
Final algorithm
Even with 40 total fits (4 cross-validation folds, 10 iterations) the gradient-boosted classifier did not really outperform the random forest by any significant margin. Given the additional complexity and computation required for an XGBoost (gradient-boosting) model, I decided to move forward with the random forest classifier.
To continue testing the hypothesis that my initial number of iterations was too low for the search to converge on a good combination of hyperparameters, I trained more random forests with higher numbers of search iterations.
To my continued surprise, even after many, many more iterations, the absolute best F1 score I got still hovered right around ~.80.
rf4_pipe = Pipeline([
("encoder", ce.OrdinalEncoder()),
("imputer", IterativeImputer(random_state=92, n_nearest_features=3)),
("rfc", RandomForestClassifier(random_state=92)),
])
rf4_params = {
"imputer__initial_strategy": ["median", "most_frequent"],
"imputer__max_iter": randint(8, 20),
"imputer__tol": uniform(0.01, 0.04),
"imputer__imputation_order": ["ascending", "roman", "random"],
"rfc__n_estimators": randint(140, 200),
"rfc__max_depth": randint(6, 18),
"rfc__min_samples_split": randint(6, 14),
"rfc__min_impurity_decrease": uniform(0, .01),
}
# Define the search using parameter distros above
rf4_search = RandomizedSearchCV(
rf4_pipe,
param_distributions=rf4_params,
n_iter=15,
cv=5,
scoring='f1',
verbose=10,
return_train_score=True,
n_jobs=-1,
random_state=92,
)
# Run the randomized search for the final random forest pipeline
rf4_search.fit(X_train, y_train)
# Best combination of hyperparameters and their resulting f1 score
print('F1 score', rf4_search.best_score_)
print('Best hyperparameters:')
for param, val in rf4_search.best_params_.items():
print( f"{param}: {val}")
rf4 = rf4_search.best_estimator_
y_pred_test_rf4 = rf4.predict(X_test)
# Compute evaluation metrics
print("Random forest 4 eval metrics:")
print(" Accuracy:", accuracy_score(y_test, y_pred_test_rf4))
print(" F1 score:", f1_score(y_test, y_pred_test_rf4))
plt.figure(figsize=(14, 12))
plt.title("Confusion matrix: final random forest, test data")
font = {'family' : 'normal',
'weight' : 'bold',
'size' : 22}
plt.rc('font', **font)
plot_confusion_matrix(y_test, y_pred_test_rf4);
train_id = X_train.reset_index()["index"]
test_id = X_test.reset_index()["index"]
test_id.head()
# Process the test data
transformers_2 = Pipeline([
("encoder", rf4_search.best_estimator_["encoder"]),
("imputer", rf4_search.best_estimator_["imputer"]),
])
# Encode and impute
X_test_transform = transformers_2.transform(X_test)
class_index = 1
# Make predictions (probabilities) with the trained random forest on the already-transformed data
y_pred_proba_rf4 = rf4_search.best_estimator_["rfc"].predict_proba(X_test_transform)[:, class_index]
# ROC AUC score ranges from 0-1; higher is better
print(f'Test ROC AUC for class {class_index}:')
print(roc_auc_score(y_test, y_pred_proba_rf4))
X_test = X_test.reset_index()
X_test.head()
X_test.shape, test_id.shape, y_pred_test_rf4.shape, y_pred_proba_rf4.shape, y_test.shape
y_test.reset_index().head()
# Create new dataframe to compare the predictions to the actual
df = pd.DataFrame({
"index": test_id,
"pred": y_pred_test_rf4,
"pred_proba": y_pred_proba_rf4,
})
print(df.shape)
df.head()
df = df.merge(y_test.reset_index())
print(df.shape)
df.head()
df = df.merge(
X_test,
how='left'
)
print(df.shape)
df.head()
df_wrong = df[df["pred"] != df["fiction"]]
print(df_wrong.shape)
df_wrong.head()
df_wrong = df_wrong.merge(books.iloc[df_wrong["index"]]["title"].reset_index())
print(df_wrong.shape)
df_wrong.head()
df_wrong.sample(n=60, random_state=92).sort_values(by="pred_proba")
Predicted probabilities
One way to interpret a trained model is to inspect the predictions that the model made on the test set, either individually or through various descriptive statistics and visualizations. A key component to this is the probabilities underlying each of the predictions.
By looking at predicted probabilities, I can look at instances when the model was very sure or not sure of its predictions, and if those predictions were correct or not. By looking at the cases where the model made mistakes, I can hopefully pick out patterns and glean some insight why the model made those mistakes.
If I find that the model consistently makes mistakes that don't make much sense, that could mean there is some more optimization to be done or weakness to be addressed.
False negatives
The first one that jumped out at me was subtitles. The model finds that feature very useful, yet of course there are some fiction books with subtitles. This mistake makes sense to me, and for the benefit that the feature adds to the model, it is worth incorrectly classifying some fiction books as nonfiction. One way to get around this would be to engineer another feature or three that attempts to be the complement to that feature, catching the instances when fiction books have subtitles.
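As a purely hypothetical illustration of that idea (not a feature I actually built), a complementary feature could flag fiction-style subtitles on the raw title column before it gets dropped:
# Hypothetical complement to has_subtitle: flag subtitles that read like fiction
# (e.g. titles ending in "A Novel"), to help separate them from the typical
# nonfiction "Topic: Explanatory Subtitle" pattern
books["fiction_style_subtitle"] = (
    books["title"].str.contains(":")
    & books["title"].str.lower().str.strip().str.endswith("a novel")
).astype(int)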
Another feature that causes a similar type of error is the title character count. According to the feature importances table (and plot generated from the table's data, shown below) of the final random forest model, 'title_char_count' is also a very important feature. I can see that many of the false negatives (predicted nonfiction, actually fiction) have a high title character count.
rf4 = rf4_search.best_estimator_["rfc"]
importances = pd.Series(rf4.feature_importances_, X_train.columns)
# Plot feature importances
n = 20
plt.figure(figsize=(10,n/2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh(color='grey');
The following SHAP plots formalize my observations and provide more insight by showing the effect that the value of each feature had on the model's prediction. The following group are all false negative predictions, meaning the model predicted nonfiction when it was actually fiction.
I also wanted to see what features caused false negatives when the book had neither a high character count nor a subtitle. The third plot below shows the top three as republish, avg_rating, and publish_year. I was surprised to see republish, as it does not seem to be all that important to the model overall. The other two seem to be relatively good predictors, particularly avg_rating.
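The plots come from the shap library. Here is a rough sketch of how one of these single-prediction force plots can be generated, assuming the older shap API where shap_values returns one array per class; the row index is arbitrary.
import shap

# Explain the final random forest's predictions on the transformed test data
rfc_final = rf4_search.best_estimator_["rfc"]
explainer = shap.TreeExplainer(rfc_final)
row = X_test_transform[[0]]  # one already-encoded/imputed observation
shap_values = explainer.shap_values(row)
# For sklearn classifiers, older shap versions return a list with one array per
# class; index 1 corresponds to the positive class (fiction)
shap.initjs()
shap.force_plot(
    explainer.expected_value[1],
    shap_values[1],
    row,
    feature_names=X_train.columns.tolist(),
)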
False positives
The other side of the confusion matrix again corroborates my initial observation. Many of the same features that provided some additional predictive power also mean some additional mistakes. However, the overall effect was net positive. If it wasn't these features causing the incorrect predictions, it would be others, and there would be more incorrect predictions overall.
Deployment
So there you have it, the process of training a machine learning model, from start to finish. The only thing left to do is to deploy it.
I thought it would be interesting to be able to set up a dashboard that can be used to give dynamic inputs to the model and see the resulting prediction. I decided to build the dashboard using Plotly Dash and deploy it to Heroku.
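For a rough idea of what that looks like, here is a minimal sketch of a Dash app wired up to a small saved pipeline (something like pipe1 from the baseline section, saved with joblib to a hypothetical model.joblib); the real app's layout and inputs are more involved.
import joblib
import pandas as pd
from dash import Dash, dcc, html, Input, Output

model = joblib.load("model.joblib")  # hypothetical path to a saved pipeline

app = Dash(__name__)
app.layout = html.Div([
    html.H2("print(fiction)"),
    dcc.Slider(id="num-pages", min=0, max=2000, step=10, value=300),
    html.Div(id="prediction"),
])

@app.callback(Output("prediction", "children"), Input("num-pages", "value"))
def predict(num_pages):
    # Build a one-row input with the features the small pipeline was trained on
    row = pd.DataFrame([{"num_reviews": 500, "avg_rating": 4.0, "num_pages": num_pages}])
    proba = model.predict_proba(row)[0, 1]
    return f"Predicted probability of fiction: {proba:.2f}"

if __name__ == "__main__":
    app.run_server(debug=True)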
So you don't have to scroll all the way up, here's another link to the live app: print(fiction).
Shoutout to HTML5UP for the template on which I based the design of the app. I made some minor modifications to it, but most of the design was straight from one of his templates.
If you made it this far, I salute you.
As always, thank you for reading and I'll see you in the next one!