The goal of this tutorial is to describe one method of automating the process of cutting out objects (things, people, pets, etc.) from images and combining them to make a collage of sorts.

First, I go through creating binary masks for one or more objects in an image by using a class of computer vision algorithms called image segmentation. Binary mask(s) in hand(s), I go through one method (technically two, actually) of using said binary mask(s) to extract or remove part(s) of an image. Next, I do some basic image transformations (rotate, crop, and scale) on the resulting cutout. Finally, I paste the cutout on top of another image to make a collage.

Rather than drawing binary masks by hand or using proprietary software like Photoshop to manipulate and transform images, I'll show you how to automate the process using completely free, open-source tools. Namely, we'll be using Python along with a few open-source libraries:

The Problem

Selecting and separating parts of an image can be a tedious, time-consuming process. Anyone who's done a fair amount of tinkering with image manipulation using a program like Photoshop knows the struggle.

Although modern tools make the process easier, wouldn't it be nice if there was a way to automate the process?

Creating "Paws"

As an example, say I'd like to cut out my cat Hobbes from a photo in order to "Photoshop" him into a different image. Here's the photo of Hobbes I'll be using.

I think his position is perfect for creating "Hawbbes" (Jaws + Hobbes)...meh I'll call it "Paws". By cutting him out and rotating him a bit, he could be pasted onto an underwater shot of someone swimming and he could live his dream of being a fierce sharkitty.

Here's an image I found on Unsplash that would work as the background onto which Hobbes, once cut out of the image above, can be pasted.

Basically, in order to cut Hobbes out of the photo above, I'll have to make all the pixels in the image transparent except for the ones representing Hobbes. Then I'll crop, rotate, scale, and superimpose the resulting image on top of the underwater shot such that Hobbes roughly takes up the bottom half.


Image Masking

To accomplish this manually, I could spend anywhere from a few minutes to a few hours outlining Hobbes in the original image to create a mask — masking the image. The time investment depends on how easily-separable the subject is from the rest of the image, how accurate I want the cut to be, and what tools are available to me.

Regarding that last point, the magicians at Adobe have done some rather impressive black magic with Photoshop, giving users very quick and very effective methods for selecting parts of an image. However, the goal of this post is to accomplish this programmatically, without the use of any closed-source software.

A mask is basically a method of distinguishing/selecting/separating pixels. If you've never heard the term used this way before, one way to think about it is with masking tape and paint. Typically, one would put masking tape—i.e. create a "mask"—around areas on a wall that should not be painted. This is essentially what a mask is doing in any photo manipulation software: indicating what areas of an image to affect or transform (or which areas not to do so).

Here's the image of Hobbes with the image segmentation-generated masks overlayed on top of it (which we'll be creating later) showing, obviously, where Hobbes is in the image. It doesn't really matter that the model thinks he's a dog — we won't be using the predicted class, only the mask. And the mask is still good enough for our purposes.

A binary mask is a method of masking which uses a two-tone color scheme, to indicate the areas of an image to be affected and not affected. By overlaying a binary mask on top of the original image, the boundaries between the two colors can be used to affect the different areas of the image differently, whether that is making pixels transparent (removing them) or applying some sort of effect or transformation.

The white area in the image below shows the same coordinates as the orange one above, converted into a binary mask. While I've only spent any significant time with Photoshop, I'd imagine any decent image manipulation software can work with binary masks similarly to how we'll be working with them.

Computer vision

In order to generate binary masks based on the content of the image, the algorithm must be somewhat intelligent. That is, it must be able to process the image in such a way that it can recognize where the foreground is and draw a polygon around it with some degree of accuracy.

Luckily, there are a number of deep learning models that will do just that. The field is called Computer Vision, and the class of algorithm used in this article is known as image segmentation.

Don't worry if you don't have any experience with this type of thing, or even if you don't necessarily want to get experience with it. Modern machine learning tooling makes it incredibly quick and easy to get a model up and predicting with pre-trained weights. Though if you want to understand what's going on, it will likely help to know a bit of Python programming.

One caveat: the pre-trained models will usually work well with classes of objects that were in their training data. The model weights used in this post were trained on the COCO dataset, which contains 80 object classes. Depending on what the object in the foreground is that you are trying to extract, you may or may not need to extend the model with a custom dataset and training session. That is a topic for another post.


The deep learning framework used here is PyTorch, developed by Facebook AI Research (FAIR). More specifically, we'll use a computer vision framework, also developed by FAIR, called Detectron2.

Although the primary framework used in this article is Detectron2, this process should be translatable to other image segmentation models as well. In fact, I'll be adding an addendum to this post in which I'll go over using Matterport's TensorFlow-based implementation of Mask R-CNN to accomplish the exact same thing.

Heck, while I'm at it, I might as well do it with fastai as well.

End Result

I know you've been dying to see the end result of the whole process.

Without any further ado, I present to you, Paws!

[[Caption :: All of that was done with code. Pretty neat, eh?]] With that, let's get into how this masterpiece was created.


Install Detectron2 and other dependencies

As mentioned in the introduction, the framework we'll be using for image segmentation is called Detectron2. The following cells install and set up Detectron2 in a Google Colab environment (pulled from the official Detectron2 getting started notebook). If you don't want to use Colab for whatever reason, either play around with installation and setup or refer to the installation instructions.

The other top-level dependencies needed for this tutorial:

The nice thing about Colab is all of these come pre-installed. Oh yeah, you also get free access to a GPU. Thanks, Googs!

Again, simply click the "Open in Colab" badge at the top of this page, then hit File > Save a copy in Drive, which does exactly what it says: saves a copy of the notebook to your Google Drive. In addition, you can open an ephemeral copy of the notebook without saving it first by hitting File > Open in playground mode.

Once you have everything installed, we can start with some imports and configuration.

# === Some imports and setup === #

# Setup Detectron2 logger
import detectron2
from detectron2.utils.logger import setup_logger

# Common libraries
import numpy as np
import os, json, cv2, random
# Only needed when running in Colab
from google.colab.patches import cv2_imshow

# Detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from import MetadataCatalog, DatasetCatalog

Running a pre-trained Detectron2 model

Most, if not all, open-source deep learning frameworks have a set of pre-trained weights available to use. The creators of the frameworks will conduct a series of training sessions on the most commonly-used datasets in order to benchmark the performance of their algorithms. Luckily for everyone else, they typically provide the results of this training in the form of weights, which can be loaded into the model and be used for inference immediately.

For many tasks, including recognizing and outlining an image of a cat, pre-trained weights will work fine. The model weights used in this post were trained on the popular COCO dataset, which contains 80 object classes, including cats. If, for example, we wanted to do the same thing with whales or one specific person, we'd have to do some custom training.

I will be publishing a companion blog post to this one about training Detectron2 on a custom dataset. Once that is published, I'll link to it here. If there's no link yet, I haven't published it yet.

If you're curious about custom training now, the Detectron2 "Getting Started" Colab notebook also goes through one way of doing so.

Image loading and image arrays

The first thing we need in order to use the model is an image on which it can be used.

We first download the image to the local filesystem using wget, then load it into memory using cv2 (opencv-python).

im = cv2.imread("./01_hobbes.jpg")
--2020-07-20 19:35:16--
Resolving (,,, ...
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 344670 (337K) [image/jpeg]
Saving to: ‘01_hobbes.jpg’

01_hobbes.jpg       100%[===================>] 336.59K  --.-KB/s    in 0.05s   

2020-07-20 19:35:16 (7.01 MB/s) - ‘01_hobbes.jpg’ saved [344670/344670]

If you think about what a digital image actually is, it makes sense to represent it as a matrix — each row corresponds to a row of pixels, and each column a column of pixels in the image. Technically, images would be considered a 3-dimensional array, because they have width, height, and depth (number of channels).

Depending on if the image has three channels (typically RGB: red, green, blue) or four (typically RGBA: same plus an alpha channel), the values at each row-column index (or cell, like in a spreadsheet, in case that helps you visualize it) indicate the intensities of each of the color channels (and transparency, in the case of 4-channel images) for each pixel.

Thus, after the image is loaded, it really is just an array of numbers and can be utilized and manipulated just like any other array. For example, in order to rotate the image, a linear transformation can be applied to the image matrix to "rotate" the pixel values within the matrix.

Here is an example of a single row in the array representing the image of Hobbes is shown.

# === Look at the image, in array form === #
print("Image dimensions:", im.shape)
print("\nImage array - first row of 3-value sub-arrays:")
Image dimensions: (800, 600, 3)

Image array - first row of 3-value sub-arrays:
array([[172, 192, 209],
       [188, 208, 225],
       [119, 137, 154],
       [137, 151, 149],
       [139, 153, 151],
       [142, 156, 154]], dtype=uint8)

# === Look at the image, rendered === #