The goal of this tutorial is to describe one method of automating the process of cutting out objects (things, people, pets, etc.) from images and combining them to make collages.
First, I go through creating binary masks for one or more objects in an image by using a class of computer vision algorithms called image segmentation. Binary mask(s) in hand(s), I go through one method (technically two, actually) of using said binary mask(s) to extract or remove part(s) of an image. Next, I do some basic image transformations (rotate, crop, and scale) on the resulting cutout. Finally, I paste the cutout on top of another image to make a collage.
Rather than drawing binary masks by hand or using proprietary software like Photoshop to manipulate and transform images, I'll show you how to automate the process using completely free, open-source tools. Namely, we'll be using Python along with a few open-source libraries:
A link to the Colab notebook used to create this tutorial can be found here:
Selecting and separating parts of an image can be a tedious, time-consuming process. Anyone who's done a fair amount of tinkering with image manipulation using a program like Photoshop knows the struggle.
Although modern tools make the process easier, wouldn't it be nice if there was a way to automate the process?
As an example, say I'd like to cut out my cat Hobbes from a photo in order to "Photoshop" him into a different image. Here's the photo of Hobbes I'll be using.
I think his position is perfect for creating "Hawbbes" (Jaws + Hobbes)...meh I'll call it "Paws".
By cutting him out and rotating him a bit, he could be pasted onto an underwater shot of someone swimming and he could live his dream of being a fierce sharkitty.
Here's an image I found on Unsplash that would work as the background onto which Hobbes, once cut out of the image above, can be pasted.
Basically, in order to cut Hobbes out of the photo above, I'll have to make all the pixels in the image transparent except for the ones representing Hobbes. Then I'll crop, rotate, scale, and superimpose the resulting image on top of the underwater shot such that Hobbes roughly takes up the bottom half.
To accomplish this manually, I could spend anywhere from a few minutes to a few hours outlining Hobbes in the original image to create a mask — masking the image. The time investment depends on how easily-separable the subject is from the rest of the image, how accurate I want the cut to be, and what tools are available to me.
Regarding that last point, the magicians at Adobe have done some rather impressive black magic with Photoshop, giving users very quick and very effective methods for selecting parts of an image. However, the goal of this post is to accomplish this programmatically, without the use of any closed-source software.
A mask is basically a method of distinguishing/selecting/separating pixels. If you've never heard the term used this way before, one way to think about it is with masking tape and paint. Typically, one would put masking tape—i.e. create a "mask"—around areas on a wall that should _not_ be painted. This is essentially what a mask is doing in any photo manipulation software: indicating what areas of an image to affect or transform (or which areas not to do so).
Here's the image of Hobbes with the image segmentation-generated masks overlayed on top of it (which we'll be creating later) showing, obviously, where Hobbes is in the image. It doesn't really matter that the model thinks he's a dog — we won't be using the predicted class, only the mask. And the mask is still good enough for our purposes.
A binary mask is a method of masking which uses a two-tone color scheme, to indicate the areas of an image to be affected and not affected. By overlaying a binary mask on top of the original image, the boundaries between the two colors can be used to affect the different areas of the image differently, whether that is making pixels transparent (removing them) or applying some sort of effect or transformation.
While I've only spent any significant time with Photoshop, I'd imagine any decent image manipulation software can work with binary masks similarly to how we'll be working with them.
In order to generate binary masks based on the content of the image, the algorithm must be somewhat intelligent. That is, it must be able to process the image in such a way that it can recognize where the foreground is and draw a polygon around it with some degree of accuracy.
Luckily, there are a number of deep learning models that will do just that. The field is called Computer Vision, and the class of algorithm used in this article is known as image segmentation.
Don't worry if you don't have any experience with this type of thing, or even if you don't necessarily want to get experience with it. Modern machine learning tooling makes it incredibly quick and easy to get a model up and predicting with pre-trained weights. Though if you want to understand what's going on, it will likely help to know a bit of Python programming.
One caveat: the pre-trained models will usually work well with classes of objects that were in their training data. The model weights used in this post were trained on the COCO dataset, which contains 80 object classes. Depending on what the object in the foreground is that you are trying to extract, you may or may not need to extend the model with a custom dataset and training session. That is a topic for another post.
Although the primary framework used in this article is Detectron2, this process should be translatable to other image segmentation models as well. In fact, I'll be adding an addendum to this post in which I'll go over using Matterport's TensorFlow-based implementation of Mask R-CNN to accomplish the exact same thing.
Heck, while I'm at it, I might as well do it with FastAI as well.
I know you've been dying to see the end result of the whole process.
Without any further ado, I present to you, Paws!
With that, let's get into how this masterpiece was created.
As mentioned in the introduction, the framework we'll be using for image segmentation is called Detectron2. In the tutorial notebook, there are cells to set up Detectron2 in the Google Colab environment (pulled from the official Detectron2 getting started notebook). If you don't want to use Colab for whatever reason, either play around with installation and setup or refer to the installation instructions.
The other top-level dependencies needed for this tutorial:
The nice thing about Colab is all of these come pre-installed. Oh yeah, you also get free access to a GPU. Thanks, Googs!
Again, click the link below to go to the Colab notebook, then hit File > Save a copy in Drive, which does exactly what it says: saves a copy of the notebook to your Google Drive. In addition, you can open an ephemeral copy of the notebook without saving it first by hitting File > Open in playground mode.
Running a pre-trained Detectron2 model
Most, if not all, open-source deep learning frameworks have a set of pre-trained weights available to use. The creators of the frameworks will conduct a series of training sessions on the most commonly-used datasets in order to benchmark the performance of their algorithms. Luckily for everyone else, they typically provide the results of this training in the form of weights, which can be loaded into the model and be used for inference immediately.
For many tasks, including recognizing and outlining an image of a cat, pre-trained weights will work fine. The model weights used in this post were trained on the popular COCO dataset, which contains 80 object classes, including cats. If, for example, we wanted to do the same thing with whales or one specific person, we'd have to do some custom training.
I will be publishing a companion blog post to this one about training Detectron2 on a custom dataset. Once that is published, I'll link to it here. If there's no link below this paragraph, I haven't published it yet.
If you're curious about custom training now, the Detectron2 "Getting Started" Colab notebook also goes through one way of doing so.
Image loading and image arrays
The first thing we need in order to use the model is an image on which it can be used.
We first download the image to the local filesystem using wget, then load it into memory using cv2 (opencv-python).
If you think about what a digital image actually is, it makes sense to represent it as a matrix — each row corresponds to a row of pixels, and each column a column of pixels in the image. Technically, images would be considered a 3-dimensional array, because they have width, height, and depth (number of channels).
Depending on if the image has three channels (typically RGB: red, green, blue) or four (typically RGBA: same plus an alpha channel), the values at each row-column index (or cell, like in a spreadsheet, in case that helps you visualize it) indicate the intensities of each of the color channels (and transparency, in the case of 4-channel images) for each pixel.
Thus, after the image is loaded, it really is just an array of numbers and can be utilized and manipulated just like any other array. For example, in order to rotate the image, a linear transformation can be applied to the image matrix to "rotate" the pixel values within the matrix.
Here is an example of a single row in the array representing the image of Hobbes is shown.
Inference with Detectron2
After the image is loaded, we're ready to use Detectron2 to run inference on the image and find the mask of Hobbes. Running inference means generating predictions from the model. In the case of image segmentation, the model is making a prediction for each pixel, providing its best guess at what class of object each one belongs to, if any.
We create a Detectron2 config and instantiate a `DefaultPredictor`, which is then used to run inference.
Just a heads-up: the first time this runs, it will automatically attempt to start downloading the pre-trained weights — a ~180mb pickle file. That's a lot of pickles...
In addition to downloading and configuring the weights, the threshold is set for the minimum predicted probability. In other words, the model will only output a prediction if it is certain enough — the probability assigned to the prediction is above the threshold.
By default, the output of the model contains a number of results, including the predicted classes, coordinates for the bounding boxes (object detection), and mask arrays (image segmentation), along with various others, such as pose estimation (for people). More information on the types of predictions made by Detectron2 can be found in the documentation.
We are really only interested in the one mask outlining Mr. Hobbes here, though will also need to extract the IDs for the predicted classes in order to select the correct mask. If the image only has one type of object, then this part isn't really necessary. But when there are many different classes in a single image, it's important to be certain which object we are extracting.
First, let's take a look at how the model did.
Extracting the mask
The "dog" is the first id in the `.pred_classes` list. Therefore, the mask that we want is the first one in the `.pred_masks` tensor (array).
Those colored areas are the "masks", which can be extracted from the output of the model and used to manipulate the image in neat ways. First, we'll need to get the array holding the mask.
In this case, as can be seen below, each mask is a 2-dimensional array of Boolean values, each one representing a pixel. If a pixel has a "True" value, that means it is inside the mask, and vice-versa.
Converting to a binary mask
Now that we've run inference on the image and retrieved the mask array, it's time to turn that array into a binary mask. While I won't be using this particular binary mask directly, it can be downloaded as a and/or edited and used to various ends.
Using the binary mask to cut out Hobbes
In order to use NumPy operations between the mask and image, the dimensions of the mask must match the image. The image array currently has three values for each pixel, indicating the values of red, green, and blue (RGB) that the pixel should render. Therefore, the mask must also have three values for each pixel. To do this, I used a NumPy method called `np.stack` to basically "stack" three of the masks on top of one another.
However, the goal is to have the background of the image (everything but Hobbes) be transparent. This means that the image will ultimately have to have four channels — the three original ones, RGB, plus an alpha channel, A, for transparency.
Once the dimensions match, another NumPy method, `np.where`, can be used to copy or extract only the pixels contained within the area of the mask. I created a blank background (this time transparent instead of black) onto which those pixels can be copied.
The Roundabout Method
This is that "second" method for using the binary mask I talked about in the introduction.
The difference is that I added a fourth channel to the image after the original pixels were copied onto the background. The black pixels were then converted to full transparency. While this method works, I'm sure you can think of one big issue with it.
It took me too long to realize this, but by using a black background and the method below, which converts all black pixels to transparent, any pixels brought over from the original image that also happened to be black were converted to transparent.
That's why I decided to refactor into the method above.
However, I felt like I should leave it in anyways, as it still has some potentially useful code in it. For example, in the case when the image cannot be converted to four channels beforehand.
Now, this image can be saved (as a PNG to preserve the alpha channel/transparency) and simply overlayed onto another image. Or, even better, the image can be used directly (as it is now, in the form of an array), scaled, rotated, moved, then pasted on top of the other image.
At first I was going to use Photoshop to overlay Hobbes and make him look like a super dangerous sharkitty. But then I remembered the goal of this post, and decided to do it programmatically with Python.
The primary library I'll be using to manipulate images is Pillow.
I decided to rotate the image matrix directly using OpenCV, prior to loading it into Pillow. There is a `.rotate` method on Pillow's Image class as well, which could accomplish the same thing. Either way works — I just wanted to learn how to do it with OpenCV.
Once the image is rotated, it needs to be cropped and scaled appropriately. I decided to use Pillow for these transformations. For some reason that I have not looked into yet, OpenCV images are in BGR (blue, green, red) format instead of the virtually universal RGB format. Thus, in order to load the image from cv2 into Pillow without reversing the blues and reds, the color first must be converted to RGB.
Once converted, it can simply be loaded into Pillow as a `PIL.Image` object, which contains a suite of useful methods for transformation and more.
I manually defined the coordinates of the box used to crop the image by eyeballing it (shout out to Matplotlib for including tick marks on rendered images). A more automated method for doing this would be to extract the bounding box coordinates from the model output, and use that to crop the image before doing any other transformations. I will add this to a later iteration of this tutorial (if you're reading this, it likely hasn't been implemented yet).
Next, the cropped and rotated image must be resized in order to fill up the entire width of the background onto which it will be pasted. The reason for this is simply because the dimensions of the "paste" box must exactly match that of the "copy" box. It must be explicit — i.e. it's not like Photoshop where any of the image falling outside the boundaries of the "canvas" is cropped or simply left out.
Finally, with the rotated, cropped, and resized image of Hobbes all ready to go, we can paste it onto the background and save the result!
As with just about any process, there are always aspects that could be changed in an effort to improve efficiency or ease-of-use. I will keep adding to this list of potential improvements as I think or hear of them:
- Fully automated the crop by using the bounding box from the model output
There are a couple of additional adjacent projects to this that I'd like to work on at some point in the future, both to help others use the method outlined above, and to give me practice with various other aspects of the development lifecycle.
The two projects I have in mind are essentially the same thing, accomplished different ways. First, I'd like to integrate this method into a Python package and put it on PyPI so it can be installed with Pip and easily customized/used from the command line or in other scripts. Second, I want to build a simple web app and API that would allow anyone to upload an image, choose a class of object to extract (or remove), run the model, then download the image with the background (and/or other objects) removed.
As I work on these, I'll put links here.
Thanks for reading, and I'll see you in the next one!