There is a lot of mysticism around Facial Recognition. In films it is portrayed as an ominous, powerful, and complicated technology. Now that I have explored the field a little, I can say it is not that complicated. At first glance, Facial Recognition can be done fairly easily. The only difficult part of my implementation so far has been managing the massive amount of data in the training datasets I am using and will showcase here. Yes, this technology is powerful, but by demystifying and understanding it we take away its menace.

I'm gonna assume some basic knowledge of Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). If you want to learn about ANNs, or just want to brush up, here is a great book, link. If you would like to learn about CNNs I recommend this post, link; it goes over not only some of the common layers but also some common architectures. I won't go too deep into the theory, but I'll provide resources as I go for those who want to dive deeper.

I plan on writing three posts on this subject. The first covers detecting faces in images, the second covers the actual recognition, and the last covers putting the two together.

A personal note, you can ignore this if you like.

I haven't written much on this blog, but I have not forgotten it. Since January I've started work on a master's degree in computer science. This last semester has been particularly difficult but, on the other hand, I have learnt so much. I cannot wait to share some of this knowledge with you all!


The first step in Facial Recognition is detecting the face. To do this I employ a Faster R-CNN. Its paper describes a backbone of convolutional layers whose output is a feature map, followed by a Region Proposal Network, ROI pooling, and classification. For more detail you can visit the link above to the paper that introduced it, or this link to a really well-made description and implementation using Keras.

Source: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (link)

As a backbone I used MobileNet V2, and I implemented everything in Python with the PyTorch machine learning library. Fortunately, there is a prebuilt, modular Faster R-CNN model in PyTorch's companion library, TorchVision.

For training and validation data I used the WIDER FACE dataset, a dataset of faces annotated with bounding boxes. You can download it from the website linked in the previous sentence. The dataset's annotation style is not one of the standard styles found in other datasets but, fortunately, someone has already converted it to the style I decided to use: PASCAL VOC, an XML-based format. Because the annotations are plain XML, we can use a prebuilt XML parser. Here is the github link.
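For readers who have not seen PASCAL VOC annotations before, here is a minimal sketch of what one roughly looks like and how Python's standard-library parser reads it. The XML below is a made-up example (the file name and coordinates are invented for illustration), not a file from the converted dataset, and `read_boxes` is a name of my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, invented for illustration only.
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>1024</width><height>768</height><depth>3</depth></size>
  <object>
    <name>face</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>200</xmax><ymax>180</ymax></bndbox>
  </object>
  <object>
    <name>face</name>
    <bndbox><xmin>400</xmin><ymin>300</ymin><xmax>460</xmax><ymax>380</ymax></bndbox>
  </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return a list of [xmin, ymin, xmax, ymax] boxes from a VOC-style XML string."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall('object'):
        bndbox = obj.find('bndbox')
        boxes.append([float(bndbox.find(tag).text)
                      for tag in ('xmin', 'ymin', 'xmax', 'ymax')])
    return boxes


boxes = read_boxes(SAMPLE_XML)
print(boxes)
```

Each face is one object element with a bndbox child, so a file with no faces simply has no object elements.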


I intend to keep the model as prebuilt as possible, to reduce the amount of code and to emphasize how easy this can be.


I recommend using conda to set up an environment. Once you have conda installed, you can run the following in your terminal to create the environment with some of the required packages. The platform I am using is Ubuntu, so be aware of that and modify your commands accordingly. The version of GCC I am using is 8.3.0, my CUDA version is 10.2, and my cuDNN version is 7.6.5.

conda create --name FacialRecognition python=3.7 numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing git pillow matplotlib pandas scipy scikit-learn scikit-image tqdm joblib
conda activate FacialRecognition

We require some features and bug fixes that are currently only in the master branch of the PyTorch and TorchVision GitHub repositories, and possibly in their nightly builds. I just went ahead and built both from their master branches. It may also work with the nightly version of PyTorch.

git clone --recursive
git clone
cd pytorch
python setup.py install
cd ../vision
python setup.py install

For more detail on how to build PyTorch here is a link, and for TorchVision, link. On Windows, be sure to use the v14.22 tool-chain instead of the latest, which at the time I am writing this is v14.24. The newer versions tend to fail with a weird bug.


Creating the model with PyTorch is relatively easy since the TorchVision project provides the FasterRCNN class, which allows for a modular construction of the Faster R-CNN. First we are gonna need some imports:

import sys
import math
import torch
import argparse
import torch.onnx
import torchvision
import matplotlib.pyplot as plt
from tqdm import tqdm
from torch.optim.adamw import AdamW
from torch.utils.data import DataLoader
from datasets import WiderFaceDataset, collate_fn
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

Just to show how easy it is to create a Faster R-CNN with a MobileNet V2 backbone, some of the following code is straight from the PyTorch website.

def fasterrcnn(min_size=224):
    backbone = torchvision.models.mobilenet_v2(pretrained=True).features
    backbone.out_channels = 1280
    anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),))
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(
        featmap_names=['0'], output_size=7, sampling_ratio=2)
    model = FasterRCNN(backbone, num_classes=2, rpn_anchor_generator=anchor_generator,
                       box_roi_pool=roi_pooler, min_size=min_size)
    return model

The call to mobilenet_v2 produces the full model, but since we only need the backbone we take its features attribute. By backbone I mean the network without the classifier at the end of the model. I believe that with MobileNet V2 the backbone just drops the last dense and softmax layers, but other models might be different. We set out_channels to 1280, which the paper gives as the number of output channels before the final classification layers. We then create the AnchorGenerator and MultiScaleRoIAlign objects, which control different aspects of our Faster R-CNN. Finally, everything is put together with a call to the FasterRCNN constructor.
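To get a feel for what the AnchorGenerator settings imply, here is a rough back-of-the-envelope sketch in plain Python. It assumes TorchVision's default aspect ratios of (0.5, 1.0, 2.0), a total downsampling factor of 32 for the MobileNet V2 feature extractor, and a 224 by 224 input. The helper names are my own and are not part of TorchVision, and the rounding is only approximate compared with what the library does internally.

```python
import math

# Values taken from the model definition in this post.
ANCHOR_SIZES = (32, 64, 128, 256, 512)
# Assumption: TorchVision's default aspect ratios (height / width).
ASPECT_RATIOS = (0.5, 1.0, 2.0)
# Assumption: MobileNet V2's feature extractor downsamples by a factor of 32.
BACKBONE_STRIDE = 32


def anchor_shapes(sizes, ratios):
    """Return (width, height) pairs, one per size/ratio combination.

    Each anchor keeps roughly the area size**2 while its height/width
    ratio follows the requested aspect ratio.
    """
    shapes = []
    for ratio in ratios:
        for size in sizes:
            height = size * math.sqrt(ratio)
            width = size / math.sqrt(ratio)
            shapes.append((round(width), round(height)))
    return shapes


def anchors_per_image(image_size, stride, anchors_per_location):
    """Number of anchors laid over the final (square) feature map."""
    cells = math.ceil(image_size / stride)
    return cells * cells * anchors_per_location


shapes = anchor_shapes(ANCHOR_SIZES, ASPECT_RATIOS)
total = anchors_per_image(224, BACKBONE_STRIDE, len(shapes))
print(len(shapes), 'anchors per location,', total, 'anchors per image')
```

Under these assumptions the 256 and 512 anchor sizes are larger than a 224-pixel input, so smaller sizes may be worth trying if most faces in your images are small.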
We are not finished with that file yet, but let's create datasets.py, which will contain our Dataset-derived classes. In PyTorch, the standard way to pass data to your model while training is with a class called DataLoader, located in torch.utils.data. It allows for multi-process preprocessing of the data and automatic creation of batches, which speeds up training. The DataLoader takes in an object whose class inherits from Dataset. The documentation requires that we override the __getitem__ and __len__ functions: the first produces a single sample and the second returns the number of samples available. In the constructor we just do some parsing and preparation. In datasets.py, import:

import os
import torch
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET
from PIL import Image
from skimage import io
from os.path import isdir, isfile
from torchvision import transforms
from torch.utils.data import Dataset

Some of these imports won't be used right now but will be needed later, when we create the recognizer. First we need to create a function that collates our samples into batches.

def collate_fn(batch):
    return tuple(zip(*batch))

It's simple: it just regroups our batch into a tuple of images and a tuple of targets. Since we are using the WIDER FACE dataset I thought it pertinent to call our class WiderFaceDataset.
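A quick illustration of what that collate function does, using plain strings and dicts as stand-ins for image tensors and targets (the sample values are invented). As far as I understand it, the reason for not using the default collation is that it tries to stack samples into one tensor, which does not work when every image has a different number of boxes, while TorchVision's detection models accept a sequence of images and a sequence of target dicts.

```python
def collate_fn(batch):
    # Same one-liner as in the post: regroup per-sample pairs into per-field tuples.
    return tuple(zip(*batch))


# Stand-in samples: strings instead of tensors, so the regrouping is easy to see.
samples = [
    ('img_a', {'boxes': [[0, 0, 10, 10]]}),
    ('img_b', {'boxes': [[5, 5, 20, 20], [1, 2, 3, 4]]}),
]

images, targets = collate_fn(samples)
print(images)
print(len(targets))
```

The first element holds all the images and the second holds all the targets, in the same order.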

class WiderFaceDataset(Dataset):

The constructor first needs to check that the image directory and the annotation directory exist, and then store some information we may need later: the image directory, the annotation directory, the transformation to be used, and the model size. Then we go through our images and create a list of dictionaries that hold the image and annotation paths.

def __init__(self, image_dir, annotation_dir, model_size, transform=None):
    assert isdir(image_dir)
    assert isdir(annotation_dir)
    self.image_dir = image_dir
    self.annotation_dir = annotation_dir
    self.transform = transform
    self.model_size = model_size

    self.image_data = []
    for root, dirs, files in os.walk(image_dir):
        for name in files:
            full_path = os.path.join(root, name)
            filename, _ = os.path.splitext(name)
            annotation_path = os.path.join(annotation_dir, filename + '.xml')
            self.image_data.append({'image_path': full_path, 'annotation_path': annotation_path})
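The constructor pairs every image with an annotation path without checking that the annotation file is really there. If that worries you, a variation along the following lines could skip unpaired images. This is only a sketch: `pair_paths` is a hypothetical helper of my own, and the throwaway directory tree exists just to exercise it.

```python
import os
import tempfile


def pair_paths(image_dir, annotation_dir):
    """Pair every image under image_dir with '<stem>.xml' in annotation_dir.

    Images without a matching annotation file are skipped.
    """
    pairs = []
    for root, _dirs, files in os.walk(image_dir):
        for name in sorted(files):
            stem, _ext = os.path.splitext(name)
            annotation_path = os.path.join(annotation_dir, stem + '.xml')
            if os.path.isfile(annotation_path):
                pairs.append({'image_path': os.path.join(root, name),
                              'annotation_path': annotation_path})
    return pairs


# Tiny throwaway directory tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    image_dir = os.path.join(tmp, 'images', 'event_0')
    annotation_dir = os.path.join(tmp, 'annotations')
    os.makedirs(image_dir)
    os.makedirs(annotation_dir)
    for name in ('a.jpg', 'b.jpg'):
        open(os.path.join(image_dir, name), 'w').close()
    open(os.path.join(annotation_dir, 'a.xml'), 'w').close()

    pairs = pair_paths(os.path.join(tmp, 'images'), annotation_dir)
    print(len(pairs))
```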

The __getitem__ function takes in the index of a sample and produces that sample. The first step in this function is to get the paths for the image and the annotation, and then load the image.

def __getitem__(self, idx):
    path_data = self.image_data[idx]
    img_path = path_data['image_path']
    ann_path = path_data['annotation_path']
    img = Image.open(img_path).convert('RGB')

We will also need the size of the image.

prwidth, prheight = img.size

Then we need to resize the image to the input size of the model. The model itself will resize the image, but it's better to resize it in the dataset object: less data is transferred between the CPU and GPU, and the resizing can be spread across multiple worker processes. We also need to scale the image's values to between 0 and 1. Since the pixel values are between 0 and 255, we simply divide by 255.

# PIL yields (height, width, channel); move the channel axis first, as PyTorch expects
img = np.array(img.resize(self.model_size)).transpose(2, 0, 1)
img = img / 255.0
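One detail worth spelling out: PIL hands back pixels in height, width, channel order, while PyTorch models expect channel, height, width. Getting from one to the other is an axis move (a transpose), which is not the same thing as re-reading the same flat buffer under a new shape. The sketch below uses plain nested lists so it runs without any dependencies; `hwc_to_chw` is a name I made up for illustration. With NumPy the same move would be `arr.transpose(2, 0, 1)`.

```python
def hwc_to_chw(pixels):
    """Move the channel axis first: [H][W][C] nested lists -> [C][H][W]."""
    height = len(pixels)
    width = len(pixels[0])
    channels = len(pixels[0][0])
    return [[[pixels[y][x][c] for x in range(width)]
             for y in range(height)]
            for c in range(channels)]


# A 2x2 RGB image where every pixel is pure red: (255, 0, 0).
red = [[[255, 0, 0], [255, 0, 0]],
       [[255, 0, 0], [255, 0, 0]]]

chw = hwc_to_chw(red)
# Scale to the 0..1 range by dividing by 255.
scaled = [[[v / 255.0 for v in row] for row in plane] for plane in chw]
print(scaled[0])  # red plane
print(scaled[1])  # green plane
```

After the move, the first plane contains only the red values and the other two planes are all zeros, which is what a channel-first red image should look like.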

Here, in the same function, is where we parse the XML of our PASCAL VOC style annotations. As I parse the XML I scale the bounding boxes to the resized image's scale and add them to a list.

boxes = []
tree = ET.parse(ann_path)
root = tree.getroot()
for face in root.findall('object'):
    bndbox = face.find('bndbox')
    # model_size is (width, height): x scales with the width, y with the height
    xmin = self.model_size[0] * (float(bndbox.find('xmin').text) / prwidth)
    ymin = self.model_size[1] * (float(bndbox.find('ymin').text) / prheight)
    xmax = self.model_size[0] * (float(bndbox.find('xmax').text) / prwidth)
    ymax = self.model_size[1] * (float(bndbox.find('ymax').text) / prheight)
    boxes.append([xmin, ymin, xmax, ymax])
n_faces = len(boxes)
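The box scaling is easier to check in isolation. Here is a small standalone sketch of the same arithmetic; `scale_box` is a helper name of my own, the sizes are (width, height) tuples as in PIL, and the numbers are made up.

```python
def scale_box(box, original_size, model_size):
    """Rescale [xmin, ymin, xmax, ymax] from the original image to the resized one.

    Both sizes are (width, height) tuples, matching PIL's convention.
    """
    orig_w, orig_h = original_size
    new_w, new_h = model_size
    xmin, ymin, xmax, ymax = box
    return [xmin * new_w / orig_w, ymin * new_h / orig_h,
            xmax * new_w / orig_w, ymax * new_h / orig_h]


# Made-up numbers: a 1024x768 photo squeezed into a 224x224 model input.
scaled = scale_box([512, 384, 1024, 768], (1024, 768), (224, 224))
print(scaled)
```

A box covering the bottom-right quarter of the photo ends up covering the bottom-right quarter of the resized image, as expected.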

We need to make sure that there are bounding boxes in the image, and do something if there are no faces in it. Here we simply move on to the next sample.

if n_faces == 0:
    # Wrap around so the last index does not run past the end of the dataset
    return self.__getitem__((idx + 1) % len(self))

Next, we convert our data to the format the model needs.

boxes = torch.as_tensor(boxes, dtype=torch.float32)
labels = torch.ones((n_faces,), dtype=torch.int64)
img = torch.as_tensor(img, dtype=torch.float32)
target = {
    'boxes': boxes,
    'labels': labels
}

Apply the transformation if one is set:

if self.transform is not None:
    img, target = self.transform(img, target)

and return:

return img, target

For the __len__ function we simply return the length of the image_data object stored in the class since that is the number of samples we have.

def __len__(self):
    return len(self.image_data)

Back to the file, let's start to create our main function by taking in some command line arguments and setting them equal to some variables.

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-b', '--batch_size', default=10, type=int,
                        help='The Batch Size')
    parser.add_argument('-i', '--image_size', default=224, type=int,
                        help='The image size and the input size of the model')
    parser.add_argument('-e', '--epochs', default=40, type=int,
                        help='The number of epochs to train the model')
    parser.add_argument('-w', '--num_workers', default=4, type=int,
                        help='The number of workers/threads to use when training model')
    parser.add_argument('-l', '--learning_rate', default=0.0001, type=float,
                        help='The training learning rate')
    parser.add_argument('-m', '--model', default='FasterRCNN_MobileNetv2_WIDERFACE.onnx',
                        help='Where to save the ONNX model')
    parser.add_argument('--image_path', default='/home/aherrera/Downloads/WIDER_train/images',
                        help='Path to training images')
    parser.add_argument('--valid_path', default='/home/aherrera/Downloads/WIDER_val/images',
                        help='Path to validation images')
    parser.add_argument('--train_ann_dir', default='/home/aherrera/Documents/WIDER-to-VOC-annotations/WIDER_train_annotations',
                        help='Path to training annotation files')
    parser.add_argument('--valid_ann_dir', default='/home/aherrera/Documents/WIDER-to-VOC-annotations/WIDER_val_annotations',
                        help='Path to validation annotation files')
    parser.add_argument('--fig_path', default='FasterRCNN_valid_loss.png',
                        help='Where to save validation loss graph')
    args = parser.parse_args()

    # Params
    batch_size = args.batch_size
    image_size = args.image_size
    num_epochs = args.epochs
    num_workers = args.num_workers
    learning_rate = args.learning_rate
    images_dir = args.image_path
    valid_dir = args.valid_path
    ann_dir = args.train_ann_dir
    valid_ann_dir = args.valid_ann_dir

You can change the defaults here to suit your needs or set them using arguments. Then let's check whether a CUDA graphics card is available.

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

Create our datasets and dataloaders:

# Data stuff
dataset = WiderFaceDataset(images_dir, ann_dir, (image_size, image_size))
valid_dataset = WiderFaceDataset(valid_dir, valid_ann_dir, (image_size, image_size))
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers,
                        collate_fn=collate_fn)
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers,
                              collate_fn=collate_fn)

Here is the documentation for the Dataset and DataLoader.
Let's create our model and send it over to the device.

# Create model
model = fasterrcnn(min_size=image_size).to(device)

For our optimizer I prefer to use AdamW with the amsgrad option; you can see why in this nicely put together blog post. I left the weight decay on its default. To improve training, I also used a scheduler that reduces the learning rate whenever training has stalled.

# Create optimizer
optim = AdamW(model.parameters(), lr=learning_rate, amsgrad=True)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optim)
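If the idea behind a reduce-on-plateau scheduler is new to you, the following toy version in plain Python may help. It is a simplified stand-in of my own, not PyTorch's implementation: it only mirrors the basic rule of cutting the learning rate by a factor once the monitored loss has stopped improving for more than `patience` epochs. The defaults of 0.1 and 10 are what I understand ReduceLROnPlateau's own defaults to be. Keep in mind that the real scheduler only acts when you call its step method with the metric to monitor, once per epoch.

```python
class PlateauTracker:
    """Toy stand-in for reduce-on-plateau scheduling (not PyTorch's code).

    Cuts the learning rate by `factor` once the monitored loss has failed
    to improve for more than `patience` consecutive epochs.
    """

    def __init__(self, lr, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# Made-up loss curve that stalls at 0.8 for a few epochs.
tracker = PlateauTracker(lr=1e-4, patience=2)
history = [tracker.step(loss) for loss in (1.0, 0.8, 0.8, 0.8, 0.8, 0.7)]
print(history)
```

With a patience of 2, the learning rate stays put through the first two stalled epochs and drops on the third.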

To train on the GPU, the batches have to be moved over to the GPU, or whatever device you're using. We can create a convenience function to do just this.

def to_device(batch, device):
    images = [image.to(device) for image in batch[0]]
    targets = [{k: v.to(device) for k, v in target.items()} for target in batch[1]]
    return images, targets

Now that we have everything set up we can start training. We are gonna keep track of the validation losses, then create a tqdm iterator for the epoch loop.

valid_losses = []
itr1 = tqdm(range(num_epochs), unit='epoch')
for _ in itr1:

In our first loop within the epoch loop we train the network. The second loop passes the validation data through the model, displays the validation loss, and saves it for plotting.

loss_sum = 0.0
itr2 = tqdm(enumerate(dataloader, 0), unit='step', total=len(dataloader))
for i, batch in itr2:
    images, targets = to_device(batch, device)
    loss_dict = model(images, targets)

    losses = sum(loss for loss in loss_dict.values())
    loss_value = losses.item()

    if not math.isfinite(loss_value):
        print("Loss is {}, stopping training".format(loss_value))
        sys.exit(1)

    # Backpropagate, then let the optimizer update the weights
    optim.zero_grad()
    losses.backward()
    optim.step()

    itr2.set_description(f'loss={round(loss_value, 2)}')

    loss_sum += loss_value
epoch_loss = loss_sum / len(dataloader)
itr1.set_description(f'loss={round(epoch_loss, 2)}')

The model outputs a dictionary of losses, which we sum up before performing backward propagation to calculate the gradients. Then the optimizer modifies the weights. The validation loop runs with the recording of operations for backpropagation turned off, since it needs neither backprop nor the optimizer.

with torch.no_grad():
    loss_sum = 0.0
    for valid_batch in tqdm(valid_dataloader, unit='step', total=len(valid_dataloader)):
        images, targets = to_device(valid_batch, device)
        loss_dict = model(images, targets)

        losses = sum(loss for loss in loss_dict.values())
        loss_value = losses.item()
        loss_sum += loss_value
    loss_sum /= len(valid_dataloader)
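    print(f'Validation loss: {loss_sum}')
    # Keep the loss for plotting later
    valid_losses.append(loss_sum)
# Assumption: the plateau scheduler is driven by the validation loss
scheduler.step(loss_sum)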
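To make the loss handling in both loops a bit more concrete, here is a small sketch using plain floats instead of tensors. The keys are the ones TorchVision's Faster R-CNN returns in training mode as far as I know; the numbers are invented, and `total_loss` is my own helper name. It raises an exception on a non-finite loss instead of printing, which is just one possible way to stop.

```python
import math

# Hypothetical loss values; only the key names come from TorchVision.
loss_dict = {
    'loss_classifier': 0.42,
    'loss_box_reg': 0.17,
    'loss_objectness': 0.08,
    'loss_rpn_box_reg': 0.03,
}


def total_loss(losses):
    """Sum the individual loss terms and refuse to continue on NaN/inf."""
    value = sum(losses.values())
    if not math.isfinite(value):
        raise FloatingPointError(f'Loss is {value}, stopping training')
    return value


print(round(total_loss(loss_dict), 2))
```

The first two terms come from the detection head and the last two from the Region Proposal Network, so watching them separately can hint at which part of the model is struggling.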

Finally, we save the model.

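model = model.eval()
# args.model defaults to a .onnx name; pass -m with a .pth path to match the conversion script below
torch.save(model, args.model)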

With this we have reached the end of this first part. After training there should be a saved torch model that we will use later, when we put the detector and recognizer together. I wrote a quick script to convert the model over to TorchScript and ONNX.

import torch
import torch.jit
import torch.onnx

device = torch.device('cuda:0')
model = torch.load('FasterRCNN_MobileNetv2_WIDERFACE.pth')

dummy_data = torch.rand(1, 3, 224, 224, device=device)

# Save as torch script
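sm = torch.jit.script(model)
# Placeholder output name; the original file name is missing from this post
sm.save('FasterRCNN_MobileNetv2_WIDERFACE.pt')

# Save as ONNX model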
torch.onnx.export(model, dummy_data, 'FasterRCNN_MobileNetv2_WIDERFACE.onnx', opset_version=11)

If you would like an already trained model click here. If you would like to take a look at the full source code click here.
These links might currently be broken, so I'll update them in a bit.