In cognitive neuroscience, according to the two-stream hypothesis, visual recognition happens mainly within the dorsal and ventral streams, both of which extend out from the visual cortex at the back of the brain.

The ventral stream takes care of working out the "what" of what we are looking at. Observations indicate the existence of specialization within the ventral stream. For instance, one area of the brain lights up more than the rest when we recognize a face rather than an object, and another area does the inverse. The brain loads what it is seeing into its short-term memory, which in turn triggers the long-term memory.

Anyway, the show "Person of Interest" is about an AI supercomputer that helps its creator and his partner, an ex-spy, stop crimes before they happen. In the show, the AI is fed video surveillance; it then detects what is in the footage and deduces future crimes from the most convoluted connections. Of course, we are not there yet technologically, but we can, in fact, teach our computers to recognize objects.

OpenCV & Haar Cascades

In the 2001 paper "Rapid object detection using a boosted cascade of simple features" by Viola and Jones (link), the method used by OpenCV to detect objects was first presented. The method, reminiscent of memoization, pre-computes an "integral image": a matrix the size of the image in which each element is the sum of all the image elements above and to the left of (and including) that element. So,

$$(I)_ {i,j}=\sum_{k=0}^i \sum_{m=0}^j (A)_{k,m}$$

Where $I$ is the integral image matrix and $A$ is the image matrix. The integral image allows for quick calculation of sums over areas of the image. Say we have to calculate the sum over a rectangular area of the image with corner points $A=(a_1,a_2)$ (top left), $B=(b_1,b_2)$ (top right), $C=(c_1,c_2)$ (bottom right), and $D=(d_1,d_2)$ (bottom left). Instead of summing

$$ s= \sum_{i=a_1}^{c_1} \sum_{j=a_2}^{c_2} (A)_{i,j} $$

We can use the integral image

$$s=(I)_ {c_1,c_2} + (I)_ {a_1,a_2} - ((I)_ {b_1,b_2} + (I)_ {d_1,d_2})$$

Features

Now, a Haar-like feature is the difference between the sums over a set of rectangular areas of the image. Below, the sum over the black rectangular area is subtracted from the sum over the white rectangular area.

For instance, we can find an edge feature using an integral image where the white rectangle has the corners $A=(a_1,a_2),B=(b_1,b_2),C=(c_1,c_2),D=(d_1,d_2)$ and the black rectangle the corners $C=(c_1,c_2),D=(d_1,d_2),E=(e_1,e_2),F=(f_1,f_2)$ (the two rectangles share the edge $CD$). The feature is calculated through the integral image by

\begin{align}
s=&((I)_ {c_1,c_2} + (I)_ {a_1,a_2} - ((I)_ {b_1,b_2} + (I)_ {d_1,d_2})) \\
&-((I)_ {e_1,e_2} + (I)_ {c_1,c_2} - ((I)_ {d_1,d_2} + (I)_ {f_1,f_2})) \\
s=&(I)_ {a_1,a_2}-(I)_ {b_1,b_2}-(I)_ {e_1,e_2}+(I)_ {f_1,f_2}
\end{align}

A training program then has to find the features that let us detect and tell objects apart. The number of possible features grows very rapidly with the size of the image, so it is best to train on small images.

OpenCV

Fortunately for us, the good devs working on OpenCV have already done most of the work. The library contains a module to use the cascades and a program to train them.

Obtaining Data

There are databases of images of cars but, in my opinion, most are lacking or out of date. So, I set out to create my own. To that end, I found an already trained Haar cascade file for OpenCV on GitHub (link). It only works some of the time, but I do think it is good enough to use to build a car image database.

OpenCV's VideoCapture object allows for the import of most images, videos, and streams. We then loop over the output of the VideoCapture: each iteration yields a Mat object representing the image at that point in time (for streams and videos, the next frame). We then need to convert the frame to grayscale for input to the cascade, since running cascades over color images is far more computationally expensive. Here is the C++ source code.

We need the following includes:

#include "opencv2/objdetect.hpp"
#include "opencv2/highgui.hpp"
#include "opencv2/imgproc.hpp"
#include <iostream>
using namespace std;
using namespace cv;

First, we need a function that takes a frame and a cascade as input and does some of the work.

void detect(const Mat& frame, CascadeClassifier& cascade) {
  Mat frame_gray;
  cvtColor(frame, frame_gray, COLOR_BGR2GRAY); // cascade works on grayscale
  equalizeHist(frame_gray, frame_gray);
  
  static int i = 0; // persists across calls so filenames don't collide
  std::vector<Rect> detectedRect;
  cascade.detectMultiScale(frame_gray, detectedRect);
  for (const auto& det : detectedRect) {
    Mat toSave = frame(det); // crop the detected region
    String toSavePath = "data/car_image_" + to_string(i++) + ".jpg";
    imwrite(toSavePath, toSave); // write image to disk
    cout << "Detected: written to - " << toSavePath << endl;
  }
}

Here is the main function:

int main() {
  const String cascade_name("cars.xml");
  CascadeClassifier cascade;
  
  if (!cascade.load(cascade_name)) {
    cout << "Error loading cars cascade\n";
    return -1;
  }
  
  VideoCapture capture;
  capture.open("http://50.73.9.194:80/mjpg/video.mjpg");
  if (!capture.isOpened()) {
    cout << "Error opening video capture\n";
    return -1;
  }
  Mat frame;
  while (capture.read(frame)) {
    if (frame.empty()) {
      cout << "No more frames!\n";
      break;
    }
    detect(frame, cascade);
    if (waitKey(10) == 27)
      break;
  }
  return 0;
}

The address, 50.73.9.194:80/mjpg/video.mjpg, is a great freely available traffic webcam that I used to obtain the images of cars.
After running the program for several hours, I had about 25,000 images of cars. I went through the images and separated the false positives into a separate folder called negatives. I later used these to train the cascade.

To find more negatives, I acquired several databases of general images that didn't include cars. All in all, I had 24,861 positive images and 3,726 negative images.

Sample Preparation

Now the obtained positive images need to be converted to grayscale. Going through each of the 24,861 samples and doing this manually would be extremely tedious, but with OpenCV we can automate it. We also need to get the size of each image, both positive and negative, and create a text list of them. Here is the source code to do just that.

The program uses an experimental part of the standard library, filesystem; depending on your compiler and its version you may need to change the source slightly.

#include <iostream> // cout
#include <fstream>  // ofstream
#include <string>
#include <experimental/filesystem> // use <filesystem> if your compiler
                                   // supports it
#include <opencv2/opencv.hpp>
#include <opencv2/imgproc.hpp>

namespace fs = std::experimental::filesystem;
// namespace fs = std::filesystem; // if your compiler supports it

int main(int argc, char *argv[]) {
  if (argc < 2) {
    std::cout << "Usage: ImageInfoGetter <from_dir> [<to_dir>]\n";
    return -1;
  }
  std::string path(argv[1]);
  if (argc == 3) {
    std::string to_path(argv[2]);
    std::ofstream myfile("info.dat");
    for (const auto& entry : fs::directory_iterator(path)) {
      // Convert each positive sample to grayscale and save a copy
      cv::Mat img = cv::imread(entry.path().string(), cv::IMREAD_GRAYSCALE);
      std::string full_to_path = to_path + entry.path().filename().string();
      cv::imwrite(full_to_path, img);
      // info.dat line format: <path> <object count> <x> <y> <width> <height>;
      // each image is a full cropped sample, so the object fills the image
      myfile << full_to_path << " 1 0 0 " << img.cols <<
                " " << img.rows << std::endl;
    }
    myfile.close();
  } else if (argc == 2) {
    std::ofstream myfile("bg.txt");
    for (const auto& entry : fs::directory_iterator(path))
      myfile << entry.path().string() << std::endl;
    myfile.close();
  }
}

To run it on the positive images I used the command:

./ImageInfoGetter data/ data-bk/

And on the negative images:

./ImageInfoGetter ng_images/

Training

Now that our data is ready, we use a tool provided by OpenCV, opencv_createsamples, to convert the positive images into the required binary vector file.

opencv_createsamples -vec vector.vec -bg bg.txt -info info.dat -w 40 -h 40 -num 24861

Again, OpenCV provides a tool for training, opencv_traincascade. To train:

opencv_traincascade -data cascade_data/ -vec vector.vec -bg bg.txt -numPos 2487 -numNeg 3726 -numStages 20 -w 40 -h 40 -mode all

For clarification on what the parameters mean, here is a link. Note that -numPos is set well below the 24,861 samples in the vector file: opencv_traincascade consumes additional positive samples at each stage, so -numPos must leave some headroom.

Using the cascade

The implementation is slightly different from the one above: instead of saving the frame, we draw a rectangle around each detected object and display the result.

void detect(Mat& frame, CascadeClassifier& cascade) {
  Mat frame_gray;
  cvtColor(frame, frame_gray, COLOR_BGR2GRAY);
  equalizeHist(frame_gray, frame_gray);
  
  std::vector<Rect> detectedRect;
  cascade.detectMultiScale(frame_gray, detectedRect);
  for (const auto& det : detectedRect)
    rectangle(frame, det, Scalar(255, 255, 0), 2); // draw detection box
  imshow("Display", frame);
}

I went ahead and recorded the result to a GIF:

SOURCE CODE

OpenCV Install Info

Ubuntu install guide

Windows install guide

OpenCV Documentation

A great tutorial on training cascades