Always Watching - Raspberry Pi Face Tracker (Part 2)

In my previous post, I built a camera that can track faces around a room using a Raspberry Pi, a face detection model, and some control theory. This post builds on that work by teaching the robot to recognise specific faces and react uniquely to them.

Problem Re-Definition

The problem of teaching a machine to recognise specific faces is called 'facial recognition'. It's worth reminding the reader of a key distinction between two concepts that sound similar but solve very different problems:

  • Facial detection is a method or algorithm that locates and extracts a face from an image.
  • Facial recognition is a method or algorithm that compares faces with each other to determine whether they belong to the same person.

In real-world applications, both techniques are often needed to identify people from raw images. The general process is shown in the following diagram.

Typical Real World Process


In the previous post, I solved the face detection part of the problem using a Haar Cascade classifier, so here I just need to tackle the facial recognition part.
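
For reference, the detection step is what produces the face crops that the recogniser will later consume. The snippet below is a minimal sketch rather than the exact code from the previous post; the cascade file and the detectMultiScale parameters are illustrative assumptions.

	# minimal sketch of the detection step, assuming OpenCV's bundled frontal-face cascade
	import cv2

	cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
	face_cascade = cv2.CascadeClassifier(cascade_path)

	def extract_face_crops(frame):
		# detection works on grayscale images
		gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
		# scaleFactor and minNeighbors are illustrative values, not carefully tuned ones
		boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
		# return the grayscale crops, ready to be passed to a recogniser
		return [gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]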

Facial Recognition Techniques

In the paper Deep Face Recognition: A Survey, published in 2018 and last revised in 2020, there is an excellent summary of progress in the field of facial recognition since the 1980s, describing the journey from methods based on holistic features, through local feature extraction methods, and finally to shallow and deep learning methods, which are considered state of the art as of 2020. This broad trend is visualised in the chart below, which plots modelling performance against time.

Progress in Face Recognition


Traditional Approaches

In the literature, these methods are often broken down into 'traditional' and 'non-traditional'. Within what is considered traditional, there are three main categories of algorithms:

  • holistic features (whole image as a feature)
  • local features (portions of images as features)
  • hybrid of holistic and local

A holistic method is one in which comparisons between images are made using the whole image as input data, whereas local features are specific subsections of the image that convey meaning, such as the eyes, nose, and other facial features.

Breakdown of Traditional Approaches (source)


My Approach (LBPH Classification)

For the context of this project, it is important to note that there is a correlation between performance and resource requirements, and this becomes more pronounced as we move into the neural network era of the first chart. Since we will be performing inference, and potentially training, on a Raspberry Pi, I limit my exploration to the less resource-intensive classical approaches. Even though this decision will likely cost some accuracy, my database of faces is unlikely to ever exceed 5 people and the final product has no critical use, so an advanced, high-performance model would be unnecessary.

I chose to try Local Binary Patterns Histograms (LBPH) feature extraction, primarily because of the limited computing power needed to train and infer, and because the method is more resilient to varying lighting conditions than alternatives such as Eigenfaces. Given a grayscale image, the feature extraction process can be described by the following steps:

  1. (local): Pass a sliding window across the image (I describe the process of sliding windows in a previous post)
  2. (binary): Each pixel in the window is set to 0 or 1: 0 if its value is smaller than the central pixel's value, 1 otherwise. The central value is left untouched.
  3. The binary values are read clockwise around the window and concatenated into a single binary number, each position contributing its power of two. The central pixel is then replaced with the decimal representation of this number. See figure 1 below for an example of this transform.
  4. (histogram): The resulting image is then split into a grid pattern and a histogram is created from each of the cells.
  5. The histograms are then concatenated into a single vector. This vector is the final extracted LBPH feature of the image. See figure 2 for histogram creation, and the code sketch after the figures for a minimal implementation of steps 2–5.

Figure 1: Binarisation (source)


Figure 2: Histogram Creation (source)
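
To make steps 2–5 concrete, here is a minimal, unoptimised sketch of the transform on a NumPy array. It skips border pixels and uses an 8x8 grid with 256-bin histograms; these are simplifying assumptions for illustration rather than a faithful copy of OpenCV's internals.

	# minimal, unoptimised sketch of LBPH feature extraction (steps 2-5 above)
	import numpy as np

	def lbp_image(gray):
		h, w = gray.shape
		out = np.zeros_like(gray)
		# the 8 neighbours in clockwise order, starting from the top-left
		offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
		for y in range(1, h - 1):  # border pixels are skipped for simplicity
			for x in range(1, w - 1):
				centre = gray[y, x]
				code = 0
				for dy, dx in offsets:
					# each neighbour contributes one bit: 1 if >= centre, 0 otherwise
					code = (code << 1) | (1 if gray[y + dy, x + dx] >= centre else 0)
				out[y, x] = code  # decimal representation of the binary pattern
		return out

	def lbph_feature(gray, grid=8):
		lbp = lbp_image(gray)
		h, w = lbp.shape
		hists = []
		# split the LBP image into a grid and histogram each cell
		for gy in range(grid):
			for gx in range(grid):
				ys, ye = gy * h // grid, (gy + 1) * h // grid
				xs, xe = gx * w // grid, (gx + 1) * w // grid
				hist, _ = np.histogram(lbp[ys:ye, xs:xe], bins=256, range=(0, 256))
				hists.append(hist)
		# the concatenated histograms form the final LBPH feature vector
		return np.concatenate(hists)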


At the inference stage, the extracted feature vector is compared to the feature vectors extracted at the training stage and the closest match is returned. In the case of OpenCV's implementation of LBPH, the default metric for similarity matching is the alternative Chi-Square distance, where a lower value means a closer match.
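
For reference, the alternative Chi-Square distance between two histograms H1 and H2 is 2 * sum over bins i of (H1_i - H2_i)^2 / (H1_i + H2_i). A small self-contained sketch of this comparison (OpenCV exposes the same metric as cv2.HISTCMP_CHISQR_ALT in compareHist) could look like this:

	# alternative Chi-Square distance between two feature histograms; lower is a closer match
	import numpy as np

	def chi_square_alt(hist_a, hist_b):
		a = np.asarray(hist_a, dtype=np.float64)
		b = np.asarray(hist_b, dtype=np.float64)
		denom = a + b
		denom[denom == 0] = 1.0  # avoid division by zero for bins that are empty in both
		return 2.0 * np.sum((a - b) ** 2 / denom)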

Training

The training process requires a directory of images of faces, which I created by taking pictures of my face and a couple of friends' faces at various angles on my Raspberry Pi and using the Haar Cascade classification method to extract the faces from the background. A script then loops through this directory, loads the faces into memory and trains an LBPH model on them. The learned model is stored as a YAML file on a local drive, alongside a JSON lookup that maps label indices back to names. Since the training is not too intensive, I am able to perform this whole process on the Pi, removing the need to move files between the Pi and my laptop. (The snippet below is a method from a larger class and assumes that os, json, numpy, cv2 and PIL's Image are imported at module level, and that the DATA_DIRECTORY and MODELS_DIRECTORY constants are defined elsewhere.)

	@staticmethod
	def train_lbph_face_recogniser():
		print('Training recogniser.')
		people = {}
		faces = []
		labels = []
		faces_directory = os.path.join(DATA_DIRECTORY, 'faces')
		for i, person_folder in enumerate(os.listdir(faces_directory)):
			if person_folder == '.DS_Store':
				continue

			# store person to index lookup
			people[person_folder] = i
			
			# load images into array
			person_directory = os.path.join(faces_directory, person_folder)
			for image_name in os.listdir(person_directory):
				if image_name == '.DS_Store':
					continue
				image_path = os.path.join(person_directory, image_name)
				img = Image.open(image_path).convert('L') # convert it to grayscale
				img_array = np.array(img, 'uint8')
				faces.append(img_array)
				labels.append(people[person_folder])

		# initialise and train model
		recognizer = cv2.face.LBPHFaceRecognizer_create()
		recognizer.train(faces, np.array(labels))
		
		# save the trained model as a YAML file
		model_path = os.path.join(MODELS_DIRECTORY, 'custom-opencv.yml')
		recognizer.write(model_path) # recognizer.save() worked on Mac, but not on Pi
		
		# save label lookup
		model_path = os.path.join(MODELS_DIRECTORY, 'custom-opencv-lookup.json')
		with open(model_path, 'w') as fp:
			json.dump(people, fp)

		print('Recogniser trained and saved.')
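
For reference, the script above expects the face crops to be organised into one sub-folder per person under DATA_DIRECTORY/faces; in the layout below, data/ stands in for DATA_DIRECTORY, and the folder and file names are just placeholders:

	data/
		faces/
			person_a/
				face_01.jpg
				face_02.jpg
			person_b/
				face_01.jpg
				face_02.jpg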

Inference

Model inference is fairly straightforward too and is performed with the code below. The method takes a grayscale face, stored as a NumPy array (the format OpenCV uses for images), and calls the recogniser's predict method on it. The function returns the closest matching person in the database and a confidence score; note that the score is actually a distance, so lower values indicate a closer match.

	@staticmethod
	def infer_lbph_face_recogniser(face):
		# load model
		recognizer = cv2.face.LBPHFaceRecognizer_create()
		model_path = os.path.join(MODELS_DIRECTORY, 'custom-opencv.yml')
		recognizer.read(model_path)

		# load label lookup
		model_path = os.path.join(MODELS_DIRECTORY, 'custom-opencv-lookup.json')
		with open(model_path, 'r') as fp:
			people = json.load(fp)

		# reverse lookup
		index_to_person = {idx: person for person, idx in people.items()}

		# predict
		face_index, confidence = recognizer.predict(face)

		return index_to_person[face_index], confidence
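
As a usage sketch, suppose face is a grayscale crop produced by the detection step; the class name and the threshold below are placeholders rather than values from the actual project:

	# hypothetical usage: 'face' is a grayscale crop from the Haar Cascade detection step
	name, confidence = FaceRecogniser.infer_lbph_face_recogniser(face)
	# 'confidence' is a distance, so smaller means a more certain match;
	# the threshold of 70 is an illustrative guess, not a tuned value
	if confidence < 70:
		print(f'Recognised {name} (distance {confidence:.1f})')
	else:
		print('No confident match for this face')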

It is important to note that the model is far from perfect, but it works well enough for our exploratory use case. If I were willing to spend more time on this, I would be more analytical in assessing the accuracy of this model, comparing it with other models and increasing the number of people and the variety of image conditions. I also found that both training and inference were very quick, so I didn't have to resort to any performance optimisation tricks such as training on a separate, more powerful machine, or inferring every N frames rather than every single frame. 😌

The whole process, including the camera aiming ability, can be seen in the following diagram.

Overall Process


Showtime!

I've set up a script to show a green light when it detects my face, a purple light when it detects Alleeya's face, and an orange light when it detects both of our faces. There is a fair amount of lag in the video, but the motion and switches are somewhat smoother in person.

Tracking
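
Under the hood, the light switching is just a mapping from the set of names recognised in the current frame to a colour. Below is a minimal sketch of that decision logic; the label strings are placeholders for whatever folder names were used during training, and the actual LED control is hardware specific so it is omitted:

	# hypothetical mapping from the faces recognised in a frame to an LED colour
	def choose_led_colour(recognised_names):
		names = set(recognised_names)
		if {'me', 'alleeya'} <= names:  # both known faces are present
			return 'orange'
		if 'me' in names:
			return 'green'
		if 'alleeya' in names:
			return 'purple'
		return None  # no known face, so the LED stays off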

What I've shown is that it is fairly simple to build a robot that can track faces around a room and react uniquely to specific faces. Although the current setup has limited practical use, these types of systems are used in a variety of applications, from security to more convenient shopping experiences.