Video Face Recognition: Assessing Template Comparison Hardware Needs

Enrolling video frames into templates is the bottleneck for video processing applications in face recognition, but there is also a computational cost for using the generated templates for search and identity verification. This cost is often negligible, but for large-scale applications it can become large enough that it needs to be factored into procurement decisions.

In this article we discuss the computational considerations for template comparison tasks in video processing applications. We encourage first reading our article on the computational cost of generating templates in video processing applications, as this is the computational bottleneck. For readers unfamiliar with face recognition efficiency metrics, we also recommend reading our article on why efficiency metrics matter.

Hardware required for searching and comparing templates

Templates generated during the enrollment process will typically be further processed by a template comparison algorithm for the following purposes: tracking, consolidating, searching, and/or verifying. In discussing the computational demands of these tasks, the following information is needed:

Comparisons per Second: the number of template comparisons per second, per CPU core, that can be performed by the algorithm. Note that NIST FRVT Ongoing reports this statistic as Comparison Speed, where Comparison Speed = 1 / Comparisons per Second, which is the time it takes to perform a single template comparison.
Templates per Track: the number of templates retained in a “face track”, where a face track is a set, or subset, of consecutive templates of a person tracked in a video feed.
Max Faces: the maximum number of faces that will appear across all video feeds at a single given time.
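
As a rough illustration of the conversion between the two statistics, the following Python sketch turns a per-comparison time into Comparisons per Second. The function name, and the assumption that the time is expressed in nanoseconds, are ours for illustration only; check the units used in the report you are reading.

```python
def comparisons_per_second(comparison_time_ns: float) -> float:
    """Convert the time for one template comparison (assumed here to be
    given in nanoseconds) into comparisons per second, per CPU core."""
    if comparison_time_ns <= 0:
        raise ValueError("comparison time must be positive")
    return 1e9 / comparison_time_ns


# A hypothetical algorithm that takes 100 ns per comparison
print(comparisons_per_second(100))  # 10000000.0
```

A comparison time of 100 ns corresponds to the 1e7 Comparisons per Second figure used in the examples in this article.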

Tracking

Typically a first step in using templates generated from video feeds is to group the templates by the different identities present in the video feed.

Tracking faces in video generally requires all faces detected in a given video frame to be compared against all existing face tracks. In turn, the face templates can be assigned to the face track corresponding to the same identity. The computational cost of this operation is roughly determined as:

Tracking CPU Usage = (Max Faces)^2 * Templates per Track / Comparisons per Second

Note that Max Faces is representative of both the maximum number of faces present across all video feeds, and the number of face tracks being processed.

Typically the Tracking CPU Usage is extremely low. For example, if there are at most 10 faces in a set of video feeds at a given time (i.e., Max Faces = 10), and each face track retains 5 templates (i.e., Templates per Track = 5), and Comparisons per Second = 1e7 (1e7 = 10 million), then Tracking CPU Usage = 10^2 * 5 / 1e7 = 0.00005, which is a fairly trivial fraction of CPU usage.
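
The tracking estimate can be reproduced with a few lines of Python. This is a minimal sketch of the formula in this section, not an implementation of any particular tracker; the function and parameter names are hypothetical.

```python
def tracking_cpu_usage(max_faces: int, templates_per_track: int,
                       comparisons_per_second: float) -> float:
    """Estimate tracking CPU usage: every detected face is compared
    against every template in every existing face track."""
    comparisons = max_faces ** 2 * templates_per_track
    return comparisons / comparisons_per_second


usage = tracking_cpu_usage(max_faces=10, templates_per_track=5,
                           comparisons_per_second=1e7)
print(f"{usage:.6f}")  # 0.000050
```

Because the cost grows with the square of Max Faces, doubling the number of simultaneous faces quadruples the estimate.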

Though this approach does not require much processing power (if an algorithm has a fast comparison speed), some algorithms may use lower cost heuristics, such as adding templates to the track that is closest in spatial location (e.g., if faces in consecutive frames were detected in the same location, put them in the same track). However, approaches like this can suffer from poor tracking in the presence of densely located faces (e.g., two people close to each other) or other factors. Unless an algorithm has an unreasonably slow comparison speed, it should be assumed that the template comparisons described above are performed for tracking.

The output of the tracking step is typically sets of templates corresponding to the different identities present in the video feed. There will often be a subsequent task of either searching these templates against a gallery (1:N+1) or comparing them to a claimed identity (1:1), which are discussed below. However, it could also be the case that tracking and consolidating are performed to store the identities and no further comparisons will be required beyond tracking.

Consolidating

In order to maintain the Templates per Track, each time a new template is matched to a track during the Tracking step, a decision needs to be made whether or not to retain this template in the corresponding face track, and, if it is retained, which existing template to drop (in order to not exceed the Templates per Track limit).

There are many different techniques that can be employed to determine which template to retain. In some cases this will be templates with the highest quality scores, in other cases this will be templates with different characteristics (e.g., different face pose angles).

The most computationally exhaustive approach involves cross-matching all existing templates in a track, along with the newly detected template for the track. In turn, a template can be dropped based on the similarity information (e.g., drop the template that provides the least additional information). Cross-matching the Templates per Track + 1 templates in a track requires (Templates per Track + 1) * Templates per Track / 2 comparisons, so the computational cost of consolidating all tracks at once is:

Consolidating CPU Usage = (Max Faces * (Templates per Track + 1) * Templates per Track / 2) / Comparisons per Second

Similar to tracking, for consolidating, the CPU usage is typically very low. For example, if there are at most 10 faces in a set of video feeds at a given time (i.e., Max Faces = 10), and each face track retains 5 templates (i.e., Templates per Track = 5), and Comparisons per Second = 1e7, then Consolidating CPU Usage = (10 * 6 * 5 / 2) / 1e7 = 0.000015.
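
The consolidation estimate can be sketched in Python as follows. The pair count is computed with math.comb to make explicit that the formula counts every unordered pair among the retained templates plus the new one. The function and parameter names are hypothetical.

```python
import math


def consolidating_cpu_usage(max_faces: int, templates_per_track: int,
                            comparisons_per_second: float) -> float:
    """Estimate CPU usage for cross-matching each track's templates
    together with one newly matched template."""
    # Unordered pairs among (templates_per_track + 1) templates,
    # equal to (T + 1) * T / 2.
    pairs_per_track = math.comb(templates_per_track + 1, 2)
    return max_faces * pairs_per_track / comparisons_per_second


usage = consolidating_cpu_usage(max_faces=10, templates_per_track=5,
                                comparisons_per_second=1e7)
print(f"{usage:.6f}")  # 0.000015
```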

Searching

Oftentimes tracked faces are searched against a gallery in order to determine the person’s identity, which is also known as watch-list identification. This may be for different reasons, including determining if the person is on a security blacklist or on a VIP whitelist.

Typically this process is done once per face track. There are different ways that the templates in a face track can be searched against a gallery. For example, all the templates could be searched against the gallery, which is the most computationally burdensome approach (as well as the most comprehensive from an accuracy perspective). Or, a single template (such as the one with the highest quality score) can be used for a single search. We will assume that all templates in a face track are searched.

The computational cost for searching a face track against a gallery of N templates is:

Searching CPU Usage = Max Faces * Templates per Track * N / Comparisons per Second

For example, if there are 10 face tracks (i.e., Max Faces = 10), 5 Templates per Track, a watch-list gallery with 1e4 templates (1e4 = 10,000), and a Comparisons per Second of 1e7, then Searching CPU Usage = 10 * 5 * 1e4 / 1e7 = 0.05.
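
The same calculation as a minimal Python sketch, assuming (as stated above) that every template in every face track is searched against the full gallery. The function and parameter names are hypothetical.

```python
def searching_cpu_usage(max_faces: int, templates_per_track: int,
                        gallery_size: int,
                        comparisons_per_second: float) -> float:
    """Estimate CPU usage for searching all templates of all face
    tracks against a gallery of gallery_size templates."""
    comparisons = max_faces * templates_per_track * gallery_size
    return comparisons / comparisons_per_second


usage = searching_cpu_usage(max_faces=10, templates_per_track=5,
                            gallery_size=10_000,
                            comparisons_per_second=1e7)
print(f"{usage:.2f}")  # 0.05
```

If only a single template per track were searched instead, Templates per Track would be set to 1 in this estimate, reducing the cost fivefold in this example.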

If an algorithm has a fast comparison speed, then there will not be a significant amount of CPU usage for searching. However, while the Rank One algorithm performs roughly 1e7 Comparisons per Second (i.e., 10M comparisons per CPU core, per second), the average NIST FRVT algorithm is roughly 10x slower. In this example, 0.05 CPU usage would become 0.5 CPU usage, which means an additional CPU core would be required. Other NIST algorithms have 100x to 1000x slower comparison speeds, which creates significant CPU requirements to perform video-based watchlisting.
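
To see how comparison speed drives the core count, the sketch below repeats the searching example for algorithms that are 1x, 10x, 100x, and 1000x slower than 1e7 Comparisons per Second. Rounding the usage up to whole cores is a simplifying assumption of ours, and the function names are hypothetical.

```python
import math


def searching_cpu_usage(max_faces, templates_per_track, gallery_size,
                        comparisons_per_second):
    """Searching CPU usage, per the formula in this section."""
    return (max_faces * templates_per_track * gallery_size
            / comparisons_per_second)


def cores_for_search(slowdown: int) -> int:
    """Whole CPU cores to dedicate to searching (rounded up), for an
    algorithm `slowdown` times slower than 1e7 comparisons per second."""
    usage = searching_cpu_usage(10, 5, 10_000, 1e7 / slowdown)
    return math.ceil(usage)


for slowdown in (1, 10, 100, 1000):
    print(f"{slowdown}x slower: {cores_for_search(slowdown)} core(s)")
```

Under these assumptions, the 100x and 1000x cases call for 5 and 50 cores respectively for searching alone.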

In addition to the wide fluctuation in comparison speeds, the number of templates in the gallery N can significantly influence the CPU usage. In the example above, N was set to 1e4. This number is somewhat meaningful in that larger gallery sizes typically cannot be searched with stable accuracy in watch-listing applications. However, depending on the level of security involved in an application, and in turn the number of human analysts available to adjudicate watch-list match alerts, the size of N can be upwards of 1e6 (i.e., 1 million) or even beyond 1e8 (i.e., 100 million). In these cases several additional CPU cores may be required.
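
The searching formula can also be rearranged to ask how large a gallery a given CPU budget can support. The following Python sketch does this, assuming searching is the only load on the budgeted cores. It says nothing about whether accuracy remains acceptable at that gallery size. The function and parameter names are hypothetical.

```python
def max_gallery_size(cpu_budget_cores: float, max_faces: int,
                     templates_per_track: int,
                     comparisons_per_second: float) -> int:
    """Largest gallery size N for which
    Max Faces * Templates per Track * N / Comparisons per Second
    stays within cpu_budget_cores."""
    probes = max_faces * templates_per_track
    return int(cpu_budget_cores * comparisons_per_second // probes)


# One core, 10 face tracks, 5 templates per track, 1e7 comparisons/sec
print(max_gallery_size(1.0, 10, 5, 1e7))  # 200000
```

With the example values, a single core supports a gallery of about 200,000 templates, so galleries of 1e6 or 1e8 templates would need roughly 5 or 500 cores respectively.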

Verifying

It may be the case that face tracks are used for verifying a person’s identity. In this scenario the person claims to be a given identity, and face verification is performed by comparing the person’s presented face to the stored face template(s) corresponding to this person’s identity. The person may claim their identity by entering a pin, scanning an access card, providing an NFC token from their mobile phone, or other methods.

Typically in these cases there is only one face in each video stream, as most access control identity verification systems are designed to process one user at a time. The Max Faces value could still be higher than 1, though, as a central server could be processing multiple video feeds / access control points.

The computational cost for verifying from video streams is determined as follows, where Templates per Person is the number of templates stored for each identity in the system:

Verifying CPU Usage = Max Faces * Templates per Track * Templates per Person / Comparisons per Second

For example, if there are 10 face tracks (i.e., Max Faces = 10), 5 Templates per Track, 5 Templates per Person stored, and a Comparisons per Second of 1e7, then Verifying CPU Usage = 10 * 5 * 5 / 1e7 = 0.000025.
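
A minimal Python sketch of the verification estimate follows. The function and parameter names are hypothetical.

```python
def verifying_cpu_usage(max_faces: int, templates_per_track: int,
                        templates_per_person: int,
                        comparisons_per_second: float) -> float:
    """Estimate CPU usage for comparing every template in each face
    track against every stored template of the claimed identity."""
    comparisons = max_faces * templates_per_track * templates_per_person
    return comparisons / comparisons_per_second


usage = verifying_cpu_usage(max_faces=10, templates_per_track=5,
                            templates_per_person=5,
                            comparisons_per_second=1e7)
print(f"{usage:.6f}")  # 0.000025
```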

Aside from an algorithm with an extremely slow comparison speed, or a system that is processing a large number of face tracks (i.e., a high Max Faces), compute cost for verification is typically trivial.

Summarizing hardware costs for template comparison

Typically there is not a large CPU cost for the different tasks that may be performed after enrolling video frames (i.e., tracking, consolidating, searching, and/or verifying). However, certain factors may result in meaningful CPU resources being required. These include a slow template comparison speed, a large gallery for watch-listing/searching, and a large number of Max Faces (persons in the video streams at once).
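
The four estimates from this article can be gathered into one small Python calculator to compare their relative weight. This is a sketch under the assumptions stated in each section; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical container for the inputs used in this article."""
    max_faces: int
    templates_per_track: int
    comparisons_per_second: float

    def tracking(self) -> float:
        return (self.max_faces ** 2 * self.templates_per_track
                / self.comparisons_per_second)

    def consolidating(self) -> float:
        t = self.templates_per_track
        return self.max_faces * (t + 1) * t / 2 / self.comparisons_per_second

    def searching(self, gallery_size: int) -> float:
        return (self.max_faces * self.templates_per_track * gallery_size
                / self.comparisons_per_second)

    def verifying(self, templates_per_person: int) -> float:
        return (self.max_faces * self.templates_per_track
                * templates_per_person / self.comparisons_per_second)


w = Workload(max_faces=10, templates_per_track=5, comparisons_per_second=1e7)
print(f"tracking:      {w.tracking():.6f}")
print(f"consolidating: {w.consolidating():.6f}")
print(f"searching:     {w.searching(10_000):.6f}")
print(f"verifying:     {w.verifying(5):.6f}")
```

With the example values used throughout, searching dominates the other three tasks by roughly three orders of magnitude.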

For watchlisting / searching applications, there will also be memory requirements that will be directly based on the algorithm’s template size. Please refer to our previous article on the implications of template size for this consideration.

It is important to use the guidance in this article in conjunction with the information provided by your algorithm vendor, as well as their measured performance in the NIST FRVT Ongoing benchmarks. As always, a vendor who does not submit their algorithm to NIST FRVT should never be considered.
