When automated face recognition technology is used for analyzing streaming video, an important question is: how much computer hardware is needed?
The hardware required to process video depends on several factors which will be discussed in this article. After reading you should be able to determine hardware requirements for a particular application and algorithm.
If you are unfamiliar with the efficiency metrics relevant to face recognition algorithms, we recommend you first read our previous article on why efficiency metrics matter.
Enrolling frames: the computationally burdensome step in video processing
Two face recognition steps consume processing power in video applications: (i) detecting the faces in each processed video frame and enrolling them into templates, and (ii) searching the created templates against a gallery. If these concepts are unfamiliar, we encourage you to read our article on how face recognition works.
This article focuses on the efficiency requirements for enrolling faces in video frames into templates, which is the computational bottleneck for video processing applications. There are also computational demands for performing watch-list searches of the templates generated from video frames against gallery databases. However, because watch-list searching is not nearly as computationally burdensome, we cover the computational cost of template comparison in video applications in a supplemental article.
Enrolling video frames is a CPU-intensive task. To assess the number of CPU cores required, you first need to determine the following information:
- Number of Streams: the number of video streams that will be processed concurrently.
- Max Faces: the maximum number of faces that will appear across all video feeds at a single given time.
- Templates per Second: the number of templates per second the face recognition algorithm can generate on a single CPU core. Note that NIST FRVT Ongoing reports this statistic as Enrollment Speed, where Enrollment Speed = 1 / Templates per Second, which is the time it takes to generate a single template.
- Frames per Second: the number of frames per second (fps) the algorithm will process (as recommended by your vendor and use case).
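Because benchmark reports usually give the time per template rather than the rate, the conversion to Templates per Second can be sketched as below. This is a minimal illustration: the function name is hypothetical, and it assumes the enrollment time is quoted in milliseconds on a single CPU core, so check the units in the report you are reading.

```python
def templates_per_second(enrollment_time_ms: float) -> float:
    """Convert a per-template enrollment time into a per-core rate.

    Assumes enrollment_time_ms is the time, in milliseconds, to generate
    one template on one CPU core (the unit is an assumption).
    """
    if enrollment_time_ms <= 0:
        raise ValueError("enrollment time must be positive")
    # 1000 ms in a second, so rate = 1000 / (ms per template).
    return 1000.0 / enrollment_time_ms


# An algorithm that needs 250 ms per template yields 4 templates per second.
print(templates_per_second(250.0))
```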
Using this information, the number of CPU cores required for enrollment is roughly determined as follows:
Enrolling CPU Usage = Max Faces * Frames per Second / Templates per Second
If, however, the Enrolling CPU Usage is less than Number of Streams, then:
Enrolling CPU Usage = Number of Streams.
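The two rules above can be combined into a single expression. The ceiling brackets are an addition here, reflecting that a fractional result is rounded up to a whole number of cores:

```latex
\[
\text{Enrolling CPU Usage} =
\max\!\left(
  \text{Number of Streams},\;
  \left\lceil
    \frac{\text{Max Faces} \times \text{Frames per Second}}
         {\text{Templates per Second}}
  \right\rceil
\right)
\]
```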
For example, let’s suppose that your application will be processing 4 camera feeds (Number of Streams = 4) with a maximum of 10 faces at a time (Max Faces = 10). That is, across all 4 camera feeds, the largest number of faces that will appear at any one time is 10. Suppose also that your face recognition algorithm can generate 4 templates per second per CPU core (Templates per Second = 4) and that the vendor recommends processing 5 fps (Frames per Second = 5). This would mean 10 * 5 / 4 = 12.5, which implies the Enrolling CPU Usage is 13 (i.e., 13 CPU cores), as you should always round up.
As another example, let’s suppose that your application will be processing 6 camera feeds (Number of Streams = 6) with a maximum of 4 faces at a time (Max Faces = 4), and that your face recognition algorithm can generate 4 templates per second per CPU core (Templates per Second = 4) with a recommended rate of 5 fps (Frames per Second = 5). This would mean 4 * 5 / 4 = 5. Because 5 is less than the Number of Streams, 6, the Enrolling CPU Usage is 6 (i.e., 6 CPU cores).
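The same estimate can be scripted so that it is easy to rerun with different inputs. The sketch below is illustrative only: the function and parameter names are hypothetical, and it assumes the result should be rounded up to a whole core and never fall below the number of streams, as described above.

```python
import math


def enrolling_cpu_cores(
    num_streams: int,
    max_faces: int,
    frames_per_second: float,
    templates_per_second: float,
) -> int:
    """Rough estimate of the CPU cores needed to enroll video frames."""
    if templates_per_second <= 0:
        raise ValueError("templates_per_second must be positive")
    # Templates that must be generated each second, divided by what
    # one core can generate each second.
    raw_cores = max_faces * frames_per_second / templates_per_second
    # Round up to whole cores, with a floor of one core per stream.
    return max(num_streams, math.ceil(raw_cores))


# First example: 4 streams, 10 faces, 5 fps, 4 templates/s per core.
print(enrolling_cpu_cores(4, 10, 5, 4))  # 13
# Second example: 6 streams, 4 faces, 5 fps, 4 templates/s per core.
print(enrolling_cpu_cores(6, 4, 5, 4))   # 6
```

Treat the output as a starting point for sizing rather than a guarantee, since the formula itself is only a rough estimate.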
Using these guidelines to determine hardware costs for a vendor solution
It is important to use the guidance in this article in conjunction with the information provided by your algorithm vendor, as well as their measured performance in the NIST FRVT Ongoing benchmarks. As always, a vendor who does not submit their algorithm to NIST FRVT should never be considered.
The information you will need from your vendor is the enrollment speed and the recommended number of frames per second. The information you will determine yourself is the maximum number of faces appearing across all video streams at any given time and the total number of video streams. With these four values and the formula provided in this article, you will be able to estimate the hardware requirements for your application.