One of many critical considerations when selecting a face recognition SDK or system is the accuracy of the underlying algorithm. An algorithm may satisfy your hardware requirements, licensing budget, support needs, and other decisive factors, but if it is not accurate enough the integrated solution will not garner interest from users.

In this article we discuss several critical factors to consider when assessing the accuracy of face recognition algorithms. This overview will help ensure that you do not waste your time attempting to build a system that current face recognition technology cannot support, that you understand which vendors meet your requirements, and that you can clearly pass on to your customers or users comprehensive information regarding the algorithm you have integrated.

How Accurate is a Face Recognition Algorithm?

Face recognition algorithm providers are often asked “How accurate is your face recognition solution?” The problem with this question is that there is no single answer.  

In order to answer this question in earnest, one would need to provide accuracy statistics across all operating ranges (low to high security and/or low to high automation), for all applications (identity verification, real-time screening, or identity discovery), with accuracy variations from every environmental factor both in isolation and in combination with every other environmental factor. This is generally not possible, as the amount of data needed to generate statistically significant accuracy measures across all combinations of environmental factors and applications is too large to reasonably collect.

Pitfall: Despite the inability to properly provide an answer to an algorithm’s accuracy, representatives from unscrupulous companies may offer up a single accuracy metric regardless. Their answer may only be true for certain favorable environments that may not be representative of your use case. Always be prepared to measure accuracy yourself.

How then can we effectively determine accuracy?

The answer is that you should instead be defining your requirements, and in turn answering the question yourself.

Specifically, the following steps should be followed:

  1. Define the accuracy requirements for your application to be successful.
  2. Define the environmental factors your deployed system will contend with.
  3. Downselect potential solutions based on NIST FRVT results and other business factors (e.g., provider’s geographic region).
  4. Collect an internal dataset that replicates the environmental factors of your operational system, to include operational data itself.
  5. Internally test algorithms against the collected dataset(s) to determine if they meet your accuracy requirements.

If that sounds intimidating, don’t worry, it is not that hard and we will walk you through all of these steps below!

Defining accuracy requirements

If you cannot state the accuracy your system needs to be successful, then you should reconsider your plans to develop the solution. This is not to say that determining the required accuracy is easy, but all face recognition systems will have errors. Much as the most astute transportation security agent will occasionally be duped by a false passport, or incorrectly think a valid passport is fake, even the most accurate face recognition algorithms in the world will be subject to some degree of error. In some scenarios these errors are exceedingly rare, but they still exist and need to be accounted for when designing a system. For these reasons you need to be able to define the amount and types of error your planned system can tolerate in order to be successful.

When defining accuracy requirements, the two key values to consider are the Type I and Type II error rates that are tolerable.

Depending on the application, there can be vastly different requirements for Type I errors (i.e., false positives) and Type II errors (i.e., false negatives). For example, if the application is to identify loyal customers in a retail environment, then the acceptable Type I and Type II error rates would be significantly higher than in a border crossing environment. For a border crossing scenario, a Type I false acceptance error (e.g., accepting a non-matching ID) is far more egregious than a Type II false rejection error (e.g., initially rejecting a matching ID), as a human agent can adjudicate the false rejection and still allow the valid ID card holder entry.
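To make these two error rates concrete, below is a minimal sketch (in Python) of how they can be computed from genuine (same-person) and impostor (different-person) comparison scores. The scores and the threshold are illustrative placeholders, not the output of any particular SDK, and the sketch assumes a higher score means a more similar pair.

    # Minimal sketch: computing Type I (false accept) and Type II (false reject)
    # rates from similarity scores at a chosen threshold. The score lists and
    # the threshold below are illustrative placeholders only.

    def error_rates(genuine_scores, impostor_scores, threshold):
        # Type I error (false positive): an impostor comparison scores at or above the threshold.
        false_accepts = sum(1 for s in impostor_scores if s >= threshold)
        # Type II error (false negative): a genuine comparison scores below the threshold.
        false_rejects = sum(1 for s in genuine_scores if s < threshold)
        return false_accepts / len(impostor_scores), false_rejects / len(genuine_scores)

    genuine = [0.91, 0.87, 0.62, 0.95, 0.78]    # same-person comparison scores (illustrative)
    impostor = [0.12, 0.33, 0.08, 0.64, 0.21]   # different-person comparison scores (illustrative)
    far, frr = error_rates(genuine, impostor, threshold=0.6)
    print(far, frr)  # -> 0.2 0.0

Raising the threshold trades Type I errors for Type II errors, which is precisely why the tolerable balance between the two must come from your application rather than from the algorithm vendor.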

Once you have defined your accuracy requirements in a manner that is informed by the workflows and mission goals your product intends to support, a significant hurdle has been cleared. In addition to being able to clearly assess which algorithms meet your requirements, you will also be able to simulate how successes and errors will be handled by the business processes and workflows your application will support.

Defining Environmental Factors

Face recognition accuracy is dependent on at least the following external factors:

  • Illumination (e.g., studio lighting vs. outdoor environments)
  • Facial pose angle (e.g., the person is facing the camera vs. person is looking away from the camera)
  • Occlusion (e.g., person is wearing nothing on their face vs. person is wearing a mask over their mouth and sunglasses)
  • Resolution and compression (e.g., the face image is high resolution and easily identifiable by a human vs. the image is low resolution, has compression artifacts, and is difficult for a human to determine the identity)
  • Motion blur and focus
  • Gender, Race/Ethnicity, and Age (e.g., a particular algorithm may have minor or significant changes in accuracy depending on the demographic cohort)
  • Facial expression (e.g., a particular algorithm may have minor or significant changes in accuracy depending on the presence of facial expression)

Each factor will affect accuracy to a different degree, and different algorithms will exhibit different sensitivities to each factor.

For these reasons, it is critical to first define the factors that will be present for your operational environment. And, while the above factors are numerous, generally speaking they can be grouped according to the following two considerations: (i) user assisted vs. unassisted, and (ii) constrained vs. unconstrained.

User assisted vs. unassisted, which could also be referred to as cooperative vs. noncooperative, denotes whether or not face imagery will be captured with explicit cooperation from the user. For example, if the user takes a selfie, looks into an access control camera, or has a standards-compliant passport image taken, they are cooperating and assisting the identification process. By contrast, when CCTV imagery is used, or event photography is collected, the persons being imaged are typically not making an effort to face the camera, stand a certain distance away, or otherwise accommodate the factors that influence the accuracy of a face recognition system.

Constrained vs. unconstrained refers to the capture environment. For a constrained environment, there is an opportunity to place cameras in a particular position, control the lighting, and configure other aspects of the image acquisition process. For example, an ID card booth will have studio lighting and a solid background, or a border crossing will have a controlled queue that has to be followed. For an unconstrained environment, regardless of whether there is user cooperation or not, there is not an opportunity to control these factors. For example, a selfie image capture does not allow the app developer to control the ambient lighting, the angle of the camera to the face, or other factors.

Based on these two meta-considerations, it is common to consider face recognition accuracy grouped into the following categories: “Frontal Constrained” (i.e., cooperative and constrained) [1], “Frontal Unconstrained” (i.e., cooperative and unconstrained), and “Non-Frontal Unconstrained” (i.e., noncooperative and unconstrained).

Downselect vendors from NIST FRVT reports

At this point you have defined the accuracy you need for your system to be successful, and you have also defined your environmental factors. While it may be tempting to start performing Internet searches for the different face recognition vendors, it is not reasonable to independently measure the accuracy of every face recognition solution prior to making a procurement.

Fortunately, there is an easy way to get a listing of all legitimate face recognition vendors. From there an initial downselect can be performed based on geographic location of the company, other business factors, and the independently measured accuracy on datasets and applications that are generally relevant to your intended application.

To support such a downselect, the National Institute of Standards and Technology (NIST) consistently performs third-party testing of face recognition accuracy from any vendor willing to submit their algorithm. These benchmarks are collectively referred to as the Face Recognition Vendor Tests (FRVT) and they are an invaluable resource for saving integrators time and money when assessing market solutions.

Currently, the most prominent NIST benchmark is FRVT Ongoing, which is updated every few months with the latest accuracies of nearly every relevant vendor.  Other benchmarks are performed every few years and look at specific face recognition applications, such as large scale search (1:N), video-based applications, or demographic estimation.

Pitfall: Vendors that do not submit to FRVT Ongoing and other NIST benchmarks should be avoided. There is no legitimate reason for a face recognition vendor not to submit to this benchmark.

In terms of NIST FRVT Ongoing, each published version is a multi-hundred page PDF document with a dense set of accuracy and efficiency measurements for each submitter. While analyzing the results could be a full-time job given the immense amount of information provided, a cursory inspection of the results is often all that is needed.

Specifically, Table 1 and Table 2 of the report include a snapshot of the accuracy metrics as well as all of the key efficiency statistics for all submitted algorithms. Accuracy is measured across datasets that roughly correspond to the three general sets of data conditions discussed above: Frontal Constrained is represented by “Visa” and “Mug-shot” images, Frontal Unconstrained by “Webcam” and “Selfie” imagery, and Non-Frontal Unconstrained by the “Wild” dataset (which corresponds to a photojournalism application).

For each data scenario, one of the initial tables (typically Table 2) lists the false reject rate (FRR) for each algorithm at a fixed false accept rate (FAR) of 10⁻⁴ (i.e., one false acceptance in 10,000 comparisons), as computed from a Receiver Operating Characteristic (ROC) or Detection Error Tradeoff (DET) curve. For the Visa dataset the FRR is also reported at a FAR of 10⁻⁶. While these snapshot accuracies may not precisely align with the metrics you defined in your requirements, they should be sufficient for indicating whether or not a given algorithm could be relevant (accuracy-wise) to your application.

Based on the defined environmental conditions, your scenario will generally correspond to one of the three data types that FRVT reports: Frontal Constrained, Frontal Unconstrained, or Non-Frontal Unconstrained. In analyzing the error rates for different vendors on the corresponding dataset, you can generally determine which vendors are relevant and which are not. Of course, if a FAR of 10⁻⁴ is not relevant to your scenario, there is a dense set of plots that show the FRR at all measurable FARs.
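If it helps to see the mechanics behind such an operating point, here is a rough sketch of how “FRR at a fixed FAR” can be read off your own score data: pick the loosest threshold whose FAR stays at or below the target, then report the FRR of the genuine comparisons at that threshold. This is a simplification, not NIST’s exact methodology, it assumes higher scores mean more similar pairs, and it is only meaningful when you have at least on the order of 1/FAR impostor comparisons (e.g., 10,000 for a FAR of 10⁻⁴).

    # Rough sketch (not NIST's exact methodology): estimate FRR at a fixed FAR
    # by choosing the loosest threshold whose false accept rate does not exceed
    # the target. Ties between impostor scores are ignored for simplicity.

    def frr_at_far(genuine_scores, impostor_scores, target_far):
        impostors = sorted(impostor_scores, reverse=True)
        allowed = int(target_far * len(impostors))   # number of false accepts we can tolerate
        if allowed >= 1:
            threshold = impostors[allowed - 1]       # exactly `allowed` impostors sit at/above this score
        else:
            threshold = impostors[0] + 1e-9          # must sit above the highest impostor score
        frr = sum(1 for s in genuine_scores if s < threshold) / len(genuine_scores)
        return frr, threshold

The same routine, run against your own internal dataset, also gives you a starting point for the similarity threshold your deployed system would use at that operating point.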

Pitfall: Certain vendors submit multiple algorithms to NIST benchmarks. These algorithms may have grossly different characteristics, or they may simply represent different versioned releases of the vendor’s algorithm. It is important to ask a vendor with multiple algorithm submissions about the differences between their submissions.

Based on the list of vendors that are relevant to your application, and after additionally reducing this set based on geographic and business considerations as well as efficiency metrics (which we will discuss in our next article), at this point you will hopefully have one or more face recognition vendors that you can choose from. You can now begin contacting each vendor!

When contacting a vendor, typically you should describe to them your intended application and ask them some of the following questions:

  1. Will they provide access to their solution for you to test internally?
  2. Is their algorithm available via a network-less SDK, an SDK that requires a network connection, or a cloud-only SaaS solution? Depending on your security requirements or your need to comply with privacy laws such as GDPR, this consideration can be critical.
  3. How long of a trial period do they support in order for you to test their algorithm internally?
  4. Which submission in FRVT does the algorithm they will send you correspond to?
  5. Can they provide a licensing cost estimate based on example purchase orders you intend to make if you deploy your product?  

After dialogue with each vendor you should hopefully receive access to the algorithms you seek to test.

Collect an internal dataset

Collecting an internal dataset is a vital step not only in the procurement of a face recognition algorithm, but also for properly integrating and maintaining the algorithm within your system. While the NIST reports were critical in your downselection to a few different face recognition solutions, there will still be a lot of uncertainty as to whether or not a given solution will meet your requirements.

Collecting a dataset is arduous. In some cases you may be tempted to just use a publicly available dataset, but there are a few important concerns to highlight with publicly available datasets. First, third-party datasets are unlikely to precisely replicate your operational environment. Second, if the data is publicly available, there is a chance that the algorithm you are evaluating used that dataset (along with others) to train its underlying statistical models. And, in such a case, testing on the same subjects, let alone the same images, that an algorithm was trained on will result in a significantly biased evaluation.

A chief benefit to your team collecting your own data is that you should be able to mimic your operational environment quite closely, and you can sequester the data such that no vendor’s performance will be biased, for better or worse, on the data.

In terms of the basic procedure for collecting a dataset yourself, here are some general guidelines:

  1. You will generally want at least 50 different persons to participate in your data collection in order to generate statistically significant results, though the precise minimum number will depend on your accuracy requirements and application.
  2. The cameras you use should be the same as you intend to deploy with the system. If you are also trying to decide between different camera models, then collect imagery from the different camera models you are deciding between. If you will not be able to control the camera (e.g., a mobile app) then this should be simulated by using multiple cameras, though these days algorithms are usually not sensitive to the camera used.
  3. You should consider the time lapse between the images you collect from a given person. For example, collecting multiple images of the same person on the same day will be an easier recognition task than collecting the images several days or weeks apart [2].
  4. The capture environment should be replicated as precisely as possible. If your system will capture images indoors with controlled illumination, you should simulate that scenario. Or, if the capture environment is outdoors, capture images outdoors. If it is both, capture both and be prepared to measure accuracy separately for each scenario. If your system will first compress the images or videos prior to passing them into a face recognition algorithm, then make sure that same compression technique is applied prior to using the images for evaluation. One way to record these capture conditions alongside each image is sketched after this list.
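As one illustration, a simple manifest that records the relevant capture conditions for every image makes it easy to slice accuracy results by camera, time lapse, or environment later on. The file layout and field names below are assumptions for the sake of the sketch, not a required format.

    # Illustrative sketch of a collection manifest; the fields and file layout
    # are assumptions, not a prescribed format.
    import csv

    FIELDS = ["image_path", "subject_id", "capture_date", "camera_model",
              "environment", "occlusion", "notes"]

    rows = [
        {"image_path": "images/subj001/cam_a_001.jpg", "subject_id": "subj001",
         "capture_date": "2021-03-01", "camera_model": "camera_a",
         "environment": "indoor_controlled", "occlusion": "none", "notes": ""},
        # ... one row per captured image, ideally with sessions days or weeks apart
    ]

    with open("collection_manifest.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)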

There are other factors to consider when collecting internal data, and you are encouraged to think closely about any nuances to your environment that you should potentially mimic in your collection, particularly as it pertains to the factors influencing face recognition algorithm accuracy listed above.

A more advanced consideration is also how this dataset will evolve over time. That is, if you are successful in your development of a face recognition system then you will be measuring accuracies for years to come as new versions of your vendor’s SDK are (hopefully) released, or if you ever decide to re-compete your face recognition procurement.

If you do add to your dataset over time, note that this will generally change the accuracies you measure, and thus an apples-to-apples comparison of two different face recognition algorithms should occur on the same dataset snapshot. Alternatively, an apples-to-apples comparison of two different dataset snapshots should occur using the same algorithm to understand any change in difficulty between two different dataset snapshots. In other words, don’t simultaneously change both the algorithm and the dataset and then attempt to directly compare the accuracy between the two sets of results.

Pitfall: Not collecting an internal benchmarking dataset not only limits the ability to independently determine the best algorithm for your system, it also makes it difficult to properly select similarity thresholds and other integration parameters for your deployed system. 

Internally testing algorithms

With data in tow, you will be able to measure the accuracy of different algorithms using your data as well as inform operational parameters for using a given algorithm. Further, you will get a good feel for the usability and stability of a given SDK as you integrate it into your evaluation harness.
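Because you will typically point the same harness at more than one SDK, it can help to hide each vendor’s API behind a thin, common interface. The class and method names below are hypothetical placeholders; every vendor exposes its own enrollment and comparison calls, which you would adapt behind an interface like this.

    # Sketch of a vendor-neutral wrapper so one evaluation harness can drive
    # multiple SDKs. VendorXMatcher and the calls on self._sdk are hypothetical
    # stand-ins for whatever API a given vendor actually provides.
    from abc import ABC, abstractmethod

    class FaceMatcher(ABC):
        @abstractmethod
        def create_template(self, image_bytes: bytes) -> bytes:
            """Detect the face and return an enrollment template (raise on failure)."""

        @abstractmethod
        def compare(self, template_a: bytes, template_b: bytes) -> float:
            """Return a similarity score, where higher means more similar."""

    class VendorXMatcher(FaceMatcher):
        def __init__(self, sdk_handle):
            self._sdk = sdk_handle                      # however the vendor's SDK is initialized

        def create_template(self, image_bytes: bytes) -> bytes:
            return self._sdk.enroll(image_bytes)        # hypothetical vendor call

        def compare(self, template_a: bytes, template_b: bytes) -> float:
            return self._sdk.verify(template_a, template_b)  # hypothetical vendor call

The harness then iterates over image pairs from your manifest, builds the genuine and impostor score lists, and feeds them into the same error-rate calculations described earlier, so every vendor is measured identically.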

The goal of your testing will be to generate accuracy metrics that are comparable to the accuracy requirements you set forth in the first step of this process. Not only will this internal testing allow you to understand whether a given algorithm meets these accuracy requirements, it will also allow you to measure the impact of the different parameter options supported by the algorithm.

If the accuracies you measure are grossly disproportionate to the accuracies reported in the NIST FRVT results, then you may have made a mistake when integrating the algorithm. Aside from a code review of your implementation, you should reach out to the algorithm provider to double-check the correctness of your integration. Even if integration and testing go smoothly and you don’t strictly need such help from the vendor, sending a few technical questions to the vendor’s support team can be an important means of gauging the responsiveness and effectiveness of their support.

Using the information learned

It may be the case that after following the previous steps there is a clear and obvious decision as to which vendor meets your accuracy requirements. Or it may be the case that multiple vendors meet your accuracy requirements. Regardless, in properly following the steps provided in this article you should be confident that you have properly assessed the solutions available and that you have precise awareness of which solutions meet your accuracy needs and which do not.

As we will discuss in subsequent articles, accuracy is but one critical consideration when procuring a face recognition solution. Much like some face recognition algorithms are non-procurable if they do not meet accuracy requirements for a given application, if an algorithm uses too many computational resources it may significantly increase the bill of materials for an intended system design. Alternatively, perhaps a solution will be too difficult to integrate, the technical support is delayed and often lost in translation, the licensing costs are exorbitant, it is difficult to receive a well-defined contract or terms of use, or there are any number of other critical considerations that could undermine your hard work and capital investment when building the next great face recognition application. For now, though, you should hopefully feel more confident about your ability to understand the accuracy of a given face recognition algorithm as it pertains to the goals of your system.

Do you like this article, or have questions or comments? Well, we’d love to hear back from you. Please comment below, reach out to us at www.rankone.io/contact, or post to the LinkedIn thread for this article!

[1] ISO/IEC 19794 Information technology – Biometric data interchange formats – Part 5: Face image data, https://en.wikipedia.org/wiki/ISO/IEC_19794-5

[2] B. Klare and A. K. Jain, “Face Recognition Across Time Lapse: On Learning Feature Subspaces”, IJCB, Washington, DC, Oct. 11-13, 2011.
