
In a high-stakes poker game, knowing when your opponent is bluffing and being able to read a subtle smile or frown can make the difference between walking away with the pot or going home empty-handed.
But for AI, decoding these nuanced facial cues, whether for identity verification, bias detection, or facial recognition, remains a major challenge.
To tackle this, researchers at Johns Hopkins University developed FaceXBench, a comprehensive benchmark designed to assess and improve how multimodal large language models (MLLMs) interpret and analyze faces. The findings, “FaceXBench: Evaluating Multimodal LLMs on Face Understanding,” are posted on arXiv.
“FaceXBench evaluates MLLMs across 14 tasks in six key areas,” said co-author Vishal Patel, an associate professor in the Whiting School of Engineering’s Department of Electrical and Computer Engineering. “It tests bias and fairness by examining whether models perform equally well across different ages, genders, and racial backgrounds, or if they make more mistakes for certain groups. It also assesses face recognition, including identifying faces in high- and low-resolution images, even recognizing celebrities.”
Additionally, FaceXBench evaluates face authentication by measuring how well models detect deepfakes and resist spoofing, in which fake videos, images, or masks are used to trick an AI system into accepting an impostor as someone else. The benchmark further tests facial expression recognition, crowd counting, and head pose estimation, which makes facial recognition systems more accurate in real-world situations where people are not always facing the camera. Finally, it checks whether models can find and use tools or techniques related to face analysis, such as algorithms for tracking faces across different video frames or images.
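To make the evaluation concrete, here is a minimal sketch of how a benchmark of this kind might score a multimodal model on multiple-choice face questions grouped by category. The question format, category labels, and the query_mllm helper are illustrative assumptions, not FaceXBench’s actual data or code.

```python
# Illustrative sketch only: the question format, categories, and query_mllm()
# stand-in are assumptions, not FaceXBench's released data or code.
from collections import defaultdict

SAMPLE_QUESTIONS = [
    {"category": "face authentication",
     "image": "samples/frame_017.jpg",
     "prompt": "Is this face real or a deepfake? A) real  B) deepfake",
     "answer": "B"},
    {"category": "bias and fairness",
     "image": "samples/portrait_102.jpg",
     "prompt": "Estimate this person's age group. A) 0-18  B) 19-40  C) 41-65  D) 66+",
     "answer": "C"},
]

def query_mllm(image_path: str, prompt: str) -> str:
    """Stand-in for a call to the multimodal LLM under test.
    Replace with a real model call; here it just returns a fixed guess."""
    return "A"

def evaluate(questions):
    """Compute per-category accuracy, the way a benchmark report might."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        prediction = query_mllm(q["image"], q["prompt"]).strip().upper()[:1]
        total[q["category"]] += 1
        correct[q["category"]] += prediction == q["answer"]
    return {cat: correct[cat] / total[cat] for cat in total}

print(evaluate(SAMPLE_QUESTIONS))
```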
The team tested 26 open-source models and two proprietary ones, GPT-4o and GeminiPro 1.5; its initial findings highlighted gaps in the models’ ability to interpret and analyze faces effectively.
“Even the most advanced AI models had trouble spotting deepfakes or counting people in crowds because these tasks require detecting complex patterns. Methods designed to improve performance, like giving extra instructions or guiding the model’s reasoning, didn’t help either,” said co-author Kartik Narayan, a PhD student in computer science. “Bias was also an issue, as the models sometimes gave different results based on age, gender, or race.”
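For illustration, the “extra instructions” and guided reasoning the quote refers to usually amount to different prompt templates. A rough sketch of the two styles, with hypothetical wording, might look like this:

```python
# Two common prompting styles for the same crowd-counting question.
# The wording is hypothetical; it only illustrates the kinds of prompt
# strategies the researchers found did not close the performance gap.

QUESTION = "How many people are visible in this image?"

# Direct prompting: ask for the answer outright.
direct_prompt = f"{QUESTION} Respond with a single number."

# Guided (chain-of-thought style) prompting: ask the model to reason first.
guided_prompt = (
    "Examine the image region by region and note every visible face or head. "
    f"Then answer the question: {QUESTION} Respond with a single number."
)
```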
To address these challenges, the team proposes expanding the training data used for MLLMs to improve accuracy, integrating face-processing APIs (sets of tools that let software systems communicate with specialized face-analysis services) for better performance, and developing techniques to reduce bias. These steps would help ensure fairer outcomes when models make predictions, such as identifying individuals, recognizing expressions, or detecting bias.
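One way to read the face-processing-API suggestion is as a tool-use pattern: a specialized face model handles the precise detection work, and its structured output is handed to the MLLM as extra context. The sketch below illustrates that idea; detect_faces and ask_mllm are hypothetical placeholders standing in for a real face-detection library and a real model call.

```python
# Hypothetical sketch of pairing an MLLM with a dedicated face-processing tool.
# detect_faces() and ask_mllm() are placeholders, not real library calls.

def detect_faces(image_path: str) -> list[dict]:
    """Stand-in for a specialized face detector returning boxes and scores."""
    return [{"box": [42, 30, 180, 170], "confidence": 0.98}]

def ask_mllm(image_path: str, prompt: str) -> str:
    """Stand-in for the multimodal LLM being evaluated."""
    return "One person is visible, facing slightly away from the camera."

def answer_with_face_tool(image_path: str, question: str) -> str:
    # Step 1: let the specialized tool do the pattern-heavy detection work.
    faces = detect_faces(image_path)
    # Step 2: pass the tool's structured findings to the MLLM with the question,
    # so it reasons over explicit evidence rather than raw pixels alone.
    context = f"A face detector found {len(faces)} face(s): {faces}."
    return ask_mllm(image_path, f"{context}\n\nQuestion: {question}")

print(answer_with_face_tool("group_photo.jpg", "How many people are in this image?"))
```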
“By identifying these weaknesses and offering a more structured path for improvement, FaceXBench lays the foundation for advanced, ethically responsible, and accurate face-understanding capabilities in MLLMs,” said co-author Vibashan VS, a PhD student in electrical and computer engineering. “As future researchers refine these models, they move closer to AI systems that can analyze and interpret human faces with the nuance and reliability required for real-world applications.”