Machine learning technologies hold the potential to revolutionize decision-making. But how can we ensure AI systems are free of bias? Our experts weigh in.
What’s in a name? Sometimes, more than might be expected. Assistant Professor of Computer Science Anjalie Field has shown that something as seemingly innocuous as people’s names can offer insight into how artificial intelligence and machine learning can get it wrong, and by extension, do wrong.
Last year, Field—a member of the Center for Language and Speech Processing—was part of a team of researchers investigating the possible application of natural language processing (NLP) to assist the nation’s network of child protective services agencies in better serving and protecting vulnerable children at risk of abuse or neglect. NLP is a form of artificial intelligence (AI) that processes large datasets of normal human language using rule-based or probabilistic machine learning to contextualize and understand written communication.
Case workers at protective services agencies take careful and extensive notes on every family that enters the system, but those notes are rarely analyzed in any systematic way to learn how to better respond to, and possibly prevent, child abuse. In the social services community, there is tremendous interest in using NLP and AI to help better manage the more than 3.6 million calls that come into child protection agencies across the United States each year.
Field’s research team applied NLP tools to 3.1 million contact notes collected over a 10-year period in an anonymous child protective services agency. They found that the NLP model did a poorer job of recognizing the names of African American individuals than those of white individuals, a gap that could affect tasks such as identifying relatives who might support or shelter endangered children. “Here is an unlooked-for area where machine learning can amplify what is already a well-documented racial bias in child protective services without anyone even recognizing the problem,” Field says.
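One way to make a disparity like this visible is to compare, group by group, how often a name tagger finds the names that human annotators marked. The following is a minimal sketch of that comparison, not the research team's actual method: the function name, the group labels, and the toy annotations are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def recall_by_group(examples):
    """Return the fraction of annotated names the tagger found, per group.

    Each example is a (group, recognized) pair, where `group` is a
    hypothetical label attached to an annotated name and `recognized`
    says whether the tagger found that name.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for group, recognized in examples:
        total[group] += 1
        if recognized:
            found[group] += 1
    return {group: found[group] / total[group] for group in total}

# Made-up annotations for illustration only (not data from the study).
toy_examples = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", True),
]

recalls = recall_by_group(toy_examples)
gap = recalls["group_a"] - recalls["group_b"]
print(recalls, gap)
```

A gap near zero would suggest the tagger serves both groups about equally well on this measure; a large gap flags the kind of hidden disparity Field describes.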
As AI and machine learning technologies increasingly inform decision-making, and in some situations make the decisions themselves, concern is growing over issues of justice, equity, and fairness. At the Whiting School, faculty researchers are engaging with the fundamental issue of fairness in artificial intelligence across a wide spectrum of potential uses and misuses. These efforts range from unleashing the newfound power in neural network–based computing to search for and identify existing biases, to combatting the known tendency of machine learning systems to learn and propagate unfair practices.
Investigate truth and fairness in machine learning and you quickly discover there is more than one aspect to the challenge. Field, whose research focuses on the ethics and social science aspects of NLP, says her work follows three broad directions. “One of them is computational social science: How can we use language analysis and automated text processing to identify social injustices that already occur in society?” This approach employs machine learning technology as a tool for identifying bias. For instance, Field has used NLP to investigate the language Wikipedia articles use to portray different groups of people. “Do articles about women emphasize their personal lives more than their careers, whereas articles about men talk more about their careers? When you employ these tools to look across all of Wikipedia and compute statistics around the data, you can start to see these kinds of patterns and biases,” she says.
A second line of her research focuses on building technology to address known issues of existing bias, particularly in public service domains. As part of her work looking at child welfare cases, she used NLP to review thousands of case records, categorized by race, for the frequency of certain words and terms, to identify persistent institutional bias. She found differing language, but not enough to provide conclusive evidence of racial bias. “This is sort of natural language processing for the social good,” is how she describes it.
The third direction looks at potential ethical issues within AI itself. “Some of that is directly looking at model bias—is it going to favor certain groups of people more than others?—but also issues like access to technology, privacy, and data ownership.” In this broader subject area of bias, it is the data itself that is typically of greatest concern, but Field emphasizes that bad data should never permit bad results. “I think there’s often a narrative in AI that ‘the data is biased so it’s not our fault/there’s nothing we can do about it,’ but that’s not true,” she says.
Mark Dredze, John C. Malone Professor of Computer Science, is a pioneer in using AI tools to gain insights into public health challenges ranging from suicide prevention and vaccine refusal to tobacco use and gun violence. He, too, has seen how bad data can amplify bad outcomes. He says early missteps in machine learning, such as when Microsoft had to shut down its chatbot Tay just 16 hours after its initial release because it was spewing obscene and racist language it had picked up online, highlight the dangers of working with bad data.
“The problem is that if we just accept the data and say to the machine, ‘Learn how to make decisions from the data,’ when there are biases in the data, or biases in the process of how we learn from data, we will produce biased results. That is the essence of the problem of fairness in machine learning algorithms,” Dredze says.
This applies to his own research, some of which involves scanning the web to understand how people turn to the internet for medical information and what innate biases they are likely to encounter there. In a recent paper he co-authored, researchers evaluated biases in context-dependent health questions, focusing on sexual and reproductive health care queries. They looked at questions that cannot be properly answered without specific additional information, posed without that information. For instance, “Which is the best birth control method for me?” has no single correct answer, as it depends on sex, age, and other factors. Dredze and colleagues found that large language models will often simply provide answers reflecting the majority demographic, for example, suggesting oral contraceptives, an option available only to women, while neglecting to mention condoms. This kind of built-in bias is a special concern for individuals who turn to the web as a replacement for traditional health care advice, since misinformed answers can harm users’ health, Dredze says.
This research and his earlier pioneering work in “social monitoring” (employing machine learning to gain understanding from text published on social media sites) have led him to focus not just on the raw data, but also on how people use the web.
“I would describe this as a more holistic approach, where we’re actually building systems and paying attention to how people interact with those systems,” he says. “Where did our data come from? How did we collect it? I have to care about these issues when giving data to the algorithm and figuring out what the algorithm does. But then I also need to account for the fact that a human will interact with us. And humans are going to have their own biases and issues. So maybe it’s not just that the system is biased or unbiased, but it interacts with someone to create a different kind of bias.”
The Data Decides
So how do you make sure AI and NLP-based programs such as ChatGPT are fair and free of bias? Understanding the nature of the challenge requires grasping a fundamental paradigm shift. “What’s different is that in the traditional ways computers have been used, the programs themselves were the issue: the code made the decision,” explains cybersecurity expert Yinzhi Cao, assistant professor of computer science and technical director of the JHU Information Security Institute. “For AI, the most important thing is the data, because AI learns from the data and makes decisions from what it learns.”
It is that capability of making autonomous decisions that has the potential to be especially unfair, Cao says. His research has recently focused on security, privacy, and fairness analysis of machine learning systems. He notes that these three concerns can overlap in surprising ways.
Cao was one of a group of researchers to successfully overcome safeguards on two of the better-known text-to-image generative models that use AI to create original images from written prompts. Typically, these art generators are designed with filters to block violent, pornographic, or other objectionable content. But Cao and colleagues showed that the right algorithm could be used to bypass filters and create images that are not simply unsuitable but also could be used to defame or malign individuals “like a politician or famous person being made to look like they’re doing something wrong,” Cao said in a press release announcing the results of the team’s research efforts.
Cao also conducts research that applies AI to medical image analysis to augment medical training and diagnosis in the detection of Lyme disease, a project he has worked on with colleagues at the Applied Physics Lab and the School of Medicine. An early symptom in 70% to 80% of people with Lyme disease is the appearance of a distinctive “bull’s-eye rash,” which is usually a single circle of inflamed skin that slowly spreads from the site of the tick bite. In fair-skinned people, and especially in younger people with unblemished skin, it is easily detected. For clinicians, however, the most useful examples are found in skin types where the contrast is not as distinctive and readily apparent. “The final goal may be that we will be able to ask AI to actually perform the diagnosis, but at this stage, we are using it to generate training images that overcome disparities in age and race and gender,” he says.
Algorithms Are Everywhere
Making AI decisions understandable may first require overcoming a larger challenge: The term algorithm itself is poorly understood and raises math anxiety in many people. But Ilya Shpitser, the John C. Malone Associate Professor of Computer Science, whose work focuses on algorithmic fairness in datasets of all types, points out that many people base everyday decisions on algorithms, even if they don’t know it.
“When doctors diagnose, when judges set bail, they have a sequence of steps they’ve learned that’s considered reasonable,” he says. “Regardless of how they think of it themselves, they are using algorithms, because judicial decisions and diagnosis cannot be arbitrary; they better be systematic. The fact that similar cases are decided in similar ways: That’s what an algorithm really does.”
To set bail appropriately or to make an accurate diagnosis, a judge or a doctor needs, above all, good, fair, accurate information. In algorithmic decision-making, it all comes down to the data. And in an imperfect world, for any decision, there will be good data, there will be bad data, and most vexingly of all, there will be data we simply don’t have.
“Any person who works in actual data, real data of any kind, has missing data in their sets, that’s just how it is, basically,” says Shpitser. He cites a common example from electronic health records, where missing data can result from a lack of collection, as when a patient was never asked about asthma. Or it can result from a lack of documentation, as when a patient was asked about asthma but the response was never recorded in the medical record. “Lack of documentation is particularly common when it comes to patients not having symptoms or presenting comorbidities.” In these cases, rather than recording a negative value for each potential symptom or comorbidity, the missing data fields are left blank and only the positive values are recorded, which skews the value of the data, says Shpitser. “This makes it essentially impossible to differentiate between the lack of a comorbidity, the lack of documentation of a comorbidity, or the lack of data collection regarding the comorbidity.”
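The ambiguity Shpitser describes can be illustrated with a small sketch. It assumes a hypothetical record field that can hold a documented yes (`True`), a documented no (`False`), or a blank (`None`), and compares an estimate that silently treats blanks as negatives with simple worst-case bounds that make no assumption about the blanks. The bounds are a generic illustration, not the causal inference methods used in Shpitser's research, and the values are made up.

```python
# Hypothetical values of an "asthma" field across eight patient records:
# True = documented yes, False = documented no,
# None = left blank (never asked, or asked but never recorded).
asthma_field = [True, None, None, False, True, None, None, None]

def prevalence_blank_as_negative(values):
    """Estimate prevalence while treating every blank as a 'no'."""
    return sum(1 for v in values if v is True) / len(values)

def prevalence_bounds(values):
    """Bound prevalence without assuming what the blanks mean.

    Lower bound: every blank is really a 'no'.
    Upper bound: every blank is really a 'yes'.
    """
    yes = sum(1 for v in values if v is True)
    blank = sum(1 for v in values if v is None)
    n = len(values)
    return yes / n, (yes + blank) / n

naive = prevalence_blank_as_negative(asthma_field)
low, high = prevalence_bounds(asthma_field)
print(naive, low, high)
```

The naive estimate coincides with the lower bound, which shows that treating a blank as a negative quietly adopts the most optimistic reading of the record. The width of the interval reflects how much is simply unknown.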
One of the central challenges in creating fair and accurate algorithms then becomes devising sound methods of correcting for data recorded incompletely, incorrectly, or not at all. “I work on data being screwed up,” is how Shpitser describes it. Along the way, he has demonstrated in his research that it is at least theoretically possible to “break the cycle of injustice” (in which variables such as gender, race, disability, or other attributes introduce bias) by making optimal but fair decisions. His research employs the methodology of causal inference, which he describes as “methods to adjust for incomplete, bad, or missing data to allow reliable and fair inferences to be made.”
Fairness Isn’t an Oracle Concept
In the past few years as AI systems have caught increasing media attention, highly public machine learning misfires have brought scientists and engineers a deeper awareness of both the importance and the difficulty of designing and implementing systems equitably.
“For a long time, we said, ‘Look, the algorithms are math, and math is math,’” says Dredze. “It was, ‘Let’s throw the math at this, and it comes up with what it comes up with.’” That attitude no longer applies. “I think we’ve learned a couple things. One is that the math might be math, but the data is not the data: It always has some kind of bias in it. And we need to do something about that.”
But that may only be the beginning. “The other thing we’re learning—and maybe there’s a little controversy to this—is that math isn’t just math. Math always has some assumptions to it. The models that you pick always have some assumptions, and for a variety of reasons we might favor certain models that do better on some groups. And so it’s not only a matter of the data; it’s also a matter of the models we build. How do we make our models aware that fairness is a thing? How do we build into the models some measure of fairness?”
Which points ultimately to issues that transcend both the data and the math.
“Fairness isn’t an Oracle concept,” notes Dredze. “If you’ve got kids and you’ve ever tried to give them anything, they complain about fairness, right? And you end up telling them that life isn’t fair. Fairness is subjective: that person got a smaller piece of cake, but they had a frosting flower, and you didn’t get a flower.”
It is, he points out, tremendously challenging to formalize concepts of fairness. “We can build that into our models and train them to be aware. But who decides what the right definition of fairness is? Think about the most controversial issues in society, like college admissions. Both sides of affirmative action and college admissions are insisting we need to be fair, but they have opposing views as to what that means,” he says.
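The difficulty of settling on one definition can be shown with a small sketch that scores the same set of decisions against two commonly used formal criteria. Demographic parity compares how often each group receives a positive decision, while equal opportunity compares how often each group's qualified members do. The groups, field names, and records below are hypothetical and constructed only to make the two criteria disagree; they are not drawn from the researchers' work.

```python
def selection_rate(records, group):
    """Share of a group's members who received a positive decision."""
    rows = [r for r in records if r["group"] == group]
    return sum(r["decision"] for r in rows) / len(rows)

def true_positive_rate(records, group):
    """Share of a group's qualified members who received a positive decision."""
    rows = [r for r in records if r["group"] == group and r["qualified"]]
    return sum(r["decision"] for r in rows) / len(rows)

# Made-up (group, qualified, decision) triples for illustration only.
toy = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 0), ("a", 0, 0),
    ("b", 1, 1), ("b", 1, 1), ("b", 1, 0), ("b", 1, 0),
]
records = [{"group": g, "qualified": q, "decision": d} for g, q, d in toy]

# Demographic parity: do both groups get positive decisions at the same rate?
parity_gap = abs(selection_rate(records, "a") - selection_rate(records, "b"))

# Equal opportunity: do qualified members of both groups get them at the same rate?
opportunity_gap = abs(
    true_positive_rate(records, "a") - true_positive_rate(records, "b")
)
print(parity_gap, opportunity_gap)
```

In this toy data, both groups are selected at the same rate, so the decisions look fair under demographic parity, yet qualified members of group b are selected only half as often as qualified members of group a. The arithmetic cannot say which criterion is the right one.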
All of which suggests that creating algorithms for a fairer world in the end will not be the purview of computer scientists alone.
“My take on algorithmic fairness research is that it’s not our job as researchers to decide what fair is. I think the discussions of fairness need to be discussions in the public square,” says Shpitser. “As an American citizen, I have my own opinions of what policies we should follow, but that’s a different path than wearing my hat as a researcher on algorithmic fairness. In other words, computer scientists are best suited to be the implementers, not the deciders, in notions of what is fair.”
He continues: “Whenever I give talks about this, I always get questions that try to push me into being some kind of priesthood that decides for people what fairness criteria to use. I really don’t think that is our job. I have as much grounds to advocate for a particular definition of fairness as any other citizen, but the fact that I work in algorithmic fairness doesn’t give me a special advantage. I’m using modern tools, but the questions themselves are much older than that.”