Examination of facial-analysis software shows an error rate of 0.8 percent for light-skinned men and 34.7 percent for dark-skinned women.
Three commercially released facial-analysis programs from major technology companies demonstrate both skin-type and gender biases, according to new research from MIT and Stanford University.
Joy Buolamwini, the MIT Media Lab researcher who conducted the study, said that at first glance the overall accuracy rate seemed high, even though all three companies detected men’s faces more reliably than women’s. But the error rates grew as she dug deeper.
“Lighter male faces were the easiest to guess the gender on, and darker female faces were the hardest to guess the gender on,” Buolamwini said.
Buolamwini built her own data set of images to test the systems, which included the faces of women politicians from other nations. White guy? No problem. Black female Rwandan parliament member? Does not compute.
All three companies identified light-skinned men’s faces with roughly 99 percent accuracy, but each showed a higher error rate when identifying darker-skinned women. IBM’s error rate was the highest, at close to 35 percent. Face++ clocked in at 34.5 percent, though it had the lowest error rate of the three in identifying dark-skinned men. Microsoft’s error rate for dark-skinned women was 20.8 percent.
“So when we’re looking at these systems that are relying on data, we have to be honest about the kind of data that’s being fed into it,” she said.
If programmers are training artificial intelligence on a set of images primarily made up of white male faces, their systems will reflect that bias. Buolamwini calls this the “Coded Gaze.” Another part of the equation could be that most programmers are white men.
“I think that definitely contributes to it,” Buolamwini said. “Because you might not even know to question your data or your benchmark if it’s reflective of you in the first place. If you don’t have a very diverse perspective, it can be easier to miss groups you’re not as familiar with.”
This can be problematic as facial recognition technology is increasingly relied on to decide everything from what you should pay for car insurance to when you’re likely to commit another crime. According to the Georgetown Center on Privacy and Technology, law enforcement has half of all American adults in its face recognition networks. But regulation and transparency are lacking, and our primary concern should be the potential threat to civil liberties, says Phillip Atiba Goff, president of the Center for Policing Equity at John Jay College of Criminal Justice.
“The more accurate computers get, the more likely they are to be used to take away things like privacy and liberty,” he said. “It’s never been the case in the course of black American history that black freedom fighters have been fighting for more state surveillance. So that’s my concern. If we don’t make our values the priority, then we’re going to end up with tools and systems that are out of line with our values. That’s how we ended up incarcerating more than one out of every 100 people in the United States.”
There’s no sign of state or federal legislation to impose standards on law enforcement. Massachusetts State Police use facial recognition software to scan the Registry of Motor Vehicles database of driver’s license photos when searching for a suspect. But which software they use is unknown.
To begin investigating the programs’ biases systematically, Buolamwini first assembled a set of images in which women and people with dark skin are much better represented than they are in the data sets typically used to evaluate face-analysis systems. The final set contained more than 1,200 images.
Next, she worked with a dermatologic surgeon to code the images according to the Fitzpatrick scale of skin tones, a six-point scale, from light to dark, originally developed by dermatologists as a means of assessing risk of sunburn.
Then she applied three commercial facial-analysis systems from major technology companies to her newly constructed data set. Across all three, the error rates for gender classification were consistently higher for females than they were for males, and for darker-skinned subjects than for lighter-skinned subjects.
For darker-skinned women—those assigned scores of IV, V, or VI on the Fitzpatrick scale—the error rates were 20.8 percent, 34.5 percent, and 34.7 percent. But with two of the systems, the error rates for the darkest-skinned women in the data set—those assigned a score of VI—were worse still: 46.5 percent and 46.8 percent. Essentially, for those women, the system might as well have been guessing gender at random.
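The audit described above comes down to a simple disaggregated error-rate computation: group each test image by skin type and gender, then compare predicted against true labels within each subgroup. A minimal sketch in Python—the record fields and sample data here are illustrative, not Buolamwini’s actual pipeline:

```python
from collections import defaultdict

def disaggregated_error_rates(records):
    """Compute gender-classification error rate per subgroup.

    Each record is a dict with 'fitzpatrick' (1-6), 'true_gender',
    and 'predicted_gender'. Subgroups follow the study's split:
    lighter skin (types I-III) vs. darker skin (IV-VI), crossed
    with gender.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in records:
        tone = "darker" if r["fitzpatrick"] >= 4 else "lighter"
        key = (tone, r["true_gender"])
        totals[key] += 1
        if r["predicted_gender"] != r["true_gender"]:
            errors[key] += 1
    return {k: errors[k] / totals[k] for k in totals}

# Illustrative data: the classifier errs on one of three
# darker-skinned female faces, echoing the worst commercial result.
sample = [
    {"fitzpatrick": 5, "true_gender": "F", "predicted_gender": "M"},
    {"fitzpatrick": 5, "true_gender": "F", "predicted_gender": "F"},
    {"fitzpatrick": 6, "true_gender": "F", "predicted_gender": "F"},
    {"fitzpatrick": 2, "true_gender": "M", "predicted_gender": "M"},
]
rates = disaggregated_error_rates(sample)
```

The point of disaggregating is exactly what the study demonstrates: an aggregate accuracy number can look excellent while one subgroup’s error rate approaches the 50 percent you’d get from coin-flipping on a binary task.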
“To fail on one in three, in a commercial system, on something that’s been reduced to a binary classification task, you have to ask, would that have been permitted if those failure rates were in a different subgroup?” Buolamwini says. “The other big lesson … is that our benchmarks, the standards by which we measure success, themselves can give us a false sense of progress.”
“This is an area where the data sets have a large influence on what happens to the model,” says Ruchir Puri, chief architect of IBM’s Watson artificial-intelligence system. “We have a new model now that we brought out that is much more balanced in terms of accuracy across the benchmark that Joy was looking at. It has a half a million images with balanced types, and we have a different underlying neural network that is much more robust.”
“It takes time for us to do these things,” he adds. “We’ve been working on this roughly eight to nine months. The model isn’t specifically a response to her paper, but we took it upon ourselves to address the questions she had raised directly, including her benchmark. She was bringing up some very important points, and we should look at how our new work stands up to them.”
At a time when there’s excitement about what artificial intelligence can do, Buolamwini says there’s a lot of blind faith in machine-learning algorithms.
“Yes, there are many things that it could do, but we have to be honest about how it’s being implemented. At the end of the day, who’s being harmed? Who’s benefiting? If the technology is flawed, it shouldn’t be used in the first place, and if this technology is going to be adopted, there have to be standards,” she said.
Buolamwini and a growing number of data scientists are offering to audit government, law enforcement and private systems for bias. Through the collective she founded, the Algorithmic Justice League, people can make these requests or report their experiences with bias. She wants to help people hold tech giants accountable.
“You have to be intentional about being inclusive because those in power reflect the current inequities that we have,” Buolamwini said.