This story was originally published in our July/August 2022 issue as “Ghosts in the Machine.”
If a heart attack isn’t documented, did it really happen? For an artificial intelligence program, the answer may very well be “no.” Every year, an estimated 170,000 people in the United States experience asymptomatic — or “silent” — heart attacks. During these events, patients likely have no idea that a blockage is keeping blood from flowing or that vital tissue is dying. They won’t experience any chest pain, dizziness or trouble breathing. They don’t turn beet red or collapse. Instead, they may just feel a bit tired, or have no symptoms at all. But while the patient might not realize what happened, the underlying damage can be severe and long-lasting: People who suffer silent heart attacks are at higher risk for coronary heart disease and stroke and are more likely to die within the following 10 years.
But if a doctor doesn’t diagnose that attack, it won’t be included in a patient’s electronic health records. That omission can come with dangerous consequences. AI systems are trained on health records, sifting through troves of data to study how doctors treated past patients and make predictions that can inform decisions about future care. “That’s what makes a lot of medical AI very challenging,” says Ziad Obermeyer, an associate professor at the University of California, Berkeley, who studies machine learning, medicine and health policy. “We almost never observe the thing that we really care about.”
The problem lies in the data – or rather, what’s not in the data. Electronic health records only show what doctors and nurses notice. If they can’t see an issue, even one as serious as a heart attack, then the AI won’t be able to see it either. Similarly, doctors may unwittingly encode their own racial, gender or socioeconomic biases into the system. That can lead to algorithms that prioritize certain demographics over others, entrench inequality and fail to make good on the promise that AI can help provide better care.
One such problem is that medical records can only store information about patients who have access to the medical system and can afford to see a doctor. “Datasets that don’t sufficiently represent certain groups — whether that’s racial groups, gender for certain diseases or rare diseases themselves — can produce algorithms that are biased against those groups,” says Curtis Langlotz, a radiologist and director of the Center for Artificial Intelligence in Medicine and Imaging at Stanford University.
Beyond that, diagnoses can reflect a doctor’s preconceptions and ideas (about, say, what might be behind a patient’s chronic pain) as much as they reflect the reality of what’s happening. “The dirty secret of a lot of artificial intelligence tools is that a lot of the things that seem like biological variables that we’re predicting are in fact just someone’s opinion,” says Obermeyer. That means that rather than helping doctors make better decisions, these tools are often perpetuating the very inequalities they ought to help avoid.
(Credit: Kellie Jaeger)
When scientists train algorithms to operate a car, they know what’s out there on the road. There’s no debate about whether there’s a stop sign, school zone or pedestrian ahead. But in medicine, truth is often measured by what the doctor says, not what’s actually going on. A chest X-ray may be evidence of pneumonia because that’s what a doctor diagnosed and wrote in the health record, not because it’s necessarily the correct diagnosis. “Those proxies are often distorted by financial things and racial things and gender things, and all sorts of other things that are social in nature,” says Obermeyer.
In a 2019 study, Obermeyer and colleagues examined an algorithm developed by the health services company Optum. Hospitals use similar algorithms to predict which patients will need the most care, estimating the needs of over 200 million people annually. But there’s no simple variable for determining who is going to get the sickest. Instead of predicting concrete health needs, Optum’s algorithm predicted which patients were likely to cost more, the logic being that sicker people need more care and therefore will be more expensive to treat. For a variety of reasons, including income, access to care, and poor treatment by doctors, Black people spend less on health care on average than their white counterparts. As a result, the study authors found, using cost as a proxy measure for health led the algorithm to consistently underestimate the health needs of Black people.
Instead of reflecting reality, the algorithm was mimicking and further embedding racial biases in the health care system. “How do we get algorithms to do better than us?” asks Obermeyer. “And not just mirror our biases and our errors?”
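The cost-as-proxy problem described above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the algorithm the study examined: the `Patient` record, the two groups "A" and "B", the dollar figures and the cutoff of six patients are all invented for illustration. It assumes that group B spends less per chronic condition than group A, then compares who gets flagged for extra care when patients are ranked by cost versus by number of conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    group: str       # hypothetical demographic label, "A" or "B"
    conditions: int  # number of chronic conditions (stand-in for true health need)
    cost: float      # annual spending, the proxy a cost-based score relies on


# Assumption for illustration: group B spends less per condition than group A.
SPEND_PER_CONDITION = {"A": 1000.0, "B": 750.0}


def make_patients():
    """Build ten patients per group with 1 to 10 chronic conditions each."""
    return [
        Patient(group, n, n * SPEND_PER_CONDITION[group])
        for group in ("A", "B")
        for n in range(1, 11)
    ]


def top_k(patients, key, k):
    """Return the k patients ranked highest by the given key."""
    return sorted(patients, key=key, reverse=True)[:k]


def count_group(selected, group):
    """Count how many selected patients belong to the given group."""
    return sum(p.group == group for p in selected)


patients = make_patients()
by_cost = top_k(patients, key=lambda p: p.cost, k=6)        # proxy-based ranking
by_need = top_k(patients, key=lambda p: p.conditions, k=6)  # need-based ranking

print("Group B patients flagged when ranking by cost:", count_group(by_cost, "B"))
print("Group B patients flagged when ranking by need:", count_group(by_need, "B"))
```

In this made-up data, both groups are equally sick, yet ranking by cost flags fewer group B patients than ranking by conditions does, and a group A patient with seven conditions is flagged ahead of a group B patient with eight. The size of the gap depends entirely on the invented spending figures; the point is only the direction of the distortion.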
Plus, determining the truth of a situation — whether a doctor made a mistake due to poor judgment, racism, or sexism, or whether a doctor just got lucky — isn’t always clear, says Rayid Ghani, a professor in the machine-learning department at Carnegie Mellon University. If a physician runs a test and discovers a patient has diabetes, did the physician do a good job? Yes, they diagnosed the disease. But perhaps they should have tested the patient earlier or treated their rising blood sugar months ago, before the diabetes developed.
If that same test was negative, the calculation gets even harder. Should the doctor have ordered that test in the first place, or was it a waste of resources? “You can only measure a late diagnosis if an early diagnosis didn’t happen,” says Ghani. Decisions about which tests get run (or which patients’ complaints are taken seriously) often end up reflecting the biases of the clinicians rather than the best medical treatment possible. But if medical records encode those biases as facts, then those prejudices will be replicated in the AI systems that learn from them, no matter how good the technology is.
“If the AI is using the same data to train itself, it’s going to have some of those inherent biases,” Ghani adds, “not because that’s what AI is but because that’s what humans are, unfortunately.”
If wielded deliberately, however, this fault in AI could be a powerful tool, says Kadija Ferryman, an anthropologist at Johns Hopkins University who studies bias in medicine. She points to a 2020 study in which AI is used as a resource to assess what the data shows: a kind of diagnostic for evaluating bias. If an algorithm is less accurate for women and people with public insurance, for example, that’s an indication that care isn’t being provided equitably. “Instead of the AI being the end, the AI is almost sort of the starting point to help us really understand the biases in clinical spaces,” she says.
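The idea of using an algorithm's accuracy as a diagnostic can be sketched as a simple per-group audit. This is a minimal illustration, not the method of the 2020 study: the function names `accuracy_by_group` and `accuracy_gap`, the group labels and the records themselves are all made up. Each record is assumed to hold a group label, the outcome written in the health record, and the algorithm's prediction.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute the share of correct predictions for each group.

    records: iterable of (group, recorded_outcome, predicted_outcome) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, recorded, predicted in records:
        total[group] += 1
        correct[group] += int(recorded == predicted)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Made-up audit records, for illustration only.
records = [
    ("private insurance", 1, 1),
    ("private insurance", 0, 0),
    ("private insurance", 1, 1),
    ("private insurance", 0, 0),
    ("public insurance", 1, 0),
    ("public insurance", 0, 0),
    ("public insurance", 1, 1),
    ("public insurance", 1, 0),
]

per_group = accuracy_by_group(records)
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.0%} of predictions match the record")
print(f"gap between groups: {accuracy_gap(per_group):.0%}")
```

A large gap is a prompt to look more closely at how care is delivered to the worse-served group, not a verdict in itself. The recorded outcome used as the benchmark here comes from the same health records, so it can carry the same blind spots the article describes.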
In a 2021 study in Nature Medicine, researchers described an algorithm they developed to examine racial bias in diagnosing arthritic knee pain. Historically, Black and low-income patients have been significantly less likely to be recommended for surgery, even though they often report much higher levels of pain than white patients. Doctors would attribute this phenomenon to psychological factors like stress or social isolation, rather than to physiological causes. So instead of relying on radiologists’ diagnoses to predict the severity of a patient’s knee pain, researchers trained the AI with a data set that included knee X-rays and patients’ descriptions of their own discomfort.
Not only did the AI predict who felt pain more accurately than the doctors did, it also showed that Black patients’ pain wasn’t psychosomatic. Rather, the AI revealed the problem lay with what radiologists think diseased knees should look like. Because our understanding of arthritis is rooted in research conducted almost exclusively on a white population, doctors may not recognize features of diseased knees that are more prevalent in Black patients.
It’s much harder to design AI systems, like the knee pain algorithm, that can correct or check physicians’ biases, as opposed to simply mimicking them — and it will require much more oversight and testing than currently exists. But Obermeyer notes that, in some ways, fixing the bias in AI can happen much faster than fixing the biases in our systems — and in ourselves — that helped create these problems in the first place.
And building AIs that account for bias could be a promising step in addressing larger systemic issues. To change how a machine operates, after all, you just need a few keystrokes; changing how people think takes much more than that.
An early prototype of Watson, seen here in 2011, was originally the size of a master bedroom. (Credit: Clockready/Wikimedia Commons)
IBM’s Failed Revolution
In 2011, IBM’s Watson computer annihilated its human competitors on the trivia show Jeopardy! Ken Jennings, the show’s all-time highest-earning player, lost by over $50,000. “I for one welcome our new computer overlords,” he wrote on his answer card during the final round.
But Watson’s reign was short-lived. One of the earliest — and most high-profile — attempts to use artificial intelligence in health care, Watson is now one of medical AI’s biggest failures. IBM spent billions building a vast repository of patient information, insurance claims and medical images. Watson Health could (allegedly) plunder this database to suggest new treatments, match patients to clinical trials and discover new drugs.
Despite Watson’s impressive database, and all of IBM’s bluster, doctors complained that it rarely made useful recommendations. The AI didn’t account for regional differences in patient populations, access to care or treatment protocols. For example, because its cancer data came exclusively from one hospital, Watson for Oncology simply reflected the preferences and biases of the physicians who practiced there.
In January 2022, IBM finally dismantled Watson, selling its most valuable data and analytics to the investment firm Francisco Partners. That downfall hasn’t dissuaded other data giants like Google and Amazon from hyping their own AIs, promising systems that can do everything from transcribing notes to predicting kidney failure. For big tech companies experimenting with medical AI, the machine-powered doctor is still very much “in.”