For years, many artificial intelligence enthusiasts and researchers have promised that machine learning will change modern medicine. Thousands of algorithms have been developed to diagnose conditions like cancer, heart disease and psychiatric disorders. Now, algorithms are being trained to detect COVID-19 by recognizing patterns in CT scans and X-ray images of the lungs.
Many of these models aim to predict which patients will have the most severe outcomes and who will need a ventilator. The excitement is palpable; if these models are accurate, they could offer doctors a huge leg up in testing and treating patients with the coronavirus.
But the promise of AI-aided medicine for treating real COVID-19 patients still appears far off. A group of statisticians around the world is concerned about the quality of the vast majority of machine learning models and the harm they may cause if hospitals adopt them any time soon.
“[It] scares a lot of us because we know that models can be used to make medical decisions,” says Maarten van Smeden, a medical statistician at the University Medical Center Utrecht in the Netherlands. “If the model is bad, they can make the medical decision worse. So they can actually harm patients.”
Van Smeden is co-leading a project with a large team of international researchers to evaluate COVID-19 models using standardized criteria. The project is the first-ever living review at The BMJ, meaning their team of 40 reviewers (and growing) is actively updating their review as new models are released.
So far, their assessment of COVID-19 machine learning models isn’t good: The models suffer from a serious lack of data and of necessary expertise from a wide array of research fields. But the issues facing new COVID-19 algorithms aren’t new at all: AI models in medical research have been deeply flawed for years, and statisticians such as van Smeden have been trying to sound the alarm to turn the tide.
Before the COVID-19 pandemic, Frank Harrell, a biostatistician at Vanderbilt University, was traveling around the country to give talks to medical researchers about the widespread issues with current medical AI models. He often borrows a line from a famous economist to describe the problem: Medical researchers are using machine learning to “torture their data until it spits out a confession.”
And the numbers support Harrell’s claim, revealing that the vast majority of medical algorithms barely meet basic quality standards. In October 2019, a team of researchers led by Xiaoxuan Liu and Alastair Denniston at the University of Birmingham in England published the first systematic review aimed at answering the trendy yet elusive question: Can machines be as good, or even better, at diagnosing patients than human doctors? They concluded that the majority of machine learning algorithms are on par with human doctors when detecting diseases from medical imaging. Yet there was another more robust and shocking finding — of 20,530 total studies on disease-detecting algorithms published since 2012, fewer than 1 percent were methodologically rigorous enough to be included in their analysis.
The researchers believe the dismal quality of the vast majority of AI studies is directly related to the recent overhype of AI in medicine. Scientists increasingly want to add AI to their studies, and journals want to publish studies using AI more than ever before. “The quality of studies that are getting through to publication is not good compared to what we would expect if it didn’t have AI in the title,” Denniston says.
And the main quality issues with previous algorithms are showing up in the COVID-19 models, too. As the number of COVID-19 machine learning algorithms rapidly increases, they’re quickly becoming a microcosm of all the problems that already existed in the field.
Just like their predecessors, the flaws of the new COVID-19 models start with a lack of transparency. Statisticians are having a hard time simply trying to figure out what the researchers of a given COVID-19 AI study actually did, since the information often isn’t documented in their publications. “They’re so poorly reported that I do not fully understand what these models have as input, let alone what they give as an output,” van Smeden says. “It’s horrible.”
Because of the lack of documentation, van Smeden’s team is unsure where the data used to build each model came from in the first place, making it difficult to assess whether the model is making accurate diagnoses or predictions about the severity of the disease. That also makes it unclear whether the model will churn out accurate results when it’s applied to new patients.
Another common problem is that training machine learning algorithms requires massive amounts of data, but van Smeden says the models his team has reviewed use very little. He explains that complex models can have millions of variables, and this means datasets with thousands of patients are necessary to build an accurate model of diagnosis or disease progression. But van Smeden says current models don’t even come close to this ballpark; most rely on datasets of only hundreds of patients.
Those small datasets aren’t caused by a shortage of COVID-19 cases around the world, though. Instead, a lack of collaboration between researchers leads individual teams to rely on their own small datasets, van Smeden says. This also indicates that researchers across a variety of fields are not working together — creating a sizable roadblock in researchers’ ability to develop and fine-tune models that have a real shot at enhancing clinical care. As van Smeden notes, “You need the expertise not only of the modeler, but you need statisticians, epidemiologists [and] clinicians to work together to make something that is actually useful.” Finally, van Smeden points out that AI researchers need to balance quality with speed at all times — even during a pandemic. Speedy models that are bad models end up being time wasted, after all.
“We don’t want to be the statistical police,” he says. “We do want to find the good models. If there are good models, I think they might be of great help.”