How a largely untested AI algorithm crept into hundreds of hospitals

Jan 8, 2022

During the pandemic, the electronic health record giant Epic quickly rolled out an algorithm to help doctors decide which patients needed the most immediate care. Doctors believe it will change how they practice.

Last spring, physicians like us were confused. COVID-19 was just starting its deadly journey around the world, afflicting our patients with severe lung infections, strokes, skin rashes, debilitating fatigue, and numerous other acute and chronic symptoms. Armed with outdated clinical intuitions, we were left disoriented by a disease shrouded in ambiguity.

In the midst of the uncertainty, Epic, a private electronic health record giant and a key purveyor of American health data, accelerated the deployment of a clinical prediction tool called the Deterioration Index. Built with a type of artificial intelligence called machine learning and in use at some hospitals prior to the pandemic, the index is designed to help physicians decide when to move a patient into or out of intensive care, and is influenced by factors like breathing rate and blood potassium level. Epic had been tinkering with the index for years but expanded its use during the pandemic. At hundreds of hospitals, including those in which we both work, a Deterioration Index score is prominently displayed on the chart of every patient admitted to the hospital.

The Deterioration Index is poised to upend a key cultural practice in medicine: triage. Loosely speaking, triage is an act of determining how sick a patient is at any given moment to prioritize treatment and limited resources. In the past, physicians have performed this task by rapidly interpreting a patient’s vital signs, physical exam findings, test results, and other data points, using heuristics learned through years of on-the-job medical training.

Ostensibly, the core assumption of the Deterioration Index is that traditional triage can be augmented, or perhaps replaced entirely, by machine learning and big data. Indeed, a study of 392 COVID-19 patients admitted to Michigan Medicine found that the index was moderately successful at discriminating between low-risk patients and those who were at high risk of being transferred to an ICU, getting placed on a ventilator, or dying while admitted to the hospital. But last year’s hurried rollout of the Deterioration Index also sets a worrisome precedent, and it illustrates the potential for such decision-support tools to propagate biases in medicine and change the ways in which doctors think about their patients.
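The “discrimination” such studies measure is usually summarized as the area under the ROC curve: the probability that a randomly chosen deteriorating patient gets a higher score than a randomly chosen stable one. A minimal sketch of that calculation, using invented scores rather than anything from Epic’s model or the Michigan data:

```python
# AUROC computed directly from its pairwise definition: the fraction of
# (deteriorated, stable) patient pairs the score ranks correctly.
def auroc(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count half
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical index scores, purely for illustration
deteriorated = [82, 74, 55]  # patients who were later transferred to an ICU
stable       = [30, 61, 21]  # patients who remained stable

print(auroc(deteriorated, stable))  # 8 of 9 pairs ranked correctly ≈ 0.889
```

An AUROC of 0.5 is chance and 1.0 is perfect ranking; “moderately successful” discrimination sits somewhere in between.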

The use of algorithms to support clinical decision-making isn’t new. But historically, these tools have been put into use only after a rigorous peer review of the raw data and statistical analyses used to develop them. Epic’s Deterioration Index, on the other hand, remains proprietary despite its widespread deployment. Although physicians are provided with a list of the variables used to calculate the index and a rough estimate of each variable’s impact on the score, we aren’t allowed under the hood to evaluate the raw data and calculations.

Furthermore, the Deterioration Index was not independently validated or peer-reviewed before the tool was rapidly deployed to America’s largest healthcare systems. Even now, there have been, to our knowledge, only two peer-reviewed published studies of the index. The deployment of a largely untested proprietary algorithm into clinical practice—with minimal understanding of the potential unintended consequences for patients or clinicians—raises a host of issues.

It remains unclear, for instance, what biases may be encoded into the index. Medicine already has a fraught history with race and gender disparities and biases. Studies have shown that, among other injustices, physicians underestimate the pain of minority patients and are less likely to refer women to total knee replacement surgery when it is warranted. Some clinical scores, including calculations commonly used to assess kidney and lung function, have traditionally been adjusted based on a patient’s race—a practice that many in the medical community now oppose. Without direct access to the equations underlying Epic’s Deterioration Index, or further external inquiry, it is impossible to know whether the index incorporates such race-adjusted scores in its own algorithm, potentially propagating biases.

Introducing machine learning into the triage process could fundamentally alter the way we teach medicine. It has the potential to improve inpatient care by highlighting new links between clinical data and outcomes—links that might otherwise have gone unnoticed. But it could also over-sensitize young physicians to the specific tests and health factors that the algorithm deems important; it could compromise trainees’ ability to hone their own clinical intuition. In essence, physicians in training would be learning medicine on Epic’s terms.

Thankfully, there are safeguards that can be put in place relatively painlessly. In 2015, the international EQUATOR Network created a 22-point TRIPOD checklist to guide the responsible development, validation, and improvement of clinical prediction tools like the Deterioration Index. For example, it asks tool developers to provide details on how risk groups were created, report performance measures with confidence intervals, and discuss limitations of validation studies. Private health data brokers like Epic should always be held to this standard.
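The checklist item on reporting performance measures with confidence intervals requires nothing exotic; a bootstrap is often enough. A sketch using entirely made-up predictions, not anything from Epic’s validation:

```python
import random

# Bootstrap a 95% confidence interval for sensitivity (true-positive rate).
# Each pair is (patient truly deteriorated?, model flagged them?); the data
# below are invented purely for illustration.
def sensitivity(pairs):
    tp = sum(1 for truth, pred in pairs if truth and pred)
    positives = sum(1 for truth, _ in pairs if truth)
    return tp / positives

def bootstrap_ci(pairs, stat, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    # resample the cohort with replacement and recompute the statistic
    stats = sorted(stat(rng.choices(pairs, k=len(pairs))) for _ in range(n_boot))
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

pairs = [(True, True)] * 30 + [(True, False)] * 20 + [(False, False)] * 50
point = sensitivity(pairs)  # 0.6 on this invented sample
lo, hi = bootstrap_ci(pairs, sensitivity)
print(f"sensitivity {point:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Reporting the interval, not just the point estimate, is exactly the kind of disclosure the checklist asks vendors to make.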

Now that its Deterioration Index is already being used in clinical settings, Epic should immediately release for peer review the underlying equations and the anonymized data sets it used for its internal validation so that doctors and health services researchers can better understand any potential implications they may have for health equity. There need to be clear communication channels to raise, discuss, and resolve any issues that emerge in peer review, including concerns about the score’s validity, prognostic value, bias, or unintended consequences. Companies like Epic should also engage more deliberately and openly with the physicians who use their algorithms; they should share information about the populations on which the algorithms were trained, the questions the algorithms are best equipped to answer, and the flaws the algorithms may carry. Caveats and warnings should be communicated clearly and quickly to all clinicians who use the indices.

The COVID-19 pandemic, having accelerated the widespread deployment of clinical prediction tools like the Deterioration Index, may herald a new coexistence between physicians and machines in the art of medicine. Now is the time to set the ground rules to ensure that this partnership helps us change medicine for the better, and not the worse.


Dr. Vishal Khetpal is a resident physician training in the Brown University Internal Medicine Program.

Dr. Nishant R. Shah is an assistant professor of medicine at the Alpert Medical School of Brown University and an assistant professor of health services, practice, and policy at the Brown University School of Public Health.

By Vishal Khetpal and Nishant Shah


CONTINUED:

How did a proprietary AI get into hundreds of hospitals - without extensive peer reviews?

The concerning story of Epic's Deterioration Index

How is it possible for proprietary AI models to enter patient care, without extensive peer reviews for algorithmic transparency? That's a question we should be asking about Epic's Deterioration Index, which has been utilized for several use cases, including COVID-19 patient risk models.

The conversation about machine learning development largely centers on how individual organizations proceed - and whether they use adequate data, methods, algorithms, transparency, and a process that guarantees models do not go into production until they are tested and vetted.

At the other end of the spectrum are AI models developed by Google and Facebook that are completely opaque, which is a different problem. But a third level between the two doesn't get the scrutiny it should - custom models developed by a third party, and given as an "enhancement" to existing licensees.

Epic Systems Corporation, or Epic, is a privately held healthcare software company. According to the company, hospitals that use its software held medical records of 54% of patients in the United States and 2.5% of patients worldwide in 2015. In terms of market share, as per Wikipedia, Epic holds 31% of the US EHR (Electronic Health Records) share for acute care hospitals, ahead of all others, including Cerner at 25% (as of May 2020). More than 250 million patients have electronic records in Epic.

Epic celebrated its 40th anniversary in March 2019. It had around 10,000 employees globally and generated about $2.9 billion in annual revenue. The company reports that 40 percent of its operating expenses are invested in research and development.

Epic founder and CEO Judy Faulkner unveiled a new data research initiative and software during the Epic User Group annual meeting on Aug. 27, 2020. She highlighted the Cosmos program, which is designed to mine data from millions of patient medical records to improve research into treatments. The program gathers de-identified patient data from 8 million patients at nine health systems, and 31 more organizations have signed on to participate. The company also announced new products focused on letting physicians write shorter notes, as well as voice recognition software.

What has happened with Cosmos since then is something of a mystery. We have been unable to find any 2021 mentions of Cosmos, other than this: "Epic will also feature updates at the conference from its Epic Health Research Network, which publishes insights from its HIPAA-limited Cosmos data set from more than 113 million patients." The growth from 8 million to 113 million patients is mentioned with no explanation.

Epic embarked on a program of AI prediction models as early as 2014; one of the first was called the Epic Sepsis Model. Another is the Epic Deterioration Index, and its usefulness is debatable. The predictive value of the index it produces is most significant for those who are barely sick and those who are deathly ill. That means, "This person isn't sick, or this person is circling the drain." This raises the question, "How useful is this?" Surely any clinician can apply the valuable heuristics of triage to decide the next steps for those who are not ill and for those who are not likely to survive. The vast majority of decisions would seem to involve all of those in the middle. Via Fast Company:

Loosely speaking, triage is an act of determining how sick a patient is at any given moment to prioritize treatment and limited resources... Historically, these tools have been put into use only after a rigorous peer review of the raw data and statistical analyses used to develop them. Epic's Deterioration Index, on the other hand, remains proprietary despite its widespread deployment... Without direct access to the equations underlying Epic's Deterioration Index, or further external inquiry, it is impossible to know whether the index incorporates such race-adjusted scores in its own algorithm, potentially propagating biases.

More problems were documented via a Michigan-based study: Epic's widely used sepsis prediction model falls short among Michigan Medicine patients. The tool is included as part of Epic's electronic health record platform. According to the company, it calculates and indicates "the probability of a likelihood of sepsis" to help clinicians identify hard-to-spot cases. While some providers have reported success with the tool, as noted, researchers affiliated with the University of Michigan Medical School in Ann Arbor found its output to be "substantially worse" than what was reported by the vendor when applied to a large retrospective sample of more than 27,000 adult Michigan Medicine patients. The researchers highlighted the wider issues surrounding such proprietary models. They wrote in JAMA Internal Medicine:

Our study has important national implications... The increase and growth in deployment of proprietary models have led to an underbelly of confidential, non-peer-reviewed model performance documents that may not accurately reflect real-world model performance.

When AI goes sideways in an e-commerce context, and the online retailer sends you two left-foot shoes, it isn't the end of the world. After all, technology is never perfect. But when the biggest EHR (Electronic Health Records) provider, Epic, provides Machine Learning algorithms to predict care in a clinical setting, errors are a lot less tolerable, especially when the algorithms are not disclosed, nor is the model's development data explained.

I hate to come back to that old canard about ethics, but when a doctor using the tool admits that "Nobody has amassed the numbers to do a statistically valid" test, it's more than troubling. As STAT put it in AI used to predict COVID-19 patients' decline before proven to work:

'Nobody has amassed the numbers to do a statistically valid' test of the AI, said Mark Pierce, a physician and chief medical informatics officer at Parkview Health, a nine-hospital health system in Indiana and Ohio that is using Epic's tool. 'But in times like this that are unprecedented in U.S. health care, you do the best you can with the numbers you have and err on the side of patient care.'

It's even more troubling when you take into account that Epic PAYS hospitals as much as $1 million to use the tool. I have not been able to determine the business case for this (The Verge: Health record company pays hospitals that use its algorithms).

A detailed study of the efficacy of Epic's Deterioration Index (EDI) for identifying at-risk COVID-19 patients concluded:

We found that EDI identifies small subsets of high- and low-risk patients with COVID-19 with sound discrimination. However, its clinical use as an early warning system is limited by low sensitivity. Studies of Epic's Deterioration Index for COVID-19 have been primarily negative. These findings highlight the importance of independent evaluation of proprietary models before widespread operational use among patients with COVID-19.

The central concern surrounding this practice is its opacity. It is a proprietary system: what data were used, what data preparation methods were applied, and what algorithms were chosen are not known. But for many, the glaring ethical problem is this: why does Epic pay clients to use it?

At HIMSS21 Digital, John Halamka, the president of Mayo Clinic Platform, spoke on the four significant challenges to AI adoption in healthcare. He said "augmentation of human decision making is going to be greatly beneficial" - but some hurdles need to be overcome first.

However, one key issue that must be solved first is ensuring equity and combating bias that can be "baked in" to AI: "The AI algorithms are only as good as the underlying data. And yet, we don't publish statistics describing how these algorithms are developed." The solution, he said, is greater transparency - spelling out and sharing via technology the ethnicity, race, gender, education, income, and other details that go into an algorithm.
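The transparency Halamka describes can start with something mundane: publishing the demographic composition of the cohort a model was trained on. A sketch over an invented toy cohort (the fields and values here are illustrative, not Epic's):

```python
from collections import Counter

# Summarize the demographic mix of a (made-up) training cohort so outside
# reviewers can see who the model was actually fit on.
cohort = [
    {"race": "White", "sex": "F"}, {"race": "White", "sex": "M"},
    {"race": "Black", "sex": "F"}, {"race": "White", "sex": "M"},
    {"race": "Asian", "sex": "F"}, {"race": "White", "sex": "F"},
]

def breakdown(cohort, field):
    counts = Counter(patient[field] for patient in cohort)
    total = len(cohort)
    return {value: count / total for value, count in counts.items()}

print(breakdown(cohort, "race"))  # e.g. White patients are 4/6 of this toy cohort
```

Tables like this, released alongside a model, are the minimum needed to judge whether its training population resembles the patients it will score.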

Halamka points to what he calls the four grand challenges to AI adoption in healthcare:

  1. Gathering valuable novel data - such as GPS information from phones and other devices people carry as well as wearable technology - and incorporating it into algorithms.
  2. Creating discovery at an institutional level so that everyone - including those without AI experience - feels empowered and engaged in algorithm development.
  3. Validating an algorithm to ensure, across organizations and geographies, that it's fit for purpose as well as labeled appropriately as a product and for being described in academic literature.
  4. Workflow and delivery - getting information and advice to physicians instantly while they're in front of patients.

My take

Lots of opinions, but here is my take:

  1. We have no idea how the Epic Deterioration Index model was built. For instance, in the US, healthcare data is riddled with biased observations.
  2. No description of the methodology behind Epic's Deterioration Index has been published, and no description of de-biasing, if any, is given. Biases in medicine, inside and outside of AI systems, are prevalent. As per The Washington Post, it is well-documented that half of white medical school students and residents still believe that African Americans have a higher threshold for pain (and you don't have to be a genius to figure out the origin of that belief), alongside biases about gender, race, age, and more. These biases leak into medical records, and no program should be taken at face value unless its developers can show that such biases have been rectified.
  3. EHR data is also full of doctors' unedited free-text narratives.
  4. There is no viable peer-reviewed assessment of the Deterioration Index model that I know of. We have no idea if the training data represents a fair cross-section of the population.
  5. In addition, for something as complicated as assessing COVID-19 patients in real time, Epic seems to have rushed this to market, delivering it just six months after the "first wave." One has to presume that the etiology of COVID-19 was only beginning to be understood when this was rolled out.

The fact that Epic pays hospitals to adopt it also needs a detailed explanation. ("Verona, Wis.-based EHR giant Epic gives financial incentives to hospitals and health systems that use its artificial intelligence algorithms, which can provide false predictions," according to a July 26 STAT News investigation).

As per The Verge, Epic provided the following explanation for these financial incentives:

'Epic's Honor Roll is a voluntary program which encourages the adoption of features that help save lives, increase data exchange, and improve satisfaction for patients, physicians, and health systems,' Epic said in a statement to Stat News.

I must have missed the part where they justify paying hospitals to use their model. The byzantine healthcare system in the US doesn't always seem logical, but what we know for sure is that incentives drive the system.

Whatever the explanation, Epic's Deterioration Index is now widely used. Fast Company already raised a similar question: How did a largely untested AI creep into hundreds of hospitals? As the authors wrote:

Even now, there have been, to our knowledge, only two peer-reviewed published studies of the index. The deployment of a largely untested proprietary algorithm into clinical practice—with minimal understanding of the potential unintended consequences for patients or clinicians—raises a host of issues.

It remains unclear, for instance, what biases may be encoded into the index. Medicine already has a fraught history with race and gender disparities and biases... Some clinical scores, including calculations commonly used to assess kidney and lung function, have traditionally been adjusted based on a patient's race—a practice that many in the medical community now oppose. Without direct access to the equations underlying Epic's Deterioration Index, or further external inquiry, it is impossible to know whether the index incorporates such race-adjusted scores in its own algorithm, potentially propagating biases.

What I find distressing is that Epic would develop a model, precise or not, that reduces a human being's course of treatment for a potentially deadly disease to a simple 1-10 index. I understand that clinicians may find this a helpful tool, but I'm an advocate for the patients. They didn't ask for this disease; clinicians chose this profession. They shouldn't be looking for shortcuts or cookie-cutter treatment.

By Neil Raden

CONTINUED:

Hundreds of AI tools have been built to catch covid. None of them helped.

Some have been used in hospitals, despite not being properly tested. But the pandemic could help make medical AI better.

When covid-19 struck Europe in March 2020, hospitals were plunged into a health crisis that was still badly understood. “Doctors really didn’t have a clue how to manage these patients,” says Laure Wynants, an epidemiologist at Maastricht University in the Netherlands, who studies predictive tools.

But there was data coming out of China, which had a four-month head start in the race to beat the pandemic. If machine-learning algorithms could be trained on that data to help doctors understand what they were seeing and make decisions, it just might save lives. “I thought, ‘If there’s any time that AI could prove its usefulness, it’s now,’” says Wynants. “I had my hopes up.”

It never happened—but not for lack of effort. Research teams around the world stepped up to help. The AI community, in particular, rushed to develop software that many believed would allow hospitals to diagnose or triage patients faster, bringing much-needed support to the front lines—in theory.

In the end, many hundreds of predictive tools were developed. None of them made a real difference, and some were potentially harmful.

That’s the damning conclusion of multiple studies published in the last few months. In June, the Turing Institute, the UK’s national center for data science and AI, put out a report summing up discussions at a series of workshops it held in late 2020. The clear consensus was that AI tools had made little, if any, impact in the fight against covid.

Not fit for clinical use

This echoes the results of two major studies that assessed hundreds of predictive tools developed last year. Wynants is lead author of one of them, a review in the British Medical Journal that is still being updated as new tools are released and existing ones tested. She and her colleagues have looked at 232 algorithms for diagnosing patients or predicting how sick those with the disease might get. They found that none of them were fit for clinical use. Just two have been singled out as being promising enough for future testing.

“It’s shocking,” says Wynants. “I went into it with some worries, but this exceeded my fears.”

Wynants’s study is backed up by another large review carried out by Derek Driggs, a machine-learning researcher at the University of Cambridge, and his colleagues, and published in Nature Machine Intelligence. This team zoomed in on deep-learning models for diagnosing covid and predicting patient risk from medical images, such as chest x-rays and chest computer tomography (CT) scans. They looked at 415 published tools and, like Wynants and her colleagues, concluded that none were fit for clinical use.

“This pandemic was a big test for AI and medicine,” says Driggs, who is himself working on a machine-learning tool to help doctors during the pandemic. “It would have gone a long way to getting the public on our side,” he says. “But I don’t think we passed that test.”

Both teams found that researchers repeated the same basic errors in the way they trained or tested their tools. Incorrect assumptions about the data often meant that the trained models did not work as claimed.

Wynants and Driggs still believe AI has the potential to help. But they are concerned that such tools could be harmful if built in the wrong way, because they could miss diagnoses or underestimate risk for vulnerable patients. “There is a lot of hype about machine-learning models and what they can do today,” says Driggs.

Unrealistic expectations encourage the use of these tools before they are ready. Wynants and Driggs both say that a few of the algorithms they looked at have already been used in hospitals, and some are being marketed by private developers. “I fear that they may have harmed patients,” says Wynants.

So what went wrong? And how do we bridge that gap? If there’s an upside, it is that the pandemic has made it clear to many researchers that the way AI tools are built needs to change. “The pandemic has put problems in the spotlight that we’ve been dragging along for some time,” says Wynants.

What went wrong

Many of the problems that were uncovered are linked to the poor quality of the data that researchers used to develop their tools. Information about covid patients, including medical scans, was collected and shared in the middle of a global pandemic, often by the doctors struggling to treat those patients. Researchers wanted to help quickly, and these were the only public data sets available. But this meant that many tools were built using mislabeled data or data from unknown sources.

Driggs highlights the problem of what he calls Frankenstein data sets, which are spliced together from multiple sources and can contain duplicates. This means that some tools end up being tested on the same data they were trained on, making them appear more accurate than they are.
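The leakage Driggs describes has a simple defensive fix: remove duplicates before splitting into training and test sets, so the same scan can never appear on both sides. A minimal sketch, assuming a hypothetical record format with raw image bytes (exact-duplicate detection only; near-duplicates need perceptual hashing):

```python
import hashlib

def dedup_then_split(records, test_frac=0.2):
    """Drop duplicate scans *before* splitting, so no image can
    end up in both the training and the test set."""
    seen, unique = set(), []
    for rec in records:
        # Hash the raw image bytes to catch exact duplicates that
        # entered a "Frankenstein" data set from multiple sources.
        key = hashlib.sha256(rec["image_bytes"]).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    cut = int(len(unique) * (1 - test_frac))
    return unique[:cut], unique[cut:]
```

Testing on data the model was trained on inflates accuracy for exactly the reason the reviews flag: the model has already memorized those examples.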

It also muddies the origin of certain data sets. This can mean that researchers miss important features that skew the training of their models. Many unwittingly used a data set that contained chest scans of children who did not have covid as their examples of what non-covid cases looked like. As a result, the AIs learned to identify kids, not covid.

Driggs’s group trained its own model using a data set that contained a mix of scans taken when patients were lying down and standing up. Because patients scanned while lying down were more likely to be seriously ill, the AI learned wrongly to predict serious covid risk from a person’s position.

In yet other cases, some AIs were found to be picking up on the text font that certain hospitals used to label the scans. As a result, fonts from hospitals with more serious caseloads became predictors of covid risk.

Errors like these seem obvious in hindsight. They can also be fixed by adjusting the models, if researchers are aware of them. It is possible to acknowledge the shortcomings and release a less accurate, but less misleading model. But many tools were developed either by AI researchers who lacked the medical expertise to spot flaws in the data or by medical researchers who lacked the mathematical skills to compensate for those flaws.

A more subtle problem Driggs highlights is incorporation bias, or bias introduced at the point a data set is labeled. For example, many medical scans were labeled according to whether the radiologists who created them said they showed covid. But that embeds, or incorporates, any biases of that particular doctor into the ground truth of a data set. It would be much better to label a medical scan with the result of a PCR test rather than one doctor’s opinion, says Driggs. But there isn’t always time for statistical niceties in busy hospitals.
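Driggs’s preference reduces to a simple labeling rule: use the objective test result as ground truth whenever it exists, and fall back on the radiologist’s read only when it doesn’t. A minimal sketch, assuming hypothetical record fields (`pcr_result`, `radiologist_covid_read`):

```python
def choose_label(record):
    """Prefer an objective PCR result over a single radiologist's
    opinion when building ground-truth labels."""
    if record.get("pcr_result") is not None:
        return record["pcr_result"]          # objective ground truth
    # Fallback: a human read, which embeds that reader's biases
    # into the "ground truth" (incorporation bias).
    return record["radiologist_covid_read"]
```

The point is not the code but the audit it implies: a data set labeled entirely by clinician opinion inherits those clinicians’ error patterns, and no amount of model tuning can remove them afterward.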

That hasn’t stopped some of these tools from being rushed into clinical practice. Wynants says it isn’t clear which ones are being used or how. Hospitals will sometimes say that they are using a tool only for research purposes, which makes it hard to assess how much doctors are relying on them. “There’s a lot of secrecy,” she says.

Wynants asked one company that was marketing deep-learning algorithms to share information about its approach but did not hear back. She later found several published models from researchers tied to this company, all of them with a high risk of bias. “We don’t actually know what the company implemented,” she says.

According to Wynants, some hospitals are even signing nondisclosure agreements with medical AI vendors. When she asked doctors what algorithms or software they were using, they sometimes told her they weren’t allowed to say.

How to fix it

What’s the fix? Better data would help, but in times of crisis that’s a big ask. It’s more important to make the most of the data sets we have. The simplest move would be for AI teams to collaborate more with clinicians, says Driggs. Researchers also need to share their models and disclose how they were trained so that others can test them and build on them. “Those are two things we could do today,” he says. “And they would solve maybe 50% of the issues that we identified.”

Getting hold of data would also be easier if formats were standardized, says Bilal Mateen, a doctor who leads the clinical technology team at the Wellcome Trust, a global health research charity based in London.

Another problem Wynants, Driggs, and Mateen all identify is that most researchers rushed to develop their own models, rather than working together or improving existing ones. The result was that the collective effort of researchers around the world produced hundreds of mediocre tools, rather than a handful of properly trained and tested ones.

“The models are so similar—they almost all use the same techniques with minor tweaks, the same inputs—and they all make the same mistakes,” says Wynants. “If all these people making new models instead tested models that were already available, maybe we’d have something that could really help in the clinic by now.”

In a sense, this is an old problem with research. Academic researchers have few career incentives to share work or validate existing results. There’s no reward for pushing through the last mile that takes tech from “lab bench to bedside,” says Mateen.

To address this issue, the World Health Organization is considering an emergency data-sharing contract that would kick in during international health crises. It would let researchers move data across borders more easily, says Mateen. Before the G7 summit in the UK in June, leading scientific groups from participating nations also called for “data readiness” in preparation for future health emergencies.

Such initiatives sound a little vague, and calls for change always have a whiff of wishful thinking about them. But Mateen has what he calls a “naïvely optimistic” view. Before the pandemic, momentum for such initiatives had stalled. “It felt like it was too high of a mountain to hike and the view wasn’t worth it,” he says. “Covid has put a lot of this back on the agenda.”

“Until we buy into the idea that we need to sort out the unsexy problems before the sexy ones, we’re doomed to repeat the same mistakes,” says Mateen. “It’s unacceptable if it doesn’t happen. To forget the lessons of this pandemic is disrespectful to those who passed away.”

By Will Douglas Heaven

CONTINUED:

Artificial intelligence for COVID-19: saviour or saboteur?

As 2020 draws to a close, one thing is certain: the COVID-19 pandemic has had an irreversible effect on the world. The effect on digital health is no exception.

The pandemic has forced health-care providers and governments around the world to accelerate the development of artificial intelligence (AI) tools and scale up their use in medicine, even before they are proven to work. An untested AI algorithm has even received emergency authorisation from the US Food and Drug Administration. But will the use of untested AI systems help or hinder patients with COVID-19?

The lax regulatory landscape for COVID-19 AI algorithms has raised substantial concern among medical researchers. A living systematic review published in the BMJ highlights that COVID-19 AI models are poorly reported and trained on small or low quality datasets with high risk of bias. Gary Collins, Professor of Medical Statistics at the University of Oxford and co-author of the BMJ review told The Lancet Digital Health, “full and transparent reporting of all key details of the development and evaluation of prediction models for COVID-19 is vital. Failure to report important details not only contributes to research waste, but more importantly can lead to a poorly developed and evaluated model being used that could cause more harm than benefit in clinical decision making.”

To support transparent and reproducible reporting, source code and deidentified patient datasets for COVID-19 AI algorithms should be open and accessible to the research community. One such study, published in The Lancet Digital Health, reports a new AI COVID-19 screening test, named CURIAL AI, which uses routinely collected clinical data for patients presenting to hospital. In the hope that AI can help keep patients and health-care workers safe, Andrew Soltan and colleagues state that the AI test could allow exclusion of patients who do not have COVID-19 and ensure that patients with COVID-19 receive treatment rapidly. This is one of the largest AI studies to date, with clinical data from more than a hundred thousand cases in the UK. Prospective validation of the AI screening test showed accurate and faster results compared with gold-standard PCR tests.

However, like other COVID-19 AI models, CURIAL AI requires validation across geographically and ethnically diverse populations to assess its real-world performance. Soltan emphasised that “We also do not yet know if the AI model would generalise to patient cohorts in different countries, where patients may come to hospital with a different spectrum of medical problems.”

Even if preliminary models, like CURIAL AI, are proven to accurately diagnose disease in a wide range of populations, do they add clinical value to health-care systems? Last month, X, the Alphabet subsidiary, announced that although it was able to develop an AI to identify features of electroencephalography data that might be useful for diagnosing depression and anxiety, experts were not convinced of the clinical value of the diagnostic aid. How AI tools for diagnosing health conditions can improve medical care is not always well understood by those developing the AI. Therefore, COVID-19 AI models must be developed in close collaboration with health-care workers, to understand how the output of these models could be applied in patient care.

As we enter flu season, AI tools like CURIAL AI face an increasingly challenging task: helping clinicians differentiate between two respiratory infections with similar symptoms. If AI tools cannot be proven to discern one pneumonia from another, premature use of these technologies could increase misdiagnosis and sabotage clinical care for patients. Mistakes like this, if allowed to scale, will slow future use of potentially life-saving technologies and compromise clinician and patient trust in AI. To assess the true accuracy of AI tools for COVID-19, clinical trials are essential to establish how AI can support COVID-19 patients in the real world.

Soltan and colleagues are now planning clinical trials to deploy CURIAL AI within the existing clinical pathways at hospitals in the UK. The Lancet Digital Health strongly encourages researchers doing AI intervention-based clinical trials to follow the new extension guidelines SPIRIT-AI and CONSORT-AI. In our previous Editorial, we described the importance of these guidelines to support accurate and transparent evaluation of AI. AI could be the saviour of the COVID-19 pandemic in the coming year; we just need to prove it.

By The Lancet Digital Health

CONTINUED:

Epic's AI Fail; Controversial Alzheimer's Marketing; No Good COVID Drugs

This past week in healthcare investigations

Several of Epic's AI Algorithms Deliver Inaccurate Info to Hospitals

Predictive tools developed by electronic health record giant Epic Systems are meant to help providers deliver better patient care. However, several of the company's artificial intelligence (AI) algorithms are delivering inaccurate information to hospitals when it comes to seriously ill patients, a STAT investigation revealed.

Employees at several major health systems told STAT that they were "particularly concerned" about Epic's algorithm for predicting sepsis. The algorithm "routinely fails to identify the condition in advance, and triggers frequent false alarms," the outlet reported.

Epic's algorithm for predicting sepsis is one of about 20 developed by the company, STAT reported. Experts said in published research and interviews with STAT that they've also seen issues with a number of other algorithms, including those for predicting patients' length of stay and chances of becoming seriously ill.

"Taken together, their findings paint the picture of a company whose business goals -- and desire to preserve its market dominance -- are clashing with the need for careful, independent review of algorithms before they are used in the care of millions of patients," STAT wrote.

Further, the STAT investigation found that Epic has paid some health systems as much as $1 million in part due to their adoption of predictive algorithms developed by the company or others. "Those payments may create a conflict between duties to deliver the best care to patients and preserve their bottom lines," STAT wrote.

Epic, however, defended its testing and distribution of AI products. The company told STAT its incentives are meant to reward the implementation of technologies that can improve care. Epic further said that some of the published studies that are critical of its sepsis algorithm are not due to underlying problems with its performance, rather differences in the way institutions define the onset of the life-threatening condition.

Biogen's Controversial Alzheimer's Marketing Push

As Biogen and Eisai face criticism around their new Alzheimer's drug aducanumab (Aduhelm), the partners have launched a controversial marketing campaign that targets consumers already worried about slips in memory, Kaiser Health News reported.

Questions on an online "symptoms quiz" include how often someone loses their train of thought or feels more anxious than is typical, KHN reported. They also include questions about how often someone struggles to come up with a word, asks the same questions, or gets lost.

"No matter the answers, however, it directs quiz takers to talk with their doctors about their concerns and whether additional testing is needed," KHN wrote.

While some of the concerns can be valid, "this clearly does overly medicalize very common events that most adults experience in the course of daily life: Who hasn't lost one's train of thought or the thread of a conversation, book, or movie? Who hasn't had trouble finding the right word for something?" Jerry Avorn, MD, of Harvard Medical School, told KHN.

The marketing campaign has been called misleading by some, KHN reported. And it comes as critics of the drug point to a lack of definitive evidence that it slows the progression of Alzheimer's disease, at a staggering cost of $56,000 per year.

Still No Good Drugs for COVID-19

Well over a year into the pandemic -- and with vaccines taking center stage -- effective and easy-to-use drugs to treat COVID-19 remain elusive, the Wall Street Journal reported.

In the U.S., 10 drugs have been cleared or recommended for use, WSJ reported. However, two of those drugs had their authorizations rescinded, and a third had its shipments paused due to ineffectiveness against new variants.

"We're really limited, to be honest," Daniel Griffin, MD, chief of infectious disease at healthcare provider network ProHealth New York, told WSJ. "We do not have any dramatic treatments."

Factors that have contributed to the dearth of drugs to treat COVID have included the federal government's focus on quickly developing vaccines as well as a lack of drug research on coronaviruses, even though there have been previous outbreaks, WSJ reported.

Additionally, "Scattered U.S. clinical trials competed against each other for patients," according to WSJ wrote. And, "When effective yet hard-to-administer drugs were developed, a fragmented American healthcare system struggled to deliver them to patients."

However, the need for drugs to treat COVID remains. That's especially true as vaccination rates in many parts of the U.S. -- and in other parts of the world -- remain low. The continued rise of the highly transmissible Delta variant is also contributing to the need.

By Jennifer Henderson

CONTINUED:

Health systems are using AI to predict severe Covid-19 cases

But limited data could produce unreliable results

As the United States braces for a bleak winter, hospital systems across the country are ramping up their efforts to develop AI systems to predict how likely their Covid-19 patients are to fall severely ill or even die. Yet most of the efforts are being developed in silos and trained on limited datasets, raising crucial questions about their reliability.

A medical staff member treats a Covid-19 patient at the United Memorial Medical Center in Houston.

Dozens of institutions and companies — including Stanford, Mount Sinai, and the electronic health records vendors Epic and Cerner — have been working since the spring on models that are essentially designed to do the same thing: crunch large amounts of patient data and turn out a risk score for a patient’s chances of dying or needing a ventilator.

In the months since launching those efforts, though, transparency about the tools, including the data they’re trained on and their impact on patient care, has been mixed. Some institutions have not published any results showing whether their models work. And among those that have published findings, the research has raised concerns about the generalizability of a given model, especially one that is tested and trained only on local data.

A study published this month in Nature Machine Intelligence revealed that a Covid-19 deterioration model successfully deployed in Wuhan, China, yielded results that were no better than a roll of the dice when applied to a sample of patients in New York.

Several of the datasets also fail to include diverse sets of patients, putting some of the models at high risk of contributing to biased and unequal care for Covid-19, which has already taken a disproportionate toll on Black and Indigenous communities and other communities of color. That risk is clear in an ongoing review published in the BMJ: After analyzing dozens of Covid-19 prediction models designed around the world, the authors concluded that all of them were highly susceptible to bias.

“I don’t want to call it racism, but there are systemic inequalities that are built in,” said Benjamin Glicksberg, head of Mount Sinai’s center for Covid-19 bioinformatics and an assistant professor of genetics at the Icahn School of Medicine. Glicksberg is helping to develop a Covid-19 prediction tool for the health system.

Those shortcomings raise an important question: Do the divided efforts come at the cost of a more comprehensive, accurate model — one that is built with contributions from all of the research groups currently working in isolation on their own algorithms?

There are obstacles, of course, to such a unified approach: It would require spending precious time and money coordinating approaches, as well as coming up with a plan to merge patient data that may be stored, protected, and codified in different ways. Moreover, while the current system isn’t perfect, it could still produce helpful local tools that could later be supplemented with additional research and data, several experts said.

“Sure, maybe if everyone worked together we’d come up with a single best one, but if everyone works on it individually, perhaps we’ll see the best one win,” said Peter Winkelstein, executive director of the Jacobs Institute for Healthcare Informatics at the University at Buffalo and the vice president and chief medical informatics officer of Kaleida Health. Winkelstein is collaborating with Cerner to develop a Covid-19 prediction algorithm on behalf of his health system.

But determining the best algorithm will mean publishing data that includes the models’ performance and impact on care, and so far, that isn’t happening in any uniform fashion.

Many of these efforts were first launched in the spring, as the first surge of coronavirus cases began to overwhelm hospitals, sending clinicians and developers scrambling for solutions to help predict which patients could become the sickest and which were teetering on the edge of death. Almost simultaneously, similar efforts sprang up across dozens of medical institutions to analyze patient data for this purpose.

Yet the institutions’ process of verifying those tools could not be more varied: While some health systems have started publishing research on preprint servers or in peer-reviewed journals as they continue to hone and shape their tools, others have declined to publish while they test and train the models internally, and still others have deployed their tools without first sharing any research.

Take Epic, for instance: The EHR vendor took a tool it had been using to predict critical outcomes in non-Covid patients and repurposed it for use on those with Covid-19 without first sharing any public research on whether or how well the model worked for this purpose. James Hickman, a software developer on Epic’s data science team, told STAT in a statement that the model was initially trained on a large dataset and validated by more than 50 health systems. “To date, we have tested the [tool’s] performance on a combined total of over 29,000 hospital admissions of Covid-19 patients across 29 healthcare organizations,” Hickman said. None of the data has been shared publicly.

Epic offered the model to clinics already using its system, including Stanford University’s health system. But Stanford instead decided to try creating its own algorithm, which it is now testing head-to-head with Epic’s.

“Covid patients do not act like your typical patient,” said Tina Hernandez-Broussard, associate professor of medicine and director of faculty development in biomedical informatics at Stanford. “Because the clinical manifestation is so different, we were interested to see: Can you even use that Epic tool, and how well does it work?”

Other systems are now trying to answer that same question about their own models. At the University at Buffalo, where Winkelstein works, he and his colleagues are collaborating with Cerner to create a deterioration model by testing it in “silent mode” on all admitted patients who are suspected or confirmed to have Covid-19. This means that while the tool and its results will be seen by health care workers, its outcomes won’t be used to make any clinical decisions. They have not yet shared any public-facing studies showing how well the tool works.

“Given Covid-19, where we need to know as much as we can as quickly as possible, we’re jumping in and using what we’ve got,” Winkelstein said.

The biggest challenge with trying to take a unified approach to these tools “is data sharing and interoperability,” said Andrew Roberts, director of data science at Cerner Intelligence who leads the team that is collaborating with Winkelstein. “I don’t think the industry is quite there yet.”

Further south in the heart of New York City, Glicksberg is leading efforts to publish research on Mount Sinai’s prediction model. In November, he and his colleagues published positive but preliminary results in the Journal of Medical Internet Research. That study suggested its tool could pinpoint at-risk patients and identify characteristics linked to that risk, such as age and high blood sugar. Unlike many of the other existing tools, the Mount Sinai algorithm was trained on a diverse pool of patient data drawn from hospitals including those in Brooklyn, Queens, and Manhattan’s Upper East Side.

The idea was to ensure the model works “outside of this little itty bitty hospital you have,” he said.

So far, two Covid-19 prediction models have received clearance from the Food and Drug Administration for use during the pandemic. But some of the models currently being used in clinics haven’t been cleared and don’t need to be greenlit, because they are not technically considered human research and they still require a health care worker to interpret the results.

“I think there’s a dirty little secret which is if you’re using a local model for decision support, you don’t have to go through any regulatory clearance or peer-reviewed research at all,” said Andrew Beam, assistant professor of epidemiology at the Harvard T.H. Chan School of Public Health.

Some of the tools, however, may fall under FDA’s purview. If they are, for example, intended to allow health care workers to see how they reached their conclusions — that is, if they are not “black box” tools — they could meet the FDA’s requirements for regulation as devices.

“Depending on the type of information provided to the health care provider and the type of information analyzed, decision support tools that are intended as Covid-19 deterioration models likely … meet the device definition and are subject to FDA regulatory oversight,” a spokesperson from the agency told STAT in an email.

Some of the institutions developing models that appear to meet the definition of regulatory oversight, though, have not submitted an application for FDA clearance. All of the models that have landed FDA clearance thus far have been developed not by academic institutions, but by startups. And like with academic medical centers, these companies have taken divergent approaches to publishing research on their products.

In September, Bay Area-based clinical AI system developer Dascena published results from a study testing its model on a small sample of 197 patients across five health systems. The study suggested the tool could accurately pinpoint 16% more at-risk patients than a widely used scoring system. The following month, Dascena received conditional, pandemic-era approval from the FDA for the tool.

In June, another startup — predictive analytics company CLEW Medical, based in Israel — received the same FDA clearance for a Covid-19 deterioration tool it said it had trained on retrospective data from 76,000 ICU patients over 10 years. None of the patients had Covid-19, however, so the company is currently testing it on 500 patients with the virus at two U.S. health systems.

Beam, the Harvard researcher, said he was especially skeptical about these models, since they tend to have far more limited access to patient data compared with academic medical centers.

“I think, as a patient, if you were just dropping me into any health system that was using one of these tools, I’d be nervous,” Beam said.

By Erin Brodwin

CONTINUED:

AN EPIC FAILURE

Overstated AI Claims in Medicine

Epic Systems, America’s largest electronic health records company, maintains medical information for 180 million U.S. patients (56% of the population). Using the slogan, “with the patient at the heart,” it has a portfolio of 20 proprietary artificial intelligence (AI) algorithms designed to identify different illnesses and predict the length of hospital stays.

As with many proprietary algorithms in medicine and elsewhere, users have no way of knowing whether Epic’s programs are reliable or just another marketing ploy. The details inside the black boxes are secret and independent tests are scarce.

One of the most important Epic algorithms is for predicting sepsis, the leading cause of death in hospitals. Sepsis occurs when the human body overreacts to an infection and sends chemicals into the bloodstream that can cause tissue damage and organ failure. Early detection can be life-saving, but sepsis is hard to detect early on.

Epic claims that the predictions made by its Epic Sepsis Model (ESM) are 76 percent to 83 percent accurate, but there have been no credible independent tests of any of its algorithms — until now. In a just published article in JAMA Internal Medicine, a team examined the hospital records of 38,455 patients at Michigan Medicine (the University of Michigan health system), of whom 2,552 (6.6 percent) experienced sepsis. The results are in the table. “Epic +” means that ESM generated sepsis alerts; “Epic –” means it did not.

              Epic +    Epic –     Total
Sepsis           843     1,709     2,552
No Sepsis      6,128    29,775    35,903
Total          6,971    31,484    38,455

There are two big takeaways:

a.   Of the 2,552 patients with sepsis, ESM only generated sepsis alerts for 843 (33 percent). It missed 67 percent of the people with sepsis.
b.   Of the 6,971 ESM sepsis alerts, only 843 (12 percent) were correct; 88 percent of the ESM sepsis alerts were false alarms, creating what the authors called “a large burden of alert fatigue.”

Reiterating, ESM failed to identify 67 percent of the patients with sepsis; of those patients with ESM sepsis alerts, 88 percent did not have sepsis.
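The two takeaways fall directly out of the table. A quick check of the arithmetic, using the study’s published counts:

```python
# Confusion-matrix counts from the JAMA Internal Medicine study
# of the Epic Sepsis Model (ESM) at Michigan Medicine.
tp = 843      # sepsis patients who got an ESM alert
fn = 1_709    # sepsis patients the model missed
fp = 6_128    # alerts for patients without sepsis
tn = 29_775   # no alert, no sepsis

sensitivity = tp / (tp + fn)  # share of sepsis cases flagged: ~33%
ppv = tp / (tp + fp)          # share of alerts that were correct: ~12%

print(f"sensitivity: {sensitivity:.0%}")
print(f"positive predictive value: {ppv:.0%}")
```

Note that Epic’s quoted 76 to 83 percent figure is an AUC-style discrimination measure, not a sensitivity; at the alert threshold studied here, the model both misses most sepsis cases and floods clinicians with false alarms.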

A recent investigation by STAT, a health-oriented news site affiliated with the Boston Globe, came to a similar conclusion. Its article, titled “Epic’s AI algorithms, shielded from scrutiny by a corporate firewall, are delivering inaccurate information on seriously ill patients,” pulled few punches:

Several artificial intelligence algorithms developed by Epic Systems, the nation’s largest electronic health record vendor, are delivering inaccurate or irrelevant information to hospitals about the care of seriously ill patients, contrasting sharply with the company’s published claims.

[The findings] paint the picture of a company whose business goals — and desire to preserve its market dominance — are clashing with the need for careful, independent review of algorithms before they are used in the care of millions of patients.CASEY ROSS, “EPIC’S AI ALGORITHMS, SHIELDED FROM SCRUTINY BY A CORPORATE FIREWALL, ARE DELIVERING INACCURATE INFORMATION ON SERIOUSLY ILL PATIENTS,” AT STAT NEWS

Why have hundreds of hospitals adopted ESM? Part of the explanation is surely that many people believe the AI hype — computers are smarter than us and we should trust them. The struggles of Watson Health and Radiology AI say otherwise. The AI hype is nourished here by the scarcity, until recently, of independent tests.

In addition, the STAT investigation found that Epic has been paying hospitals up to $1 million to use their algorithms. Perhaps the payments were for bragging rights? Perhaps the payments were to get a foot firmly in the hospital door, so that Epic could start charging licensing fees after hospitals commit to using Epic algorithms? What is certain is that the payments create a conflict of interest. As Glenn Cohen, Faculty Director of Harvard University’s Petrie-Flom Center for Health Law Policy, Biotechnology & Bioethics, observed, “It would be a terrible world where Epic is giving people a million dollars, and the end result is the patients’ health gets worse.”

This Epic failure is yet another of countless examples of why we shouldn’t trust AI algorithms that we don’t understand — particularly if their claims have not been tested independently.

By Gary Smith
