AI-powered transcription tool used in hospitals makes up things no one has ever said, researchers say

SAN FRANCISCO – Tech giant OpenAI has touted its AI-powered transcription tool Whisper as having near “human-level robustness and accuracy.”

But Whisper has a major flaw: It’s prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. These experts said some of the invented text – known in the industry as hallucinations – can include racial commentary, violent rhetoric and even imaginary medical treatments.

Experts said such inventions are problematic because Whisper is used in a host of industries around the world to translate and transcribe interviews, generate text in popular consumer technologies and create captions for videos.

More worrisome, they said, is the rush by medical centers to use Whisper-based tools to transcribe patients’ consultations with doctors, despite OpenAI’s warnings that the tool should not be used in “high-risk areas.”

The full extent of the problem is difficult to discern, but researchers and engineers said they frequently encountered Whisper’s hallucinations in their work. A University of Michigan researcher conducting a study of public meetings, for example, said he found hallucinations in 8 out of 10 audio transcripts he inspected before he began trying to improve the model.

One machine learning engineer said he initially found hallucinations in about half of the more than 100 hours of Whisper transcripts he analyzed. A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper.

Problems persist even in short, well-recorded audio samples. A recent study by computer scientists found 187 hallucinations in more than 13,000 clear audio fragments they examined. That trend would lead to tens of thousands of faulty transcriptions over millions of recordings, the researchers said.

Such mistakes could have “really serious consequences,” particularly in the hospital setting, said Alondra Nelson, who until last year led the White House Office of Science and Technology Policy for the Biden administration.

“Nobody wants a misdiagnosis,” said Nelson, a professor at the Institute for Advanced Study in Princeton, New Jersey. “There should be a higher bar.”

Whisper is also used to create captions for the deaf and hard of hearing – a population at particular risk from faulty transcriptions. That’s because deaf and hard-of-hearing people have no way of identifying the fabrications that are “hidden among all the other text,” said Christian Vogler, who is deaf and directs Gallaudet University’s Technology Access Program.

OPENAI URGED TO ADDRESS PROBLEM

The prevalence of such hallucinations has prompted experts, advocates and former OpenAI employees to call on the federal government to consider AI regulation. At the very least, they said, OpenAI needs to address the flaw.

“This seems solvable if the company is willing to prioritize it,” said William Saunders, a research engineer in San Francisco who left OpenAI in February over concerns about the company’s direction. “It’s problematic if you put this out there and people are overconfident in what it can do and integrate it into all these other systems.”

An OpenAI spokesperson said the company is continually studying how to reduce hallucinations and appreciated the researchers’ findings, adding that OpenAI incorporates feedback into model updates.

While most developers assume transcription tools misspell words or make other errors, engineers and researchers said they’ve never seen another AI-powered transcription tool hallucinate as much as Whisper.

WHISPER HALLUCINATIONS

The tool is integrated into some versions of OpenAI’s flagship chatbot, ChatGPT, and is an embedded offering in Oracle and Microsoft cloud computing platforms that serve thousands of companies worldwide. It is also used to transcribe and translate text into multiple languages.

In the past month alone, a recent version of Whisper has been downloaded more than 4.2 million times from the open-source AI platform HuggingFace. Sanchit Gandhi, a machine learning engineer there, said Whisper is the most popular open-source speech recognition model and is integrated into everything from call centers to voice assistants.

Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short excerpts they obtained from TalkBank, a research repository hosted at Carnegie Mellon University. They determined that nearly 40 percent of the hallucinations were harmful or worrisome because the speaker could be misinterpreted or misrepresented.

In one example they found, a speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.”

But the transcription software added: “He took a big piece of the cross, a small, small piece. … I’m sure he didn’t have a terror knife, so he killed a number of people.”

A speaker on another recording described “two other girls and a lady.” Whisper made up additional commentary on race, adding “two other girls and a lady, um, who were black.”

In a third transcript, Whisper invented a non-existent drug called “hyperactive antibiotics.”

Researchers aren’t sure why Whisper and similar tools hallucinate, but software developers said the inventions tend to occur amid pauses, background sounds or music.

OpenAI advised in its online disclosures against using Whisper in “decision-making contexts where flaws in accuracy can lead to pronounced flaws in outcomes.”

TRANSCRIBING DOCTOR APPOINTMENTS

That warning hasn’t stopped hospitals or medical centers from using speech-to-text models, including Whisper, to transcribe what’s said during doctor visits so that health care providers can spend less time taking notes or writing reports.

More than 30,000 clinicians and 40 health systems, including the Mankato Clinic in Minnesota and Children’s Hospital Los Angeles, have begun using a Whisper-based tool built by Nabla, which has offices in France and the U.S.

The tool has been fine-tuned on medical language to transcribe and summarize patient interactions, said Nabla’s chief technology officer, Martin Raison.

Company officials said they are aware that Whisper can hallucinate and are mitigating the problem.

It’s impossible to compare Nabla’s AI-generated transcript to the original recording because Nabla’s tool deletes the original audio for “data security reasons,” Raison said.

Nabla said the tool has been used to transcribe about 7 million medical visits.

Saunders, the former OpenAI engineer, said erasing the original audio could be a concern if transcripts aren’t double-checked or if clinicians can’t access the recording to verify they are correct. “You can’t detect errors if you remove the ground truth,” he said.

Nabla said no model is perfect, and that its tool currently requires health care providers to quickly edit and approve transcribed notes, but that could change.

PRIVACY CONCERNS

Because patients’ meetings with their doctors are confidential, it’s hard to know how AI-generated transcripts affect them.

California state lawmaker Rebecca Bauer-Kahan said she took one of her children to the doctor earlier this year and refused to sign a form provided by the health network that sought her permission to share the consultation audio with vendors that include Microsoft Azure, the cloud computing system run by OpenAI’s largest investor. Bauer-Kahan didn’t want such intimate medical conversations shared with tech companies, she said.

“The release was very specific that for-profit companies would have the right to have this,” said Bauer-Kahan, a Democrat who represents part of the San Francisco suburbs in the state Assembly. “I was like ‘absolutely not.’”

John Muir Health spokesman Ben Drew said the health system complies with state and federal privacy laws.

Schellmann reported from New York.