Unreliable narrator

Hospitals adopt error-prone AI transcription tools despite warnings

OpenAI’s Whisper tool may add fake text to medical transcripts, investigation finds.

Benj Edwards

On Saturday, an Associated Press investigation revealed that OpenAI's Whisper transcription tool creates fabricated text in medical and business settings despite warnings against such use. The AP interviewed more than 12 software engineers, developers, and researchers who found the model regularly invents text that speakers never said, a phenomenon often called a "confabulation" or "hallucination" in the AI field.

Upon its release in 2022, OpenAI claimed that Whisper approached "human level robustness" in audio transcription accuracy. However, a University of Michigan researcher told the AP that Whisper created false text in 80 percent of public meeting transcripts examined. Another developer, unnamed in the AP report, claimed to have found invented content in almost all of his 26,000 test transcriptions.

The fabrications pose particular risks in health care settings. Despite OpenAI's warnings against using Whisper for "high-risk domains," over 30,000 medical workers now use Whisper-based tools to transcribe patient visits, according to the AP report. The Mankato Clinic in Minnesota and Children's Hospital Los Angeles are among the 40 health systems using a Whisper-powered AI copilot service from medical tech company Nabla that is fine-tuned on medical terminology.

Nabla acknowledges that Whisper can confabulate, but it also reportedly erases original audio recordings "for data safety reasons." This could cause additional issues, since doctors cannot verify accuracy against the source material. And deaf patients may be especially affected by mistaken transcripts, since they have no way to check the transcript against what was actually said in the appointment.

The potential problems with Whisper extend beyond health care. Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found Whisper adding non-existent violent content and racial commentary to neutral speech. They found that 1 percent of samples included "entire hallucinated phrases or sentences which did not exist in any form in the underlying audio" and 38 percent of those included "explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority."

In one case from the study cited by AP, when a speaker described "two other girls and one lady," Whisper added fictional text specifying that they "were Black." In another, the audio said, "He, the boy, was going to, I’m not sure exactly, take the umbrella." Whisper transcribed it to, "He took a big piece of a cross, a teeny, small piece ... I’m sure he didn’t have a terror knife so he killed a number of people."

An OpenAI spokesperson told the AP that the company appreciates the researchers' findings and that it actively studies how to reduce fabrications and incorporates feedback in updates to the model.

Why Whisper confabulates

Whisper's unsuitability for high-risk domains stems from its propensity to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, "Researchers aren’t certain why Whisper and similar tools hallucinate," but that isn't true. We know exactly why Transformer-based AI models like Whisper behave this way.

Whisper is based on technology that is designed to predict the next most likely token (chunk of data) that should appear after a sequence of tokens provided by a user. In the case of ChatGPT, the input tokens come in the form of a text prompt. In the case of Whisper, the input is tokenized audio data.

The transcription output from Whisper is a prediction of what is most likely, not what is most accurate. Accuracy in Transformer-based outputs is typically proportional to the presence of relevant accurate data in the training dataset, but it is never guaranteed. When there isn't enough contextual information in its neural network for Whisper to make an accurate prediction about how to transcribe a particular segment of audio, the model falls back on what it "knows" about the relationships between sounds and words learned from its training data.
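To make that concrete, here is a minimal sketch using the open-source whisper Python package (pip install openai-whisper); the file name is a hypothetical stand-in, not anything from the AP report. The decoding parameters make the point explicit: the transcript is the result of a search over likely token sequences, not a lookup of what was actually said.

```python
# Minimal sketch with the open-source "whisper" package (pip install openai-whisper).
# "clinic_visit.wav" is a hypothetical file name used only for illustration.
import whisper

model = whisper.load_model("base")  # a small checkpoint; larger ones behave the same way

# The decoder produces the token sequence it scores as most probable given the audio.
# Nothing here checks the output against what the speaker actually said.
result = model.transcribe(
    "clinic_visit.wav",
    temperature=0.0,  # no sampling; deterministically favor the highest-probability tokens
    beam_size=5,      # keep the 5 best candidate sequences at each decoding step
    fp16=False,       # avoid half-precision warnings when running on CPU
)

print(result["text"])
```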

According to OpenAI in 2022, Whisper learned those statistical relationships from "680,000 hours of multilingual and multitask supervised data collected from the web." But we now know a little more about the source. Given Whisper's well-known tendency to produce certain outputs like "thank you for watching," "like and subscribe," or "drop a comment in the section below" when provided silent or garbled inputs, it's likely that OpenAI trained Whisper on thousands of hours of captioned audio scraped from YouTube videos (the researchers needed audio paired with existing captions to train the model).

There's also a phenomenon called "overfitting" in AI models where information (in this case, text found in audio transcriptions) encountered more frequently in the training data is more likely to be reproduced in an output. In cases where Whisper encounters poor-quality audio in medical notes, the AI model will produce what its neural network predicts is the most likely output, even if it is incorrect. And the most likely output for any given YouTube video, since so many people say it, is "Thanks for watching."
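You can see this fallback behavior for yourself with the open-source model. The sketch below is a rough illustration rather than a claim about any particular deployment: it feeds Whisper ten seconds of near-silence, and depending on the model size and run, the output is often empty but sometimes surfaces exactly the kind of training-data boilerplate described above.

```python
# Sketch: give Whisper ten seconds of near-silent audio and print whatever it predicts.
# The output reflects what the model considers likely, not what was said (nothing was).
import numpy as np
import whisper

model = whisper.load_model("base")

# 10 seconds of 16 kHz audio containing only faint noise and no speech at all.
sample_rate = 16000
audio = (np.random.randn(sample_rate * 10) * 0.0005).astype(np.float32)

result = model.transcribe(audio, fp16=False, condition_on_previous_text=False)
# Often empty, but sometimes boilerplate such as "Thanks for watching." shows up,
# varying with model size, decoding settings, and the particular noise sample.
print(repr(result["text"]))
```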

In other cases, Whisper seems to draw on the context of the conversation to fill in what should come next, which can lead to problems because its training data could include racist commentary or inaccurate medical information. For example, if the training data contained many examples of speakers saying the phrase "crimes by black criminals," then when Whisper encounters a "crimes by [garbled audio] criminals" audio sample, it will be more likely to fill in the transcription with "black."
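The open-source package exposes this context-following tendency directly through its prompting options, which bias the decoder toward text that fits the surrounding conversation. The sketch below uses a hypothetical noisy clip to show the idea; the specific outputs will vary, but the prompt shifts which tokens the model favors for ambiguous audio.

```python
# Sketch: the initial_prompt is prepended as decoder context, so predictions for
# garbled or ambiguous audio are pulled toward vocabulary that fits that context.
# "noisy_clip.wav" is a hypothetical recording used only for illustration.
import whisper

model = whisper.load_model("base")

with_context = model.transcribe(
    "noisy_clip.wav",
    initial_prompt="Transcript of a news report about local crime statistics.",
    condition_on_previous_text=True,  # also carry earlier segments forward as context
    fp16=False,
)
without_context = model.transcribe("noisy_clip.wav", fp16=False)

# Comparing the two shows how much the assumed context, rather than the audio
# alone, can steer what ends up in the transcript.
print(with_context["text"])
print(without_context["text"])
```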

In the original Whisper model card, OpenAI researchers wrote about this very phenomenon: "Because the models are trained in a weakly supervised manner using large-scale noisy data, the predictions may include texts that are not actually spoken in the audio input (i.e. hallucination). We hypothesize that this happens because, given their general knowledge of language, the models combine trying to predict the next word in audio with trying to transcribe the audio itself."

So in that sense, Whisper "knows" something about the content of what is being said and keeps track of the context of the conversation, which can lead to issues like the one where Whisper identified two women as being Black even though that information was not contained in the original audio. Theoretically, this kind of error could be reduced by using a second AI model trained to pick out areas of confusing audio where Whisper is likely to confabulate and to flag the transcript in those locations, so a human could manually double-check those instances for accuracy later.
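A much simpler stand-in for that second model, sketched below, is to use the per-segment confidence statistics the open-source Whisper package already reports and route anything suspicious to a human reviewer. The thresholds are illustrative guesses, not validated values, and this is not how Nabla or any vendor named in the report actually works.

```python
# Sketch: flag low-confidence Whisper segments for manual review.
# This relies on Whisper's own per-segment statistics rather than a separately
# trained flagging model; the thresholds are arbitrary illustrative values.
import whisper

model = whisper.load_model("base")
result = model.transcribe("patient_visit.wav", fp16=False)  # hypothetical file name

for seg in result["segments"]:
    suspicious = (
        seg["avg_logprob"] < -1.0          # the decoder was unsure of its own tokens
        or seg["no_speech_prob"] > 0.6     # the audio may not contain speech at all
        or seg["compression_ratio"] > 2.4  # highly repetitive text, a known failure mode
    )
    label = "REVIEW" if suspicious else "ok"
    print(f'[{label}] {seg["start"]:.1f}-{seg["end"]:.1f}s: {seg["text"]}')
```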

Clearly, OpenAI's advice not to use Whisper in high-risk domains, such as critical medical records, was sound. But health care companies are constantly driven by a need to decrease costs by using seemingly "good enough" AI tools, as we've already seen with Epic Systems using GPT-4 for medical records and UnitedHealth using a flawed AI model for insurance decisions. It's entirely possible that people are already suffering negative outcomes due to AI mistakes, and fixing them will likely involve some sort of regulation and certification of AI tools used in the medical field.

Benj Edwards Senior AI Reporter
Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
Staff Picks
20 years ago I worked for a company that did medical transcription using speech recognition. The key takeaway from back then: some speakers were clear enough that transcription was doable without human intervention. For the next set, we had tech that could remove filled pauses, so transcribing them was a lot faster. And only a few were so bad they had to be fully transcribed by a human.

That was 20 years ago so the tech should have gotten better. And that tech didn't make stuff up... This is a worrying trend.
Practising non-US hospital-based clinician here, and keen enthusiast for ways to make me more efficient.

In the end, I want technology to allow me to spend more time with my patients, talking to them and actually being a clinician, and less time on administration and documenting the consultation.

I've used human transcription, Dragon Medical, then Dragon Medical One extensively for several years. 12 months ago I tested 8 different 'AI' medical scribes and they were all OK, but ultimately useless - not because of hallucinations but because of an inability to tease out the important points in the conversation, and fixing the notes took more time than just using Dragon.

Out of interest, I re-tested the best of them (IMO) - Heidi Health - last week and was amazed at how much better it had become after just 12 months. I only tried a half-dozen pre-consented consultations, and the output was accurate, with minimal changes required to my clinical note. I will be testing it more but I think it is up to a level I am happy with where it will improve my workflows.

More recently, I have been using a cloud LLM-based HIPAA-certified transcription package and it is an order of magnitude more accurate than Dragon ever has been for me, and can handle me speaking much faster than Dragon.

I have no doubt there are bad implementations out there in healthcare (and frankly, enterprise-scale healthcare IT - from Epic, Cerner, etc - is an absolute shambles from a clinician's perspective). It sounds like there are bad implementations out there that should not be used. But dismissing the concept entirely as having no potential to improve healthcare provision is a bit naive and almost wilfully ignorant.

AI is not going to put me out of a job, but it sure as heck has the potential to make me more efficient (yes, I also want efficiency in my day, and will happily see more patients in my day if the systems I have around me allow me to do so safely and competently!).