On Saturday, an Associated Press investigation revealed that OpenAI's Whisper transcription tool creates fabricated text in medical and business settings despite warnings against such use. The AP interviewed more than 12 software engineers, developers, and researchers who found the model regularly invents text that speakers never said, a phenomenon often called a "confabulation" or "hallucination" in the AI field.
When OpenAI released Whisper in 2022, it claimed the model approached "human level robustness" in audio transcription accuracy. However, a University of Michigan researcher told the AP that Whisper created false text in 80 percent of the public meeting transcripts examined. Another developer, unnamed in the AP report, claimed to have found invented content in almost all of his 26,000 test transcriptions.
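For developers who want to probe this behavior themselves, the open-source package OpenAI released alongside the model exposes per-segment confidence signals that can be used to flag suspect output. Here is a minimal sketch, assuming the `openai-whisper` Python package; the audio file name is a hypothetical stand-in, and the thresholds mirror the library's own defaults for detecting failed decodes:

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("base")             # smaller models tend to hallucinate more
result = model.transcribe("clinic_visit.wav")  # hypothetical audio file

for seg in result["segments"]:
    # Two heuristics for likely confabulation: the decoder was unsure of
    # its own words (low average log-probability), or it judged the
    # segment was probably silence/non-speech yet emitted text anyway.
    if seg["avg_logprob"] < -1.0 or seg["no_speech_prob"] > 0.6:
        print(f"[suspect] {seg['start']:6.1f}-{seg['end']:6.1f}s: {seg['text']}")
```

Heuristics like these catch some fabrications but not all: a fluent, high-confidence hallucination looks identical to a correct transcription, which is why researchers instead compare output against the underlying audio.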
The fabrications pose particular risks in health care settings. Despite OpenAI's warnings against using Whisper for "high-risk domains," over 30,000 medical workers now use Whisper-based tools to transcribe patient visits, according to the AP report. The Mankato Clinic in Minnesota and Children's Hospital Los Angeles are among 40 health systems using a Whisper-powered AI copilot service from medical tech company Nabla that is fine-tuned on medical terminology.
Nabla acknowledges that Whisper can confabulate, but it also reportedly erases the original audio recordings "for data safety reasons." That compounds the problem, since doctors cannot check a transcript against the source material. Deaf patients may be especially affected by mistaken transcripts, since they have no way of knowing whether a transcript accurately reflects what was said.
The potential problems with Whisper extend beyond health care. Researchers from Cornell University and the University of Virginia studied thousands of audio samples and found that Whisper added nonexistent violent content and racial commentary to neutral speech. They found that 1 percent of samples included "entire hallucinated phrases or sentences which did not exist in any form in the underlying audio" and that 38 percent of those included "explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority."
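When a ground-truth transcript of the audio exists, as in studies like this one, hallucinated insertions can be isolated mechanically by aligning the model's output against the reference. Here is a rough standard-library sketch; both transcripts are invented stand-ins, not data from the study:

```python
from difflib import SequenceMatcher

reference  = "the patient reported mild chest pain after exercise".split()
hypothesis = "the patient reported mild chest pain after exercise then became violent".split()

# Align the two word sequences; "insert" opcodes mark words that appear
# in the model's output with no counterpart in the reference.
matcher = SequenceMatcher(a=reference, b=hypothesis)
for tag, _, _, b0, b1 in matcher.get_opcodes():
    if tag == "insert":
        print("hallucinated insertion:", " ".join(hypothesis[b0:b1]))
```

Run against these stand-ins, the script prints `hallucinated insertion: then became violent`, the kind of fabricated harm the study flagged.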
That was 20 years ago, so the tech should have gotten better. And that tech didn't make stuff up... This is a worrying trend.
In the end, I want technology to allow me to spend more time with my patients, talking to them and actually being a clinician, and less time on administration and documenting the consultation.
I've used human transcription, Dragon Medical, and then Dragon Medical One extensively for several years. Twelve months ago I tested eight different 'AI' medical scribes, and they were all OK but ultimately useless - not because of hallucinations, but because of their inability to tease out the important points in the conversation; fixing the notes took more time than just using Dragon.
Out of interest, I re-tested the best of them (IMO) - Heidi Health - last week and was amazed at how much better it had become after just 12 months. I tried only a half-dozen pre-consented consultations, and the output was accurate, with minimal changes required to my clinical notes. I will be testing it more, but I think it has reached a level I am happy with, where it will improve my workflow.
More recently, I have been using a cloud-based, HIPAA-compliant LLM transcription package, and it is an order of magnitude more accurate than Dragon has ever been for me; it can also handle me speaking much faster than Dragon could.
I have no doubt there are bad implementations out there in healthcare that should not be used (and frankly, enterprise-scale healthcare IT - from Epic, Cerner, et al. - is an absolute shambles from a clinician's perspective). But dismissing the concept entirely as having no potential to improve healthcare provision is naive, almost wilfully ignorant.
AI is not going to put me out of a job, but it sure as heck has the potential to make me more efficient (yes, I also want efficiency in my day, and will happily see more patients if the systems around me allow me to do so safely and competently!).