VAD, probably. The earlier models will also produce some miscellaneous crap when they encounter silence. For example, these things can be effective for the small model (but not for v3):
|
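To make the VAD suggestion concrete, here is a minimal sketch of a crude energy gate that skips transcription for a file that is essentially silent. It is not a real voice activity detector: it only checks whole-file RMS amplitude, and it assumes 16-bit PCM WAV input. The function name `is_silent` and the threshold of 100 are assumptions for illustration, not anything from Whisper itself.

```python
import array
import math
import sys
import wave


def is_silent(path: str, rms_threshold: float = 100.0) -> bool:
    """Return True if a 16-bit PCM WAV file has an RMS amplitude below the threshold.

    rms_threshold is a hypothetical value; tune it on your own recordings.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h", frames)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return True  # an empty file has nothing to transcribe

    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < rms_threshold
```

You would call `is_silent("clip.wav")` before handing the file to Whisper and skip the transcription when it returns True. A proper VAD works per segment rather than per file, so this only covers the "entire clip is silence" case.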
Is there any good Arabic model you guys have found that is better than large-v3? |
I found a similar thing happens in German. For both German and Arabic, I found that this pretty much only happens at the very end of videos or when there is sustained silence. |
Essentially this seems to be an artifact of the fact that Whisper was trained on (amongst other things) YouTube audio + available subtitles. Subtitlers often add their copyright notice at the end of the subtitles, and the ends of videos are often credits with music, applause, or silence. Thus Whisper learned that silence == "copyright notice". See some research for the Norwegian example here: https://medium.com/@lehandreassen/who-is-nicolai-winther-985409568201 |
This also happens when you don't speak in voice mode: the transcript usually comes out as the same Arabic phrase. |
I've also seen this happen a lot in English with Skyeye: It also happens a lot with hallucinations saying stuff like "This is the end of the video, remember to like and subscribe" |
In German it's "Vielen Dank" (Thank you very much) |
This has been a problem since at least February 2024: https://x.com/SheriefFYI/status/1756694995241951398 |
In Romanian, I’ve noticed multiple instances where the transcript ends with “nu uitati sa da-ti like si subscribe” which, as you might easily infer, translates to “don’t forget to like and subscribe”. |
You can either fine-tune the model or filter the known hallucinated phrases out of Whisper's response.
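A minimal sketch of the filtering option, assuming segments shaped like the dicts in the `segments` list that openai-whisper's `transcribe` returns (each with a `"text"` key). The blocklist is hypothetical and just collects phrases reported in this thread; `drop_hallucinated_segments` is a made-up helper name.

```python
import re
import unicodedata

# Hypothetical blocklist built from phrases reported in this thread.
KNOWN_HALLUCINATIONS = [
    "ترجمة نانسي قنقر",
    "Vielen Dank",
    "nu uitati sa da-ti like si subscribe",
]


def _normalize(text: str) -> str:
    """Apply NFKC, casefold, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


_BLOCKED = {_normalize(phrase) for phrase in KNOWN_HALLUCINATIONS}


def drop_hallucinated_segments(segments):
    """Keep only segments whose entire text is not a blocklisted phrase."""
    return [seg for seg in segments if _normalize(seg["text"]) not in _BLOCKED]
```

Only whole-segment matches are dropped, so a sentence that merely contains "Vielen Dank" survives. A speaker who genuinely says just "Vielen Dank" would still be filtered out, so it is probably safer to apply this only to the last segment or to segments that follow a long pause.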
|
ChatGPT voice mode is also affected by this fwiw: https://x.com/SheriefFYI/status/1929129956153377144 |
If you generate a WAV file of complete silence and run Whisper on it, it will always hallucinate the same thing:
ffmpeg -f lavfi -i anullsrc=r=44100:cl=stereo -t 30 silence.wav
whisper ./silence.wav --language Arabic --model large-v3
[00:00.000 --> 00:29.980] ترجمة نانسي قنقر
It seems that the model learned to interpret silence as ترجمة نانسي قنقر (roughly "Translation by Nancy Qunqar", a subtitle credit) in Arabic.
Any way to fix / circumvent this?
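One possible workaround, sketched under assumptions: the segment dicts returned by openai-whisper's `transcribe` carry a `no_speech_prob` field, so you can drop segments the model itself considers likely silence. The helper name `drop_probable_silence` and the 0.6 cutoff are assumptions to tune, and this score is not guaranteed to be high for every hallucinated segment.

```python
def drop_probable_silence(segments, max_no_speech_prob=0.6):
    """Drop segments whose no_speech_prob exceeds the cutoff.

    segments: list of dicts, each optionally carrying a "no_speech_prob" float.
    max_no_speech_prob: hypothetical cutoff; tune it on your own audio.
    """
    kept = []
    for seg in segments:
        # Segments without the field are kept rather than guessed at.
        if seg.get("no_speech_prob", 0.0) > max_no_speech_prob:
            continue
        kept.append(seg)
    return kept
```

With real output you would call `drop_probable_silence(result["segments"])` and join the remaining `"text"` values yourself, since the top-level `"text"` field still contains the unfiltered transcript.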