
Transcribing audio with AI using Speech Note


By Joe Brockmeier
September 3, 2024

One of the joys of writing about technology is the opportunity to cover interesting talks on open-source and free-software topics. One of the pains is creating transcriptions of said talks, or continually referring back to a recording, in order to write about them. Speech Note is an open-source application that uses machine-learning models, running locally, to convert speech to text and take the pain out of transcription. It also handles text-to-speech and language translation. While not perfect, its transcriptions are better than one might expect, even when handling jargon, accents, and less-than-perfect audio.

Speech Note is a desktop application, licensed under the MPL-2.0, that is distributed as a Flatpak (via Flathub) and available for x86_64 and aarch64. Packages are also available for Arch Linux and Sailfish OS. It also has optional add-ons that provide support for AMD ("Speech Note AMD") or NVIDIA ("Speech Note NVIDIA") GPUs. It will work without hardware acceleration, but at a substantially slower pace.

Models and engines

Speech Note acts as a front end for various open-source processing engines and their models. The project publishes a full list of the processing engines, models, and languages that Speech Note supports, along with a table showing which languages each engine handles. For example, the table shows that DeepSpeech / Coqui STT, whisper.cpp, and Vosk can transcribe spoken Czech to text, but that the april-asr engine cannot.

The application has a limited number of actions that can be controlled via the command line, or by other applications using D-Bus, while Speech Note is running. For example, users can have Speech Note read text aloud from the clipboard (X11 only) using this command:

    $ flatpak run net.mkiol.SpeechNote --action start-reading-clipboard

Currently, only a few of Speech Note's actions are supported via the command line. Specifically, users can control text-to-speech or start listening for speech to transcribe from the default system audio input. Passing files to transcribe or translate via the command-line interface is not supported at this time.
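
For instance, starting to listen on the default input might look like the following (the action name here is an assumption based on the pattern above; Speech Note's documentation lists the exact set of supported actions):

    $ flatpak run net.mkiol.SpeechNote --action start-listening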

After installing Speech Note, the next step is to add models. The model files can be rather large (in some cases, more than 1GB), so they are not distributed with Speech Note by default. Models are installed by clicking the "Languages" button and then selecting the language that Speech Note will transcribe from, speak in, or translate from. For example, to transcribe audio in English to text, select English; to translate from Spanish text to English text, select Spanish. Speech Note has support for quite a few languages, even Esperanto, though the number of models available for each language varies. (Esperanto has only one speech-to-text model and two text-to-speech models.) Sadly, Speech Note has no Klingon support, at least not yet.

Each model is listed by name with an information button next to its download button. Clicking the information button will display some extended information, including the model type, its processing speed, supported hardware acceleration (if any), quality, license, download size, and files that will be downloaded.

Speech to text

Unless one is already familiar with the models, it may be a bit confusing to decide which to download. For English audio to text, I can recommend the WhisperCPP-Distil Large-v3 and FasterWhisper-Distil Large-v3 models; both are available under the MIT license. They are slow models, but they offer high-quality transcriptions, with punctuation inserted with a reasonable degree of accuracy. The WhisperCPP model supports hardware acceleration with NVIDIA GPUs (but not AMD), and the FasterWhisper model supports CPU hardware acceleration on some Intel CPUs using the OpenVINO toolkit.

[Screenshot of Speech Note]

Using Speech Note without hardware acceleration meant processing times at least twice as long as the audio being processed. With hardware acceleration using an NVIDIA GPU, or an Intel CPU with OpenVINO support, Speech Note could chew through a transcription in less time than the actual length of the audio. Of course, even slow transcription is better than having to do it manually: simply feed Speech Note something to transcribe, go off and do other things, and come back to a finished transcription.

The fast models are, in my testing at least, not worth the tradeoff of speed versus accuracy. The Mozilla DeepSpeech model was indeed much faster than WhisperCPP's, but its results were much less accurate and included little punctuation. A short example of FasterWhisper versus DeepSpeech is available here. Neither transcription is perfect, and both lack paragraph breaks, but FasterWhisper clearly does a much better job and its text could be cleaned up to be wholly accurate in little time.

After downloading one or more models, users can select the model they wish to use and then go to the File menu to import an audio or video file, or click "Listen" to process speech from the system's sound input. There is no listing of supported file types or codecs, but Speech Note has happily transcribed from MP3, FLAC, and MP4 (video) files so far. While transcribing, Speech Note displays a status message at the bottom of the screen with an estimate of how much of the file has been transcribed. Transcription results are displayed in the Notepad area as they are finalized.

Once the audio file is fully processed, the transcription can be exported to a plain-text file. Speech Note can also produce SRT files, which would be useful for creating subtitles for videos.
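
SRT itself is a simple plain-text format: numbered cues, each with a time range and the text to display. A small fragment (timestamps and text invented for illustration) looks like this:

    1
    00:00:00,000 --> 00:00:04,200
    Welcome to the talk.

    2
    00:00:04,200 --> 00:00:09,500
    Today we will look at speech-to-text tools.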

While some of the models offer impressively accurate transcription and punctuation, none of the models distinguish speakers. So, for example, if Speech Note is given an MP3 of a podcast with multiple speakers using the WhisperCpp-Distil Large-v3 model, it will likely do a credible job of generating a mostly accurate transcript, but the transcript will not reflect that more than one person was speaking.

Text to speech and translation

Speech Note has a dizzying array of model options for text-to-speech conversion. It will read text in real time from its text window, or it can export text to an audio file; the available formats are MP3, WAV, Opus, and Ogg Vorbis. The pace it uses to read text can be adjusted from regular playback speed up to 2.0x or down to 0.1x. The fastest and slowest speeds are not recommended except for comedic effect.

The text-to-speech models that I have tried are clear and crisp, but they are obviously artificial voices that have odd pauses and pronunciations. It is unlikely that someone would listen to a recording from Speech Note and mistake it for a human speaker, but its output is serviceable.

Unfortunately, my fluency in other languages is too weak to effectively judge Speech Note's machine translations. However, using Speech Note to translate English to German (for example) is fairly fast, and putting the German translation into Google Translate returns an English version that is almost identical to the original version.

The most recent feature release, version 4.6.0, came out on August 3. It included a number of new translator models and new text-to-speech voices, and added separate settings for speech-to-text engines in the preferences. Version 4.6.1 followed with a few bug fixes and more translator models. The project is hosted on both GitHub and GitLab; contributors are invited to report issues or submit pull/merge requests on whichever platform they prefer.

In my limited experience, Speech Note with the Whisper models is more accurate than services like Otter.ai or Amazon Transcribe, and much more accurate than YouTube's automatically generated transcripts for videos. It is not perfect, and tends to stumble on acronyms, names, and jargon in particular, which is to be expected. However, it does a much better job with those than one might expect.

Speech Note is a useful tool for anyone who needs to convert audio to text (or vice versa) without depending on a third-party service. It is a particularly appealing option if one wants to convert audio (such as recordings of company meetings) that should not be shared with third parties. It is, of course, also much more cost-effective than using subscription services, and far faster than typing up a transcription manually.




Cross-platform Alternative

Posted Sep 3, 2024 16:22 UTC (Tue) by burki99 (subscriber, #17149) [Link]

If you are on Mac or Windows and cannot run Speech Note, noScribe also runs Whisper models through a fairly friendly GUI: https://github.com/kaixxx/noScribe

Whisper is great

Posted Sep 3, 2024 17:30 UTC (Tue) by dskoll (subscriber, #1630) [Link] (5 responses)

I use Whisper (directly from the command-line) with the English language "small" model and it's extremely good. The only nitpicks I have are that its timecodes are a little off, so I have to adjust video captions, and the captions are not split in logical places. But it still saves a huge amount of time compared to captioning a video by hand, and all the processing is local so there's no cloud nonsense involved.
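
A typical invocation of that sort (file name hypothetical; the model and output-format flags are from the whisper command-line tool) would be:

    $ whisper talk.mp4 --model small.en --output_format srt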

… until it's not

Posted Sep 3, 2024 20:53 UTC (Tue) by yeltsin (guest, #171611) [Link] (4 responses)

I've tried using models of all sizes to transcribe podcasts for later reference (running rg through thousands of transcript files is a lot easier than trying to remember which episode mentioned that application whose name you can't recall, and then digging through it).
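
For example, a search across the transcripts might be (path hypothetical):

    $ rg -i -l 'speech note' ~/podcasts/transcripts/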

It's generally fine, but often struggles with abbreviations, and writes the same name in many creative ways (calling the same person John → Joan → Jon → Johan in one episode).

It would be great to have speaker recognition, so it could at least reliably split different speakers into separate paragraphs. As for names and abbreviations, I'm not sure; it would probably have to learn to understand the context, which seems like it would require a much more complex implementation.

… until it's not

Posted Sep 3, 2024 21:40 UTC (Tue) by Paf (subscriber, #91811) [Link]

Larger multimodal language models do this (understand context and give reasonable, consistent transcriptions), but I don't know of any good enough that can be run locally on realistic hardware, and they come with all of the baggage and issues we're all familiar with that I won't go into here.

But they're 'smart' enough to do what you're talking about quite well.

… until it's not

Posted Sep 7, 2024 22:46 UTC (Sat) by ringerc (subscriber, #3071) [Link] (2 responses)

I'd love to be able to seed one with a context.

An invite list and a list of common terms for a meeting recording. Even better if there's a way to tell it how some jargon is pronounced ("Ceph" spoken by Americans seems to sound like "Seth" to transcription software; "jit" for JIT, git has a hard "g", etc).

Or a seed like an episode synopsis for a TV show.

This could potentially be done in two passes, too: one to rough-cut the transcript and dump out key terms, names, and so on, which you fix before re-running it.

… until it's not

Posted Sep 8, 2024 7:09 UTC (Sun) by Wol (subscriber, #4433) [Link]

Going the other way (satnavs doing text-to-speech), this is a hard problem.

Bear in mind, iirc, the rule is "an i or e after g makes it soft". We have a town locally called Gillingham, which follows the rule and has a soft G. Most satnavs get it wrong and use a hard G; that pronunciation belongs to a different Gillingham, down Somerset way, where we were on holiday a year or so ago.

If humans don't follow the rules and "just know" how something works, how on earth are "speech to text" and "text to speech" engines going to get it right! :-) Most of the roads down our way they simply can't pronounce.

Cheers.
Wol

Providing context

Posted Sep 25, 2024 7:53 UTC (Wed) by mwood (guest, #55622) [Link]

> I'd love to be able to seed one with a context.

You actually can! Whisper has an `--initial_prompt` option.

e.g. I tried the example from the article and gave it some context, which allowed it to correctly transcribe "shan't" and "bosh" and the repetition, but for some reason it got horribly confused in the middle.

I gave it the following context:

'This is a reading of a two stanza poem. It contains some old/unusual exclamations and the last line of each stanza contains some repetition like "and WORD, and WORD, and WORD"'
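
On the command line, that might look like the following (file name and model size are hypothetical), with the output shown below:

    $ whisper the-cat.mp3 --model small \
        --initial_prompt 'This is a reading of a two stanza poem. It contains some old/unusual exclamations and the last line of each stanza contains some repetition like "and WORD, and WORD, and WORD"'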

[00:00.000 --> 00:13.720] the cat this is a LibriVox recording all LibriVox recordings are in the public domain for more information or to volunteer please visit LibriVox.org
[00:13.720 --> 00:26.560] the cat advice to the young by harry graham from ruthless rhymes for heartless homes LibriVox coffee break collection number eight
[00:26.560 --> 00:41.280] my children you should imitate the harmless necessary cat who eats whatever's on his plate and doesn't even leave the fat who never stays in bed too late or does immoral things like that
[00:41.280 --> 00:55.080] instead of saying shan't or bosh he'll sit and wash and wash and wash when shadows fall and lights grow dim he sits beneath the kitchen stair
[00:55.080 --> 00:55.880] basta
[00:55.880 --> 00:56.460] ba
[00:56.460 --> 01:03.260] and limb a simple couch he chooses there and if you tumble over him he simply loves to hear you
[01:03.260 --> 01:19.500] swear and while bad language you prefer he'll sit and purr and purr and purr end of the cat by harry graham read by patrick wallace

whisper in italian

Posted Sep 5, 2024 8:00 UTC (Thu) by LtWorf (subscriber, #124958) [Link] (3 responses)

In my experience, Whisper in Italian is kinda useless, unless someone speaks very slowly and clearly like the "learn Italian" records.

It constantly invents new words, prefers completely unknown words to very common ones, and gets word boundaries wrong very often.

whisper in italian

Posted Sep 5, 2024 16:12 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

Sounds like Google Meet (I think it is) ... which we use for company pep broadcasts.

In the background my team are usually making fun of it - one of our senior guys has a very strong Scottish accent and oh boy does Meet have trouble with it ...

Cheers,
Wol

whisper in italian

Posted Sep 7, 2024 16:13 UTC (Sat) by SLi (subscriber, #53131) [Link]

Ah, yes. Does it still consistently transcribe someone coughing as "Google"?

Burnistoun sketch on voice recognition with accents

Posted Sep 6, 2024 10:35 UTC (Fri) by farnz (subscriber, #17727) [Link]

There's a great comedy sketch about a voice-activated elevator that can't handle Scottish accents, in which the characters struggle to get the machine to recognise a simple number.

Zoom

Posted Sep 7, 2024 22:50 UTC (Sat) by ringerc (subscriber, #3071) [Link]

I'm not a big fan of Zoom but I have to give them some serious credit for their speech to text and automatic transcripts.

My work often has calls with mixes of strongly accented, fast-talking Americans, accented fast-talking Chinese, accented fast-talking Indians and Pakistanis, and all sorts of others. Plus a lot of jargon. Zoom can sometimes understand better than I can.

I'd love to see openly available models reach this standard. It's troubling how AI model training is becoming another barrier to control of your own things. As if the relentless drive toward SaaS and IoT subscription models and forced cloud-tied accounts isn't already enough to remove control of what you own and use.

Real-time Speech-to-text?

Posted Sep 13, 2024 22:26 UTC (Fri) by jch (guest, #51929) [Link]

I've been experimenting with real-time STT for the Galene videoconferencing system¹ using the whisper.cpp library², and it has been challenging. The API is not naturally adapted to real-time transcription; I've worked around that by splitting the audio into chunks that I then pipe into the model. More seriously, the only model I can run in real time on the CPU is the "base" model, which is not very useful in practice. I'll be looking at running on the GPU next time I have some time to work on it.
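
For what it's worth, whisper.cpp also ships a real-time microphone example (examples/stream in the repository) that does similar chunking internally; an invocation along the lines of the project's README (binary name and flag values may differ between versions) is:

    $ ./stream -m models/ggml-base.en.bin -t 8 --step 500 --length 5000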

Does anyone have experience with real-time STT? If so, I'd appreciate a note at <galene@lists.galene.org>.

¹ https://github.com/jech/galene-stt
² https://github.com/ggerganov/whisper.cpp


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds