Evaluating Voice AI Models
Voice AI is complex, expensive technology that advances rapidly. Comparing models based on vendor claims such as “the best,” “revolutionary,” and “the most accurate” is nearly impossible. We open-sourced our benchmarks and shared practical tips to help enterprises make data-driven decisions.
False Alarm and False Rejection
A missed wake word (false rejection) degrades the user experience, and a spurious detection (false alarm) erodes user trust. Metrics quoted verbatim by different parties are not comparable, so finding the most accurate wake word engine requires running the evaluation yourself. Learn how to benchmark wake word detection engines.
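As a sketch of how these two error types are typically quantified (the function names and counts below are hypothetical), a wake word benchmark tallies false rejections over utterances that contain the wake word and false alarms over long recordings that do not:

```python
def false_rejection_rate(num_missed: int, num_wake_word_utterances: int) -> float:
    """Fraction of wake word utterances the engine failed to detect."""
    return num_missed / num_wake_word_utterances


def false_alarms_per_hour(num_false_detections: int, audio_hours: float) -> float:
    """Spurious detections per hour of audio that contains no wake word."""
    return num_false_detections / audio_hours


# Hypothetical results from running an engine over a labeled dataset:
# 1,000 wake word utterances and 24 hours of wake-word-free background audio.
print(false_rejection_rate(num_missed=47, num_wake_word_utterances=1000))  # 0.047
print(false_alarms_per_hour(num_false_detections=12, audio_hours=24.0))    # 0.5
```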
Voice Command Acceptance
Finding the most accurate NLU engine for voice assistants is challenging: precision, recall, F-score… end-to-end intent inference vs. conventional spoken language understanding… We simplified it!
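For intuition, here is a minimal sketch of per-intent precision, recall, and F1 computed from intent predictions; the intent labels are hypothetical, and a real benchmark would also score slot values:

```python
def precision_recall_f1(expected, predicted, intent):
    """Per-intent precision, recall, and F1 from parallel label lists."""
    tp = sum(e == intent and p == intent for e, p in zip(expected, predicted))
    fp = sum(e != intent and p == intent for e, p in zip(expected, predicted))
    fn = sum(e == intent and p != intent for e, p in zip(expected, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


expected = ["orderDrink", "orderDrink", "cancelOrder", "orderDrink"]
predicted = ["orderDrink", "cancelOrder", "cancelOrder", "orderDrink"]
print(precision_recall_f1(expected, predicted, "orderDrink"))  # (1.0, 0.667, 0.8)
```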
Word Error Rate (WER)
WER is the most widely known speech-to-text metric. However, vendors can make technically correct yet misleading claims using WER. It’s critical for an enterprise to understand the nuances and calculate WER on its own data.
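As a reference point, WER counts word-level substitutions (S), deletions (D), and insertions (I) against the number of words in the reference transcript (N): WER = (S + D + I) / N. A minimal sketch using word-level edit distance (the sample sentences are made up):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)


print(word_error_rate("turn off the kitchen lights", "turn of the lights"))  # 0.4
```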
Short-Time Objective Intelligibility (STOI)
Various speech quality and speech intelligibility metrics can be used to compare noise suppression engines. Learn why Picovoice researchers chose STOI for this comparison.
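A minimal sketch of scoring one engine's output with STOI, assuming the third-party `pystoi` and `soundfile` packages are installed; the file names are hypothetical, and the clean reference and processed signal must be time-aligned and share a sample rate:

```python
import soundfile as sf
from pystoi import stoi

clean, sample_rate = sf.read("clean_reference.wav")  # hypothetical file names
denoised, _ = sf.read("engine_a_output.wav")

# STOI compares the processed signal against the clean reference;
# scores closer to 1.0 indicate higher predicted intelligibility.
score = stoi(clean, denoised, sample_rate, extended=False)
print(f"STOI: {score:.3f}")
```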
Miss Rate
Transcribing audio and video files makes them searchable, but transcription is neither accurate enough nor practical for every use case. Octopus makes audio files searchable without relying on text and misses fewer words. The open-source search benchmark proves it.
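One way to think about the metric (a minimal sketch; the counts are hypothetical): miss rate is the fraction of true occurrences of a spoken search phrase that an engine fails to find in the corpus.

```python
def miss_rate(num_missed: int, num_occurrences: int) -> float:
    """Fraction of true occurrences of the search phrase that were not found."""
    return num_missed / num_occurrences


# Hypothetical tally after searching a labeled audio corpus for a phrase
# that actually occurs 200 times, comparing a transcript-based search
# against an acoustic (text-free) search.
print(miss_rate(num_missed=38, num_occurrences=200))  # 0.19  (transcript-based)
print(miss_rate(num_missed=11, num_occurrences=200))  # 0.055 (acoustic search)
```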
ROC Curve
The ROC (Receiver Operating Characteristic) curve lets researchers study the trade-off between detection rate and false positive rate. While making our internal voice activity detection engine, Cobra, publicly available, we open-sourced our internal benchmark, too!
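To make the trade-off concrete, here is a sketch that sweeps a detection threshold over frame-level voice probabilities and plots true positive rate against false positive rate; the labels and probabilities are hypothetical, and scikit-learn's `roc_curve` does the bookkeeping:

```python
import numpy as np
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# Hypothetical frame-level data: 1 = frame contains speech, 0 = silence/noise,
# alongside the voice probability a VAD engine assigned to each frame.
labels = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
voice_probabilities = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.9, 0.5, 0.7, 0.15])

# Sweep the detection threshold and collect (false positive rate, true positive rate).
fpr, tpr, thresholds = roc_curve(labels, voice_probabilities)

plt.plot(fpr, tpr, marker="o")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Detection Rate)")
plt.title("ROC Curve for a Voice Activity Detection Engine")
plt.show()
```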