Finetuning OpenAI’s Whisper: creating your own custom dataset (I)
If you haven’t yet, make sure to check out the introduction to this guide series:
In the quest to fine-tune Whisper for optimal performance, the quality of the data you use stands out as the cornerstone of success. While the robustness of any machine learning model hinges on several factors, the caliber of its training data invariably casts the longest shadow over its efficacy. Indeed, in a world abundant with readily accessible high-quality datasets — like Mozilla’s “Common Voice” — the question arises: Why go through the effort of creating your own custom dataset?
The answer lies in the unique requirements of specialized applications. Public datasets, though extensive and diverse, often lack the depth needed to capture the subtle intricacies and complex nuances required by fields such as medicine or law. In these domains, even minor inaccuracies can lead to significant repercussions, underscoring the need for a meticulously tailored approach to data preparation.
As previously mentioned in this series, the effectiveness of an Automatic Speech Recognition (ASR) system in specialized settings hinges not just on understanding broad linguistic patterns, but on grasping the specific jargon and contextual nuances of its intended application. This part of our series will guide you through the crucial steps of crafting a dataset that not only trains Whisper to achieve general competence but hones its abilities to deliver precision where it counts most.
Table of Contents
- How to preprocess your audio
- Applying the preprocessing techniques
- What’s next?
How to preprocess your audio
The first step in crafting a high-quality dataset for fine-tuning Whisper is the careful preprocessing of your audio files. Whisper operates within specific constraints — one key limitation being that it can only process audio clips up to 30 seconds in length. Any audio beyond this duration isn’t just truncated; it’s completely disregarded in the training process. This introduces a critical challenge: effectively segmenting your audio into manageable, coherent chunks that fit within this time frame.
When preparing audio for fine-tuning Whisper, it’s essential to consider not just the mechanical splitting of audio into segments but the preservation of contextual integrity and acoustic quality. Simply dividing an audio file into 30-second chunks without regard for the content can disrupt the linguistic flow and lead to the loss of crucial information at the borders between segments. Moreover, silence within these segments can skew the model’s training, focusing its learning on irrelevant pauses rather than useful speech.
To navigate these challenges, there is one crucial piece of information to keep in mind. Human speech, when broken down into very short segments of around 30 milliseconds (ms), can generally be treated as stationary [1]. During such a brief interval, the vocal apparatus (larynx, tongue, lips, etc.) does not have the physical capability to produce dramatic changes in sound. This property is crucial for developing effective audio preprocessing strategies because it lets us make assumptions about the continuity and stability of speech within these short windows.
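To make these constraints concrete, here is a small sketch, using assumed values chosen for illustration, that converts the 30 ms frame length and the 30-second ceiling into sample counts, the unit all of the functions below operate on (Whisper models expect audio resampled to 16 kHz):

SAMPLE_RATE = 16000                 # Whisper expects 16 kHz input audio
FRAME_MS = 30                       # frame length justified by the stationarity assumption
MAX_CHUNK_SECONDS = 30              # Whisper's hard limit per training example

# Convert durations from milliseconds/seconds into sample counts
frame_length_samples = SAMPLE_RATE * FRAME_MS // 1000       # 480 samples
max_chunk_length_samples = SAMPLE_RATE * MAX_CHUNK_SECONDS   # 480,000 samples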
All of these preprocessing steps can be achieved through carefully crafted Python code. But how do we embark on this coding journey? To help you visualize and understand the sequence of tasks and their interconnections, I’ve prepared a flowchart. This diagram will guide you through the functionality of the code you will develop by the end of this guide.
Applying the preprocessing techniques
To commence the audio preprocessing for Whisper, a crucial first step is setting appropriate thresholds for voice activity detection (VAD) within your audio files. A typical approach, though fraught with potential inaccuracies, is to combine a fixed decibel threshold, such as -30 dB, with a fixed duration, such as 600 milliseconds: any stretch of audio that does not exceed -30 dB for at least 600 milliseconds is removed as silence. This approach is problematic because it presumes a uniform loudness profile across all audio files, which is seldom the case.
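For contrast, a minimal sketch of this fixed-threshold approach could look as follows, using librosa's amplitude-based splitter; the file name, sample rate, and the way short pauses are bridged are assumptions made purely for illustration:

import librosa

# Naive fixed-threshold VAD, shown only for contrast with the dynamic approach below
audio, sr = librosa.load("recording.wav", sr=16000)   # placeholder file name

# Regions that stay more than 30 dB below the file's peak are treated as silence
intervals = librosa.effects.split(audio, top_db=30)

# Only pauses longer than 600 ms count as silence; shorter gaps are bridged
min_silence_samples = int(0.6 * sr)
speech_regions = [list(intervals[0])]
for start, end in intervals[1:]:
    if start - speech_regions[-1][1] < min_silence_samples:
        speech_regions[-1][1] = end             # bridge a short gap
    else:
        speech_regions.append([start, end])     # a real pause: start a new region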
To address these challenges more effectively, we will utilize two more informative metrics: the Zero Crossing Rate (ZCR) and the energy level of the audio. Instead of depending on a fixed decibel threshold, we dynamically calculate the mean ZCR and energy for each individual audio file. To enhance precision, we set our thresholds at the mean plus one standard deviation of these measurements. In practice, for an audio frame to be classified as silent, it must fall below both the ZCR and the energy threshold. This dual check ensures that we only deem a frame silent when it truly exhibits low levels of both vocal activity and sound intensity, minimizing the risk of erroneously discarding important audio content [2]. The compute_dynamic_thresholds function below establishes these two thresholds; it becomes fully useful once the remaining components of the pipeline are in place.
import librosa
import numpy as np


def compute_dynamic_thresholds(audio, sr, frame_length_samples, segment_length_samples):
    """
    Computes thresholds for energy and Zero-Crossing Rate (ZCR).

    The thresholds are determined from the mean and standard deviation
    of the energy and ZCR within individual segments of the audio signal.
    """
    # Initialization of lists for energy thresholds and ZCR thresholds
    energy_thresholds = []
    zcr_thresholds = []

    # Iterating over the audio signal in segments
    for i in range(0, len(audio), segment_length_samples):
        # Extract the current segment from the audio signal
        segment = audio[i:i + segment_length_samples]
        # Calculate the number of frames in the current segment
        n_frames = max(int(len(segment) / frame_length_samples), 1)
        # Calculate the energy for each frame in the segment
        energy = np.array([np.sum(segment[j * frame_length_samples:(j + 1) * frame_length_samples] ** 2) for j in range(n_frames)])
        # Calculate the Zero-Crossing Rate for each frame in the segment
        zcr = np.array([np.sum(librosa.zero_crossings(segment[j * frame_length_samples:(j + 1) * frame_length_samples], pad=False)) for j in range(n_frames)])
        # Add the energy threshold to the list:
        # the mean plus the standard deviation of the energy in the segment
        energy_thresholds.append(np.mean(energy) + np.std(energy))
        # Add the ZCR threshold to the list:
        # the mean plus the standard deviation of the ZCR in the segment
        zcr_thresholds.append(np.mean(zcr) + np.std(zcr))

    # Return the lists of thresholds
    return energy_thresholds, zcr_thresholds
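A brief, hypothetical usage example follows; the file name is a placeholder, and the choice of five equal segments per file mirrors what the analyze_chunks function below does internally:

# Hypothetical usage: derive per-segment thresholds for one recording
audio, sr = librosa.load("recording.wav", sr=16000)   # placeholder file name

frame_length_samples = int(0.030 * sr)       # 30 ms analysis frames
segment_length_samples = len(audio) // 5     # five segments per file, as used below

energy_thresholds, zcr_thresholds = compute_dynamic_thresholds(
    audio, sr, frame_length_samples, segment_length_samples
)
print(f"{len(energy_thresholds)} segment-level threshold pairs computed")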
Following the establishment of our dynamic thresholds using the ZCR and energy levels, the subsequent step involves implementing the logic for VAD and the assembly of audio frames into coherent chunks. This critical portion of the code is where our preprocessing strategy truly takes shape.
In the provided analyze_chunks function, the audio is methodically divided into 30 ms frames, which are then classified as either active or silent based on the calculated energy and ZCR thresholds. Active frames are grouped into chunks, while silent frames serve as signals to terminate the current chunk. This approach not only helps retain contextual information within each chunk but also minimizes the loss of important linguistic details at the edges of audio windows. By carefully defining and detecting the end points of speech segments, we ensure that the dataset prepared for training Whisper is of high quality, with each chunk representing a clear and complete snippet of speech.
def analyze_chunks(audio, sr, min_chunk_length_samples, max_chunk_length_samples, frame_length_samples, overlap_samples):
    """
    Analyzes the audio to find chunks based on energy and Zero-Crossing Rate (ZCR).

    This function divides the audio into frames and determines whether each frame is silent or active.
    Active frames are aggregated into chunks. Silent frames signal the end of a chunk,
    taking into account the specified minimum and maximum chunk lengths.
    """
    # Dividing the audio into five segments for threshold calculation
    segment_length_samples = len(audio) // 5
    # Compute dynamic thresholds for energy and ZCR for each segment
    energy_thresholds, zcr_thresholds = compute_dynamic_thresholds(audio, sr, frame_length_samples, segment_length_samples)

    chunks = []                        # Initialization of the list for the found chunks
    current_chunk_start = None         # Starting point of the current chunk
    last_valid_end = None              # Endpoint of the last valid frame in the current chunk
    is_previous_frame_silent = False   # State of the previous frame

    # Iterating over the audio in frame steps
    for i in range(0, len(audio), frame_length_samples):
        segment_index = i // segment_length_samples
        segment_index = min(segment_index, len(energy_thresholds) - 1)  # Prevent index overflow

        # Assigning thresholds for the current frame
        energy_threshold = energy_thresholds[segment_index]
        zcr_threshold = zcr_thresholds[segment_index]

        # Extracting the current frame
        frame = audio[i:min(i + frame_length_samples, len(audio))]

        # Calculating energy and ZCR for the frame
        frame_energy = np.sum(frame ** 2)
        frame_zcr = np.sum(librosa.zero_crossings(frame, pad=False))

        # Determining whether the frame is silent
        is_silent = frame_energy <= energy_threshold and frame_zcr <= zcr_threshold

        # If no chunk has started and the frame is not silent, start a new chunk
        if current_chunk_start is None and not is_silent:
            current_chunk_start = i    # Start of a new chunk
            last_valid_end = i + frame_length_samples

        # If a chunk is active
        if current_chunk_start is not None:
            # End the chunk on sustained silence (two silent frames in a row)
            # or when the maximum chunk length is reached during a silent frame
            if is_silent and (is_previous_frame_silent or i + frame_length_samples - current_chunk_start >= max_chunk_length_samples):
                # Check if the chunk is long enough to be saved
                if last_valid_end and last_valid_end - current_chunk_start >= min_chunk_length_samples:
                    chunks.append((current_chunk_start, last_valid_end + overlap_samples))  # Save with overlap
                current_chunk_start = None
            else:
                # Update the end of the current chunk
                last_valid_end = i + frame_length_samples

            # If the frame is not silent and the maximum chunk length has been reached
            if not is_silent and i + frame_length_samples - current_chunk_start >= max_chunk_length_samples:
                chunks.append((current_chunk_start, min(len(audio), last_valid_end + overlap_samples)))  # Save with overlap
                current_chunk_start = None

        is_previous_frame_silent = is_silent

    # Saving the last chunk, if necessary
    if current_chunk_start is not None and last_valid_end - current_chunk_start >= min_chunk_length_samples:
        chunks.append((current_chunk_start, min(len(audio), last_valid_end + overlap_samples)))  # Save with overlap

    return chunks
To fully implement this preprocessing pipeline, the complete code, along with all necessary supporting functions, is available on GitHub. This resource allows you to access, review, and utilize the code to ensure your audio data is optimally prepared for training with Whisper or any other ASR system. By following the steps outlined, you’ll end up with a well-organized audio directory — each preprocessed audio file will have its own subdirectory containing all the corresponding chunks. This structure facilitates easy access and management of your processed data.
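As a sketch of how the pieces above might be wired together, the following hypothetical driver loads one recording, runs analyze_chunks, and writes every chunk into a subdirectory named after the source file, mirroring the directory layout just described. The paths, the two-second minimum chunk length, and the 200 ms overlap are assumptions chosen for illustration, not part of the published code:

import os

import librosa
import soundfile as sf


def preprocess_file(path, out_root="processed", sr=16000):
    # Load and resample the recording (path and values are placeholders)
    audio, sr = librosa.load(path, sr=sr)

    frame_length_samples = int(0.030 * sr)      # 30 ms analysis frames
    min_chunk_length_samples = int(2.0 * sr)    # discard chunks shorter than 2 s (assumed)
    max_chunk_length_samples = int(30.0 * sr)   # Whisper's 30 s ceiling
    overlap_samples = int(0.2 * sr)             # 200 ms tail against clipped words (assumed)

    chunks = analyze_chunks(
        audio, sr,
        min_chunk_length_samples, max_chunk_length_samples,
        frame_length_samples, overlap_samples,
    )

    # One subdirectory per source file, one WAV file per chunk
    out_dir = os.path.join(out_root, os.path.splitext(os.path.basename(path))[0])
    os.makedirs(out_dir, exist_ok=True)
    for n, (start, end) in enumerate(chunks):
        sf.write(os.path.join(out_dir, f"chunk_{n:03d}.wav"), audio[start:end], sr)


preprocess_file("recording.wav")

The overlap appended to each chunk acts as a small safety margin so that words straddling a chunk boundary are less likely to be clipped.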
What’s next?
As the title of this post, “Finetuning OpenAI’s Whisper: Creating Your Own Custom Dataset (I),” suggests, this topic is extensive enough to warrant a two-part series, given the depth of technical detail involved. We have now established a ready-to-use audio directory whose contents are well suited for fine-tuning.
However, finetuning Whisper goes beyond just preparing audio; it also necessitates the creation of “ground truth” transcriptions for each audio snippet used. This step is critical but can be incredibly time-consuming, presenting a significant hurdle when you’re keen to learn and experiment with finetuning Whisper. To address this, the next installment of this guide will delve into the intricacies of generating ground truths for your audio files. We’ll also discuss how to correlate these transcriptions with the audio files through a comprehensive metadata file.
Finally, we’ll develop a custom script to load your dataset effectively and integrate it seamlessly into the Hugging Face platform. This approach will not only streamline the process but also enhance your ability to fine-tune Whisper with precision and ease. Stay tuned for these valuable insights in the next part of our series. If you have any questions or ideas, feel free to leave a comment!
Literature:
[1] Labied et al., 2022, p. 2
[2] Tazi & El Makhfi, 2018, p. 1529