Finetuning OpenAI’s Whisper: creating your own custom dataset (I)
If you haven’t yet, make sure to check out the introduction to this guide series:
In the quest to fine-tune Whisper for optimal performance, the quality of the data you use stands out as the cornerstone of success. While the robustness of any machine learning model hinges on several factors, the caliber of its training data invariably casts the longest shadow over its efficacy. Indeed, in a world abundant with readily accessible high-quality datasets — like Mozilla’s “Common Voice” — the question arises: Why go through the effort of creating your own custom dataset?
The answer lies in the unique requirements of specialized applications. Public datasets, though extensive and diverse, often lack the depth needed to capture the subtle intricacies and complex nuances required by fields such as medicine or law. In these domains, even minor inaccuracies can lead to significant repercussions, underscoring the need for a meticulously tailored approach to data preparation.
As previously mentioned in this series, the effectiveness of an Automatic Speech Recognition (ASR) system in specialized settings hinges not just on understanding broad linguistic patterns, but on grasping the specific jargon and contextual nuances of its intended application. This part of our series will guide you through the crucial steps of crafting a dataset that not only trains Whisper to achieve general competence but hones its abilities to deliver precision where it counts most.
Table of Contents
- How to preprocess your audio
- Applying the preprocessing techniques
- What’s next?
How to preprocess your audio
The first step in crafting a high-quality dataset for fine-tuning Whisper is the careful preprocessing of your audio files. Whisper operates within specific constraints — one key limitation being that it can only process audio clips up to 30 seconds in length. Any audio beyond this duration isn’t just truncated; it’s completely disregarded in the training process. This introduces a critical challenge: effectively segmenting your audio into manageable, coherent chunks that fit within this time frame.
When preparing audio for fine-tuning Whisper, it’s essential to consider not just the mechanical splitting of audio into segments but the preservation of contextual integrity and acoustic quality. Simply dividing an audio file into 30-second chunks without regard for the content can disrupt the linguistic flow and lead to the loss of crucial information at the borders between segments. Moreover, silence within these segments can skew the model’s training, focusing its learning on irrelevant pauses rather than useful speech.
To navigate these challenges, there is one crucial piece of information to keep in mind. Human speech, when broken down into very short segments of around 30 milliseconds (ms), can generally be treated as stationary [1]. During such a brief interval, the vocal apparatus (larynx, tongue, lips, etc.) does not have the physical capability to produce dramatic changes in sound. This property is crucial for developing effective audio preprocessing strategies because it lets us make assumptions about the continuity and stability of speech within these short windows.
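To make these constraints concrete, here is a small sketch, using assumed values chosen for illustration, that converts the 30 ms frame length and the 30-second ceiling into sample counts, the unit all of the functions below operate on (Whisper models expect audio resampled to 16 kHz):

SAMPLE_RATE = 16000                 # Whisper expects 16 kHz input audio
FRAME_MS = 30                       # frame length justified by the stationarity assumption
MAX_CHUNK_SECONDS = 30              # Whisper's hard limit per training example

# Convert durations from milliseconds/seconds into sample counts
frame_length_samples = SAMPLE_RATE * FRAME_MS // 1000       # 480 samples
max_chunk_length_samples = SAMPLE_RATE * MAX_CHUNK_SECONDS   # 480,000 samples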
All of these preprocessing steps can be achieved through carefully crafted Python code. But how do we embark on this coding journey? To help you visualize and understand the sequence of tasks and their interconnections, I’ve prepared a flowchart. This diagram will guide you through the functionality of the code you will develop by the end of this guide.
Applying the preprocessing techniques
To commence the audio preprocessing for Whisper, a crucial first step is setting appropriate thresholds for voice activity detection (VAD) within your audio files. A typical approach, though fraught with potential inaccuracies, is to combine a fixed decibel threshold, such as -30 dB, with a fixed duration, such as 600 milliseconds: any stretch of audio that does not exceed -30 dB for at least 600 milliseconds is removed as silence. This approach is problematic because it presumes a uniform loudness profile across all audio files, which is seldom the case.
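For contrast, a minimal sketch of this fixed-threshold approach could look as follows, using librosa's amplitude-based splitter; the file name, sample rate, and the way short pauses are bridged are assumptions made purely for illustration:

import librosa

# Naive fixed-threshold VAD, shown only for contrast with the dynamic approach below
audio, sr = librosa.load("recording.wav", sr=16000)   # placeholder file name

# Regions that stay more than 30 dB below the file's peak are treated as silence
intervals = librosa.effects.split(audio, top_db=30)

# Only pauses longer than 600 ms count as silence; shorter gaps are bridged
min_silence_samples = int(0.6 * sr)
speech_regions = [list(intervals[0])]
for start, end in intervals[1:]:
    if start - speech_regions[-1][1] < min_silence_samples:
        speech_regions[-1][1] = end             # bridge a short gap
    else:
        speech_regions.append([start, end])     # a real pause: start a new region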
To address these challenges more effectively, we will utilize two more informative metrics: the Zero Crossing Rate (ZCR) and the energy level of the audio. Instead of depending on a fixed decibel threshold, we dynamically calculate the mean ZCR and energy for each individual audio file. To enhance precision, we set our thresholds at the mean plus one standard deviation of these measurements. In practice, for an audio frame to be classified as silent, it must fall below both the ZCR and the energy threshold. This dual check ensures that we only deem a frame silent when it truly exhibits low levels of both vocal activity and sound intensity, minimizing the risk of erroneously discarding important audio content [2]. The compute_dynamic_thresholds function below establishes these two thresholds; it becomes fully useful once the remaining components of the pipeline are in place.
import librosa
import numpy as np


def compute_dynamic_thresholds(audio, sr, frame_length_samples, segment_length_samples):
    """
    Computes thresholds for energy and Zero-Crossing Rate (ZCR).

    The thresholds are determined from the mean and standard deviation
    of the energy and ZCR within individual segments of the audio signal.
    """
    # Initialization of lists for energy thresholds and ZCR thresholds
    energy_thresholds = []
    zcr_thresholds = []

    # Iterating over the audio signal in segments
    for i in range(0, len(audio), segment_length_samples):
        # Extract the current segment from the audio signal
        segment = audio[i:i + segment_length_samples]
        # Calculate the number of frames in the current segment
        n_frames = max(int(len(segment) / frame_length_samples), 1)
        # Calculate the energy for each frame in the segment
        energy = np.array([np.sum(segment[j * frame_length_samples:(j + 1) * frame_length_samples] ** 2) for j in range(n_frames)])
        # Calculate the Zero-Crossing Rate for each frame in the segment
        zcr = np.array([np.sum(librosa.zero_crossings(segment[j * frame_length_samples:(j + 1) * frame_length_samples], pad=False)) for j in range(n_frames)])
        # Add the energy threshold to the list:
        # the mean plus the standard deviation of the energy in the segment
        energy_thresholds.append(np.mean(energy) + np.std(energy))
        # Add the ZCR threshold to the list:
        # the mean plus the standard deviation of the ZCR in the segment
        zcr_thresholds.append(np.mean(zcr) + np.std(zcr))

    # Return the lists of thresholds
    return energy_thresholds, zcr_thresholds
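A brief, hypothetical usage example follows; the file name is a placeholder, and the choice of five equal segments per file mirrors what the analyze_chunks function below does internally:

# Hypothetical usage: derive per-segment thresholds for one recording
audio, sr = librosa.load("recording.wav", sr=16000)   # placeholder file name

frame_length_samples = int(0.030 * sr)       # 30 ms analysis frames
segment_length_samples = len(audio) // 5     # five segments per file, as used below

energy_thresholds, zcr_thresholds = compute_dynamic_thresholds(
    audio, sr, frame_length_samples, segment_length_samples
)
print(f"{len(energy_thresholds)} segment-level threshold pairs computed")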
Following the establishment of our dynamic thresholds using the ZCR and energy levels, the subsequent step involves implementing the logic for VAD and the assembly of audio frames into coherent chunks. This critical portion of the code is where our preprocessing strategy truly takes shape.
In the provided analyze_chunks function, the audio is methodically divided into 30 ms frames, which are then classified as either active or silent based on the calculated energy and ZCR thresholds. Active frames are grouped into chunks, while silent frames serve as signals to terminate the current chunk. This approach not only helps retain contextual information within each chunk but also minimizes the loss of important linguistic details at the edges of audio windows. By carefully defining and detecting the end points of speech segments, we ensure that the dataset prepared for training Whisper is of high quality, with each chunk representing a clear and complete snippet of speech.
def analyze_chunks(audio, sr, min_chunk_length_samples, max_chunk_length_samples, frame_length_samples, overlap_samples):
    """
    Analyzes the audio to find chunks based on energy and Zero-Crossing Rate (ZCR).

    This function divides the audio into frames and determines whether each frame is silent or active.
    Active frames are aggregated into chunks. Silent frames signal the end of a chunk,
    taking into account the specified minimum and maximum chunk lengths.
    """
    # Dividing the audio into five segments for threshold calculation
    segment_length_samples = len(audio) // 5
    # Compute dynamic thresholds for energy and ZCR for each segment
    energy_thresholds, zcr_thresholds = compute_dynamic_thresholds(audio, sr, frame_length_samples, segment_length_samples)

    chunks = []                        # Initialization of the list for the found chunks
    current_chunk_start = None         # Starting point of the current chunk
    last_valid_end = None              # Endpoint of the last valid frame in the current chunk
    is_previous_frame_silent = False   # State of the previous frame

    # Iterating over the audio in frame steps
    for i in range(0, len(audio), frame_length_samples):
        segment_index = i // segment_length_samples
        segment_index = min(segment_index, len(energy_thresholds) - 1)  # Prevent index overflow

        # Assigning thresholds for the current frame
        energy_threshold = energy_thresholds[segment_index]
        zcr_threshold = zcr_thresholds[segment_index]

        # Extracting the current frame
        frame = audio[i:min(i + frame_length_samples, len(audio))]

        # Calculating energy and ZCR for the frame
        frame_energy = np.sum(frame ** 2)
        frame_zcr = np.sum(librosa.zero_crossings(frame, pad=False))

        # Determining whether the frame is silent
        is_silent = frame_energy <= energy_threshold and frame_zcr <= zcr_threshold

        # If no chunk has started and the frame is not silent, start a new chunk
        if current_chunk_start is None and not is_silent:
            current_chunk_start = i    # Start of a new chunk
            last_valid_end = i + frame_length_samples

        # If a chunk is active
        if current_chunk_start is not None:
            # End the chunk on sustained silence (two silent frames in a row)
            # or when the maximum chunk length is reached during a silent frame
            if is_silent and (is_previous_frame_silent or i + frame_length_samples - current_chunk_start >= max_chunk_length_samples):
                # Check if the chunk is long enough to be saved
                if last_valid_end and last_valid_end - current_chunk_start >= min_chunk_length_samples:
                    chunks.append((current_chunk_start, last_valid_end + overlap_samples))  # Save with overlap
                current_chunk_start = None
            else:
                # Update the end of the current chunk
                last_valid_end = i + frame_length_samples

            # If the frame is not silent and the maximum chunk length has been reached
            if not is_silent and i + frame_length_samples - current_chunk_start >= max_chunk_length_samples:
                chunks.append((current_chunk_start, min(len(audio), last_valid_end + overlap_samples)))  # Save with overlap
                current_chunk_start = None

        is_previous_frame_silent = is_silent

    # Saving the last chunk, if necessary
    if current_chunk_start is not None and last_valid_end - current_chunk_start >= min_chunk_length_samples:
        chunks.append((current_chunk_start, min(len(audio), last_valid_end + overlap_samples)))  # Save with overlap

    return chunks
To fully implement this preprocessing pipeline, the complete code, along with all necessary supporting functions, is available on GitHub. This resource allows you to access, review, and utilize the code to ensure your audio data is optimally prepared for training with Whisper or any other ASR system. By following the steps outlined, you’ll end up with a well-organized audio directory — each preprocessed audio file will have its own subdirectory containing all the corresponding chunks. This structure facilitates easy access and management of your processed data.
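As a sketch of how the pieces above might be wired together, the following hypothetical driver loads one recording, runs analyze_chunks, and writes every chunk into a subdirectory named after the source file, mirroring the directory layout just described. The paths, the two-second minimum chunk length, and the 200 ms overlap are assumptions chosen for illustration, not part of the published code:

import os

import librosa
import soundfile as sf


def preprocess_file(path, out_root="processed", sr=16000):
    # Load and resample the recording (path and values are placeholders)
    audio, sr = librosa.load(path, sr=sr)

    frame_length_samples = int(0.030 * sr)      # 30 ms analysis frames
    min_chunk_length_samples = int(2.0 * sr)    # discard chunks shorter than 2 s (assumed)
    max_chunk_length_samples = int(30.0 * sr)   # Whisper's 30 s ceiling
    overlap_samples = int(0.2 * sr)             # 200 ms tail against clipped words (assumed)

    chunks = analyze_chunks(
        audio, sr,
        min_chunk_length_samples, max_chunk_length_samples,
        frame_length_samples, overlap_samples,
    )

    # One subdirectory per source file, one WAV file per chunk
    out_dir = os.path.join(out_root, os.path.splitext(os.path.basename(path))[0])
    os.makedirs(out_dir, exist_ok=True)
    for n, (start, end) in enumerate(chunks):
        sf.write(os.path.join(out_dir, f"chunk_{n:03d}.wav"), audio[start:end], sr)


preprocess_file("recording.wav")

The overlap appended to each chunk acts as a small safety margin so that words straddling a chunk boundary are less likely to be clipped.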
What’s next?
As the title of this post, “Finetuning OpenAI’s Whisper: Creating Your Own Custom Dataset (I),” suggests, this topic is extensive enough to warrant a two-part series, given the depth of technical detail involved. We have now established a ready-to-use audio directory whose contents are well suited for fine-tuning.
However, finetuning Whisper goes beyond just preparing audio; it also necessitates the creation of “ground truth” transcriptions for each audio snippet used. This step is critical but can be incredibly time-consuming, presenting a significant hurdle when you’re keen to learn and experiment with finetuning Whisper. To address this, the next installment of this guide will delve into the intricacies of generating ground truths for your audio files. We’ll also discuss how to correlate these transcriptions with the audio files through a comprehensive metadata file.
Finally, we’ll develop a custom script to load your dataset effectively and integrate it seamlessly into the Hugging Face platform. This approach will not only streamline the process but also enhance your ability to fine-tune Whisper with precision and ease. Stay tuned for these valuable insights in the next part of our series. If you have any questions or ideas, feel free to leave a comment!
Literature:
[1] Labied et al., 2022, p. 2
[2] Tazi & El Makhfi, 2018, p. 1529