OpenAI Whisper is a speech-to-text model focused on transcription quality. The stock model works well for most languages, but even better results can be obtained through fine-tuning.

This article is a complete guide to preparing a dataset for fine-tuning Whisper, running the fine-tuning itself, and deploying the fine-tuned model efficiently.

Toolchain Preparation

Before anything else, a few tools need to be installed.

Pip Dependencies

Bash
pip install transformers datasets huggingface-hub accelerate evaluate tensorboard

Login to HuggingFace

After installing the pip dependencies, log in via huggingface-cli.

Bash
huggingface-cli login

You will need an API token to log in; huggingface-cli prints a URL explaining how to create one. Make sure the token has write access, since we will push both a dataset and a model to the Hub.
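
If you prefer to log in from Python instead (for example, inside a notebook), the huggingface_hub library also exposes a login() helper. A minimal sketch; the token value is a placeholder:

Python
from huggingface_hub import login

# Paste a write-access token here (placeholder value).
login(token="hf_xxxxxxxxxxxxxxxxxxxx")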

Dataset Preparation

Whisper is trained with supervision, so the dataset must provide both the audio and its label (the transcribed text) for every audio segment.

The easiest way to prepare an audio dataset is with HuggingFace AudioFolder. Prepare your directory like so:

Plaintext
folder/train/metadata.jsonl
folder/train/first.mp3
folder/train/second.mp3
folder/train/third.mp3

Note that not all audio formats are supported; m4a, for example, does not work. For organizational purposes, audio files can be placed in subfolders.
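
If some of your recordings are in an unsupported format such as m4a, convert them before building the dataset. A rough sketch using pydub (not among the dependencies installed above, and it requires ffmpeg on the system):

Python
from pydub import AudioSegment

# Convert an m4a recording to mp3 so that AudioFolder can read it.
AudioSegment.from_file("folder/train/fourth.m4a").export("folder/train/fourth.mp3", format="mp3")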

In metadata.jsonl:

JSONL
{"file_name": "first.mp3", "transcription": "First Audio Transcription"}
{"file_name": "second.mp3", "transcription": "Second Audio Transcription"}
{"file_name": "third.mp3", "transcription": "Third Audio Transcription"}
  • JSONL means JSON Lines: each line is a JSON object, so the whole file can be read as an array of JSON objects.
  • The file_name column must be named exactly file_name. It is the path to the audio file, relative to the location of metadata.jsonl.
  • The names of the other columns are up to you. The dataset is ultimately converted into Arrow tables (Parquet files), much like Pandas DataFrames, and each additional field in metadata.jsonl becomes a column in the table.
  • You can have as many extra columns as you like; a small script for generating metadata.jsonl is sketched below.
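
If your transcriptions already live in code, a few lines of Python can generate metadata.jsonl for you. A minimal sketch; the transcriptions mapping below is hypothetical:

Python
import json

# Hypothetical mapping of audio file names to their transcriptions.
transcriptions = {
    "first.mp3": "First Audio Transcription",
    "second.mp3": "Second Audio Transcription",
    "third.mp3": "Third Audio Transcription",
}

with open("folder/train/metadata.jsonl", "w", encoding="utf-8") as f:
    for file_name, text in transcriptions.items():
        f.write(json.dumps({"file_name": file_name, "transcription": text}, ensure_ascii=False) + "\n")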

Finally, once the dataset is prepared, use the following Python code to build it and push it to the HuggingFace Hub.

Python
from datasets import load_dataset
audio_dataset = load_dataset("audiofolder", data_dir=".")  # Run this from inside "folder", or point data_dir at it
audio_dataset.push_to_hub("YOUR_HF_NAME/HF_DATASET_REPO") # Replace this with your Huggingface Repository

This will read the audio files, convert them into Parquet format, automatically generate a README.md that contains dataset metadata, and push the dataset to HuggingFace Hub.
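
As a quick sanity check, you can reload the dataset from the Hub and inspect it. A minimal sketch, assuming the same repository name as above:

Python
from datasets import load_dataset

# Reload the pushed dataset and inspect its size, columns, and first few labels.
ds = load_dataset("YOUR_HF_NAME/HF_DATASET_REPO", split="train")
print(ds)
print(ds["transcription"][:3])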

Finetuning

A more detailed guide to fine-tuning Whisper can be found here: https://huggingface.co/blog/fine-tune-whisper. Here we reproduce its code listing with some minor modifications.

Python
# NOTE: The base model you want to fine-tune. Fine-tuning the large-v3 model requires around 32GB of VRAM.
base_model = "openai/whisper-large-v3"

# NOTE: Don't change this unless you are fine-tuning for translation and your dataset is labeled with English translations.
task = "transcribe"

from datasets import load_dataset, DatasetDict

# ========== Load Dataset ==========
tl_dataset = DatasetDict()
tl_dataset["train"] = load_dataset("YOUR_HF_NAME/HF_DATASET_REPO", split="train")
# NOTE: If you have a test split, uncomment the following.
# tl_dataset["test"] = load_dataset("YOUR_HF_NAME/HF_DATASET_REPO", split="test")

# ========== Load Whisper Preprocessor ==========

from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer
from transformers import WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained(base_model)
tokenizer = WhisperTokenizer.from_pretrained(base_model, task=task)
processor = WhisperProcessor.from_pretrained(base_model, task=task)

# ========== Process Dataset ==========

from datasets import Audio

tl_dataset = tl_dataset.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    # load the audio and resample it to 16kHz
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    # NOTE: "transcription" is the transcription column name we used in metadata.jsonl. Change this if you used a different column name.
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch

tl_dataset = tl_dataset.map(prepare_dataset, remove_columns=tl_dataset.column_names["train"], num_proc=8)

# ========== Load Whisper Model ==========

from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained(base_model)
model.generation_config.task = task
model.generation_config.forced_decoder_ids = None

# ========== Fine-tune model ==========

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here since it is appended again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
    
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-ft-train",  # change to a repo name of your choice
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    num_train_epochs=2.0,
    # warmup_steps=500,
    # max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    do_eval=False,
    # eval_strategy="steps",    # NOTE: If you have a test split, you can uncomment this.
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    # save_steps=1000,
    # eval_steps=1000,
    logging_steps=5,
    report_to=["tensorboard"],
    load_best_model_at_end=False,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=tl_dataset["train"],
    # eval_dataset=tl_dataset["test"], # NOTE: If you have a test split, you can uncomment this.
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

processor.save_pretrained(training_args.output_dir)

trainer.train()

# ========== Save model ==========

trainer.save_model(output_dir="./whisper-large-v3-finetuned")
torch.save(model.state_dict(), f"{training_args.output_dir}/pytorch_model.bin")

# ========== Push model to HF hub ==========
# If you do not want to push the model to the Hub, or your training machine does not have an Internet connection, comment this out.
# NOTE: The model is pushed to the repository given by hub_model_id in Seq2SeqTrainingArguments, or, if unset, to a repository named after output_dir under your account.
trainer.push_to_hub()

Note that this code refers to several directories, which we will need later:

  • Training Data Directory: ./whisper-large-v3-ft-train
  • Output Model Directory: ./whisper-large-v3-finetuned

Preparation

Before your fine-tuned model can be used, there are some things you need to configure.

  1. tokenizer.json from the original Whisper model may not be copied over automatically; you need to do that yourself. Go to HuggingFace, find the original Whisper model you fine-tuned from (such as openai/whisper-large-v3), download tokenizer.json, and put it in the “Output Model Directory”. If you pushed the model with push_to_hub() and tokenizer.json is not present in YOUR_HF_NAME/HF_MODEL_REPO, you can also upload it manually through the HuggingFace Web UI.
  2. If tokenizer_config.json is not present in the “Output Model Directory”, copy it over from the “Training Data Directory”.
  3. If preprocessor_config.json is not present in the “Output Model Directory”, copy it over from the “Training Data Directory”. (A scripted version of these steps is sketched after this list.)
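
If you prefer to script these steps, here is a minimal sketch using huggingface_hub (assuming a recent version that supports the local_dir argument) and the directory names from the training code:

Python
import shutil
from huggingface_hub import hf_hub_download

train_dir = "./whisper-large-v3-ft-train"     # Training Data Directory
output_dir = "./whisper-large-v3-finetuned"   # Output Model Directory

# 1. Fetch tokenizer.json from the base model and place it in the output directory.
hf_hub_download(repo_id="openai/whisper-large-v3", filename="tokenizer.json", local_dir=output_dir)

# 2. and 3. Copy the processor configs saved by processor.save_pretrained() during training.
for name in ("tokenizer_config.json", "preprocessor_config.json"):
    shutil.copy(f"{train_dir}/{name}", f"{output_dir}/{name}")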

Deploying

There are many ways to deploy the fine-tuned model.

With HuggingFace

The fine-tuned model can be loaded just like the original Whisper model via the HuggingFace from_pretrained() function. This approach is faster than the openai-whisper package, at the cost of higher VRAM consumption.

Python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("YOUR_HF_NAME/HF_MODEL_REPO")
model = WhisperForConditionalGeneration.from_pretrained("YOUR_HF_NAME/HF_MODEL_REPO")
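
For a quick end-to-end test, the fine-tuned model can also be wrapped in a transformers ASR pipeline. A minimal sketch; the audio path is a placeholder:

Python
from transformers import pipeline

# Build an ASR pipeline around the fine-tuned model (placeholder repo name and audio path).
asr = pipeline(
    "automatic-speech-recognition",
    model="YOUR_HF_NAME/HF_MODEL_REPO",
    device=0,  # first GPU; use device=-1 for CPU
)
result = asr("audio.mp3", return_timestamps=True)
print(result["text"])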

With the original openai-whisper package

The model can be converted to be compatible with the openai-whisper PyPI package.

In the training code, we saved the final model weights in PyTorch format to "Training Data Directory"/pytorch_model.bin. However, the layer names of the HuggingFace Whisper model differ from those of the original OpenAI Whisper implementation on GitHub (https://github.com/openai/whisper). This can be fixed with a simple renaming.

Use the following code, adapted from https://github.com/openai/whisper/discussions/830:

Python
#!/bin/env python3
import whisper
import re
import torch

# NOTE: Change this to the base model you fine-tuned from.
BASE_MODEL = "large-v3"

def hf_to_whisper_states(text):
    text = re.sub('.layers.', '.blocks.', text)
    text = re.sub('.self_attn.', '.attn.', text)
    text = re.sub('.q_proj.', '.query.', text)
    text = re.sub('.k_proj.', '.key.', text)
    text = re.sub('.v_proj.', '.value.', text)
    text = re.sub('.out_proj.', '.out.', text)
    text = re.sub('.fc1.', '.mlp.0.', text)
    text = re.sub('.fc2.', '.mlp.2.', text)
    text = re.sub('.fc3.', '.mlp.3.', text)
    text = re.sub('.encoder_attn.', '.cross_attn.', text)
    text = re.sub('.cross_attn.ln.', '.cross_attn_ln.', text)
    text = re.sub('.embed_positions.weight', '.positional_embedding', text)
    text = re.sub('.embed_tokens.', '.token_embedding.', text)
    text = re.sub('model.', '', text)
    text = re.sub('attn.layer_norm.', 'attn_ln.', text)
    text = re.sub('.final_layer_norm.', '.mlp_ln.', text)
    text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text)
    text = re.sub('decoder.layer_norm.', 'decoder.ln.', text)
    text = re.sub('proj_out.weight', 'decoder.token_embedding.weight', text)
    return text

# Load HF Model
# NOTE: Change the following line to point to "Training Data Directory"/pytorch_model.bin
hf_state_dict = torch.load("Training Data Directory/pytorch_model.bin", map_location=torch.device('cpu'))

# Rename layers
for key in list(hf_state_dict.keys())[:]:
    new_key = hf_to_whisper_states(key)
    hf_state_dict[new_key] = hf_state_dict.pop(key)

# Load the original OpenAI checkpoint only to obtain its dimension metadata.
model = whisper.load_model(BASE_MODEL)

# Save it
# NOTE: This will save file to whisper-model.bin. Change the path as you wish.
torch.save({
    "dims": model.dims.__dict__,
    "model_state_dict": hf_state_dict
}, "whisper-model.bin")

Then, the model can be loaded with whisper.load_model("whisper-model.bin").
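
Once converted, the model can be used like any stock openai-whisper checkpoint. A minimal sketch with a placeholder audio path:

Python
import whisper

# Load the converted checkpoint and transcribe a file.
model = whisper.load_model("whisper-model.bin")
result = model.transcribe("audio.mp3")
print(result["text"])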

Faster-Whisper (CTranslate2)

The most efficient way of deploying the Whisper model is probably with the faster-whisper package. We will need to convert our model into yet another format.

First, install the conversion tools we will need:

Bash
git clone --depth=1 https://github.com/SYSTRAN/faster-whisper
cd faster-whisper
pip install -e .[convert] # In zsh, quote ".[convert]"

Then we can perform the conversion:

Bash
ct2-transformers-converter \
    --model YOUR_HF_NAME/HF_MODEL_REPO \
    --output_dir whisper-large-v3-ft-ct2-f16 \
    --copy_files tokenizer.json preprocessor_config.json \
    --quantization float16
  • CTranslate2 models are saved as a directory, not a single file. Change whisper-large-v3-ft-ct2-f16 to whatever output directory you want.
  • Quantization is not required. If your target platform does not support efficient float16 computation, or you simply don't want to quantize, omit the --quantization flag.

Then, the fine-tuned model can be loaded just like any faster-whisper model.

Python
from faster_whisper import WhisperModel

model = WhisperModel("/path/to/model/directory", device="cuda", compute_type="float16")
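
To actually transcribe something, here is a fuller sketch, assuming the conversion output directory from above and a placeholder audio path:

Python
from faster_whisper import WhisperModel

# Load the converted CTranslate2 model and transcribe a file.
model = WhisperModel("whisper-large-v3-ft-ct2-f16", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")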

Happy finetuning!
