OpenAI Whisper is a speech-to-text model focused on transcription performance. The default model works well for most languages, but even better results can be obtained via fine-tuning.
This article is a complete guide covering how to prepare a dataset for fine-tuning Whisper, the fine-tuning process itself, and how to deploy the fine-tuned model efficiently.
Toolchain Preparation
There are a few tools you need to install first.
Pip Dependencies
pip install transformers datasets huggingface-hub accelerate evaluate tensorboard
Log in to HuggingFace
After installing the pip dependencies, log in via huggingface-cli:
huggingface-cli login
You will need a token to log in. huggingface-cli will print a URL explaining how to acquire an API token; make sure the token has write access.
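If you prefer to stay in Python, you can also log in programmatically. A minimal sketch (the token string is a placeholder):
from huggingface_hub import login

# Prompts for a token interactively; alternatively pass it explicitly:
# login(token="hf_...")
login()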
Dataset Preparation
Whisper is a supervised model, so the dataset must provide both the audio and its label (the transcribed text) for each audio segment.
The easiest way to prepare an audio dataset is with HuggingFace AudioFolder. Prepare your directory like so:
folder/train/metadata.jsonl
folder/train/first.mp3
folder/train/second.mp3
folder/train/third.mp3
Note that not all audio formats are supported; for example, m4a does not work. For organization purposes, audio files can be placed into subfolders.
In metadata.jsonl:
{"file_name": "first.mp3", "transcription": "First Audio Transcription"}
{"file_name": "second.mp3", "transcription": "Second Audio Transcription"}
{"file_name": "third.mp3", "transcription": "Third Audio Transcription"}
- JSONL means JSON Lines: each line is a JSON object, and the entire file can be seen as an array of JSON objects.
- The file_name column must be named exactly file_name. It is the path to the audio file, relative to the metadata.jsonl file.
- The names of the other columns are irrelevant. In the end, your dataset will be converted into Arrow tables (parquet files), similar to Pandas dataframes, and each remaining column becomes a column in the table.
- You can have as many other columns as you like.
Finally, once you have prepared the dataset, use the following Python code to build it and push it to the HuggingFace Hub.
from datasets import load_dataset
audio_dataset = load_dataset("audiofolder", data_dir=".")
audio_dataset.push_to_hub("YOUR_HF_NAME/HF_DATASET_REPO") # Replace this with your Huggingface Repository
This will read the audio files, convert them into Parquet format, automatically generate a README.md
that contains dataset metadata, and push the dataset to HuggingFace Hub.
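As an optional sanity check, you can load the dataset back from the Hub and inspect a row. A minimal sketch (the column names assume the metadata.jsonl above):
from datasets import load_dataset

ds = load_dataset("YOUR_HF_NAME/HF_DATASET_REPO", split="train")
print(ds[0]["transcription"])           # the label text
print(ds[0]["audio"]["sampling_rate"])  # decoded audio metadata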
Fine-tuning
A more detailed guide to fine-tuning Whisper can be found at https://huggingface.co/blog/fine-tune-whisper. Here we simply provide a copy of that code listing, with some minor modifications.
# NOTE: The base model you want to fine-tune. Fine-tuning the large-v3 model requires around 32GB of VRAM.
base_model = "openai/whisper-large-v3"
# NOTE: Don't change this, unless you are fine-tuning for translation and your dataset is labeled with English translations.
task = "transcribe"
from datasets import load_dataset, DatasetDict
# ========== Load Dataset ==========
tl_dataset = DatasetDict()
tl_dataset["train"] = load_dataset("YOUR_HF_NAME/HF_DATASET_REPO", split="train")
# NOTE: If you have a test split, uncomment the following.
# tl_dataset["test"] = load_dataset("metricv/tl-whisper", "hi", split="test")
# ========== Load Whisper Preprocessor ==========
from transformers import WhisperFeatureExtractor
from transformers import WhisperTokenizer
from transformers import WhisperProcessor
feature_extractor = WhisperFeatureExtractor.from_pretrained(base_model)
tokenizer = WhisperTokenizer.from_pretrained(base_model, task=task)
processor = WhisperProcessor.from_pretrained(base_model, task=task)
# ========== Process Dataset ==========
from datasets import Audio
tl_dataset = tl_dataset.cast_column("audio", Audio(sampling_rate=16000))
def prepare_dataset(batch):
    # load the audio data (the cast_column() call above resamples it to 16kHz on load)
    audio = batch["audio"]

    # compute log-Mel input features from the input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    # NOTE: Here, the key "transcription" refers to the transcription column we used in metadata.jsonl. Change this if you changed column names.
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch
tl_dataset = tl_dataset.map(prepare_dataset, remove_columns=tl_dataset.column_names["train"], num_proc=8)
# ========== Load Whisper Model ==========
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained(base_model)
model.generation_config.task = task
model.generation_config.forced_decoder_ids = None
# ========== Fine-tune model ==========
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore these tokens in the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the previous tokenization step,
        # cut it here since it is appended again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
import evaluate
metric = evaluate.load("wer")
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-ft-train",  # NOTE: This is the "Training Data Directory" referenced later; change it as you wish.
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    num_train_epochs=2.0,
    # warmup_steps=500,
    # max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    do_eval=False,
    # eval_strategy="steps",  # NOTE: If you have a test split, you can uncomment this.
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    # save_steps=1000,
    # eval_steps=1000,
    logging_steps=5,
    report_to=["tensorboard"],
    load_best_model_at_end=False,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    # hub_model_id="YOUR_HF_NAME/HF_MODEL_REPO",  # NOTE: Set this if you will call trainer.push_to_hub() at the end.
)
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=tl_dataset["train"],
    # eval_dataset=tl_dataset["test"],  # NOTE: If you have a test split, you can uncomment this.
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
processor.save_pretrained(training_args.output_dir)
trainer.train()
# ========== Save model ==========
trainer.save_model(output_dir="./whisper-large-v3-finetuned")
torch.save(model.state_dict(), f"{training_args.output_dir}/pytorch_model.bin")
# ========== Push model to HF hub ==========
# If you do not want to push the model to the Hub, or your training machine does not have an Internet connection, comment this out.
# NOTE: The destination repo is controlled by hub_model_id in Seq2SeqTrainingArguments above; the argument of push_to_hub() is only a commit message.
trainer.push_to_hub()
Note that this code refers to several directories; we will need them later.
- Training Data Directory:
./whisper-large-v3-ft-train
- Output Model Directory:
./whisper-large-v3-finetuned
Deployment Preparation
Before your fine-tuned model can be used, there are some things you need to configure.
- tokenizer.json from the original Whisper model may not be copied over; you need to do that yourself. Go to HuggingFace, find the original Whisper model (such as openai/whisper-large-v3), download tokenizer.json, and put it under the "Output Model Directory". If you used the push_to_hub() function to push the model and tokenizer.json is not present in YOUR_HF_NAME/HF_MODEL_REPO, you can upload it manually via the HuggingFace Web UI (a download/copy sketch follows this list).
- If not present, copy tokenizer_config.json from the "Training Data Directory" to the "Output Model Directory".
- If not present, copy preprocessor_config.json from the "Training Data Directory" to the "Output Model Directory".
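A minimal Python sketch of these steps, assuming the directory names used in the training code above:
from huggingface_hub import hf_hub_download
import os
import shutil

train_dir = "./whisper-large-v3-ft-train"   # Training Data Directory
out_dir = "./whisper-large-v3-finetuned"    # Output Model Directory

# Fetch tokenizer.json from the original base model into the output directory.
hf_hub_download(repo_id="openai/whisper-large-v3", filename="tokenizer.json", local_dir=out_dir)

# Copy the config files over only if they are missing.
for name in ["tokenizer_config.json", "preprocessor_config.json"]:
    if not os.path.exists(os.path.join(out_dir, name)):
        shutil.copy(os.path.join(train_dir, name), os.path.join(out_dir, name))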
Deploying
There are many ways to deploy the fine-tuned model.
With HuggingFace
The fine-tuned model can be loaded just like the original Whisper model via the HuggingFace from_pretrained() function. This approach will be faster than the openai-whisper package, but with higher VRAM consumption.
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("YOUR_HF_NAME/HF_MODEL_REPO")
model = WhisperForConditionalGeneration.from_pretrained("YOUR_HF_NAME/HF_MODEL_REPO")
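For quick inference, the high-level pipeline API also works with the fine-tuned repo. A minimal sketch ("audio.mp3" is a placeholder path):
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="YOUR_HF_NAME/HF_MODEL_REPO",
    chunk_length_s=30,  # Whisper operates on 30-second windows
    device=0,           # first GPU; drop this argument to run on CPU
)
print(asr("audio.mp3")["text"])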
With the original openai-whisper package
The model can be converted to be compatible with the openai-whisper
PyPI package.
In the training code, we saved the final model in PyTorch format to "Training Data Directory"/pytorch_model.bin. However, the layer names used by the Whisper model on HuggingFace differ from those used by the original OpenAI Whisper implementation on GitHub (https://github.com/openai/whisper). This can be fixed via a simple renaming.
Use the following code, contributed in https://github.com/openai/whisper/discussions/830:
#!/usr/bin/env python3
import whisper
import re
import torch

# NOTE: Change this to the base model you fine-tuned from.
BASE_MODEL = "large-v3"
def hf_to_whisper_states(text):
    text = re.sub('.layers.', '.blocks.', text)
    text = re.sub('.self_attn.', '.attn.', text)
    text = re.sub('.q_proj.', '.query.', text)
    text = re.sub('.k_proj.', '.key.', text)
    text = re.sub('.v_proj.', '.value.', text)
    text = re.sub('.out_proj.', '.out.', text)
    text = re.sub('.fc1.', '.mlp.0.', text)
    text = re.sub('.fc2.', '.mlp.2.', text)
    text = re.sub('.fc3.', '.mlp.3.', text)
    text = re.sub('.encoder_attn.', '.cross_attn.', text)
    text = re.sub('.cross_attn.ln.', '.cross_attn_ln.', text)
    text = re.sub('.embed_positions.weight', '.positional_embedding', text)
    text = re.sub('.embed_tokens.', '.token_embedding.', text)
    text = re.sub('model.', '', text)
    text = re.sub('attn.layer_norm.', 'attn_ln.', text)
    text = re.sub('.final_layer_norm.', '.mlp_ln.', text)
    text = re.sub('encoder.layer_norm.', 'encoder.ln_post.', text)
    text = re.sub('decoder.layer_norm.', 'decoder.ln.', text)
    text = re.sub('proj_out.weight', 'decoder.token_embedding.weight', text)
    return text
# Load HF Model
# NOTE: Change the following line to point to "Training Data Directory"/pytorch_model.bin
hf_state_dict = torch.load("Training Data Directory/pytorch_model.bin", map_location=torch.device('cpu'))
# Rename layers
for key in list(hf_state_dict.keys()):
    new_key = hf_to_whisper_states(key)
    hf_state_dict[new_key] = hf_state_dict.pop(key)
model = whisper.load_model(BASE_MODEL)
dims = model.dims
# Save it
# NOTE: This will save the converted model to whisper-model.bin. Change the path as you wish.
torch.save({
    "dims": model.dims.__dict__,
    "model_state_dict": hf_state_dict
}, "whisper-model.bin")
Then, the model can be loaded with the openai-whisper package via whisper.load_model("whisper-model.bin").
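For example (a sketch; "audio.mp3" is a placeholder file name):
import whisper

model = whisper.load_model("whisper-model.bin")
result = model.transcribe("audio.mp3")
print(result["text"])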
Faster-Whisper (CTranslate2)
The most efficient way of deploying the Whisper model is probably with the faster-whisper package. We will need to convert our model into yet another format.
First, install the tools we will need:
git clone --depth=1 https://github.com/SYSTRAN/faster-whisper
cd faster-whisper
pip install -e .[convert] # In zsh, quote ".[convert]"
Then we can perform the conversion:
ct2-transformers-converter \
--model YOUR_HF_NAME/HF_MODEL_REPO \
--output_dir whisper-large-v3-ft-ct2-f16 \
--copy_files tokenizer.json preprocessor_config.json \
--quantization float16
- CTranslate2 models are saved as a directory, not a single file. Change whisper-large-v3-ft-ct2-f16 to your target directory.
- Quantization is not required. If your target platform does not support efficient float16 computation, or you don't want to quantize, omit the --quantization flag.
Then, the fine-tuned model can be loaded just like any faster-whisper model.
from faster_whisper import WhisperModel
model = WhisperModel("/path/to/model/directory", device="cuda", compute_type="float16")
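Transcription then works as usual (a sketch; "audio.mp3" is a placeholder path):
segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")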
Happy fine-tuning!