Fine-Tuning Whisper-Small for Hindi Speech Recognition (Hindi ASR)

Source: Derived from Resources/Task Assignment | AI Researcher Intern- Speech & Audio | Josh Talks.pdf — a real take-home assessment for the AI Researcher Intern (Speech & Audio) role at Josh Talks.

Skills Required

Python — comfortable scripting, working with file I/O, and downloading remote assets.
PyTorch — the underlying deep-learning framework that Whisper runs on (tensors, GPU placement, mixed precision).
HuggingFace Transformers — WhisperForConditionalGeneration, WhisperProcessor, WhisperFeatureExtractor, WhisperTokenizer, and Seq2SeqTrainer.
HuggingFace Datasets — loading/streaming audio datasets, cast_column with the Audio feature, .map() preprocessing, and pulling google/fleurs for evaluation.
Audio preprocessing — resampling to 16 kHz, converting raw audio to log-Mel spectrograms, handling mono/stereo, trimming silence, filtering by duration, and dealing with noisy field recordings.
Whisper model family — understanding the encoder-decoder architecture, language/task tokens, the 30-second input window, and whisper-small specifically.
WER / evaluation metrics — computing Word Error Rate (and ideally Character Error Rate) with the evaluate library; understanding text normalization and why it changes the score for Hindi (Devanagari) text.
Data engineering — fetching audio + transcript pairs from cloud URLs (Google Cloud Storage), joining metadata, building a clean train/validation split, and detecting corrupt or mismatched samples.
GPU / Colab tooling — running training on a single GPU (Google Colab, Kaggle, or a cloud VM), managing memory with gradient checkpointing and fp16, and checkpointing long runs.
Hindi / Devanagari text handling — Unicode awareness, normalization of diacritics, and the risks of over-normalizing low-resource-language text.
Soft skills — clearly documenting the preprocessing decisions, presenting results in a structured table, and writing a short, honest analysis of what worked and what did not.

Background a Student Needs

You should know basic Python and have at least an introductory understanding of deep learning and neural networks. Familiarity with the HuggingFace ecosystem (loading a pretrained model, running .from_pretrained(), and using a Trainer) is very helpful, as is a conceptual grasp of how speech recognition turns audio into text. You do not need to train a model from scratch — Whisper is pretrained — but you should understand what fine-tuning means, why a 16 kHz sampling rate matters, and how Word Error Rate measures transcription quality. Access to a single GPU (a free Colab or Kaggle GPU is enough for whisper-small) is required.

Task Summary

You are given roughly 10 hours of labeled Hindi speech (audio files plus ground-truth transcripts). Your job is to clean and prepare that data, fine-tune OpenAI's pretrained whisper-small model on it, and then measure how much the fine-tuning helped. You evaluate both the original pretrained model and your fine-tuned version on the Hindi portion of Google's FLEURS test set, and report the Word Error Rate of each in a clear comparison table.

The Task

1. Understand the dataset

You receive ~10 hours of Hindi ASR training data. Each record has the following schema:

| Field | Meaning |
| --- | --- |
| `user_id` | Anonymized identifier for the speaker. |
| `recording_id` | Unique identifier for each audio recording. |
| `language` | Language label (e.g. `"hi"` for Hindi). |
| `duration` | Length of the audio in seconds; useful for filtering and batching. |
| `rec_url_gcp` | URL to the raw audio file in cloud storage (e.g. Google Cloud Storage). This is the model input. |
| `transcription_url` | URL to the ground-truth transcript text. This is the training label. |
| `metadata_url` | URL to extra metadata (device type, noise level, accent, collection conditions). Optional for training but useful for analysis. |
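
A minimal sketch of how one manifest row could be represented and validated in Python. The `Recording` class, the `parse_row` helper, and the example IDs and URLs are hypothetical names introduced for illustration; the source only specifies the field names and their meaning.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Recording:
    """One manifest row, mirroring the schema table above."""
    user_id: str
    recording_id: str
    language: str
    duration: float                 # seconds
    rec_url_gcp: str                # audio file (model input)
    transcription_url: str          # ground-truth transcript (label)
    metadata_url: Optional[str] = None


REQUIRED = ("user_id", "recording_id", "language", "duration",
            "rec_url_gcp", "transcription_url")


def parse_row(row: Mapping[str, object]) -> Recording:
    """Validate a raw row and convert it to a Recording.

    Raises ValueError when a required field is missing or empty, so bad rows
    surface before any download starts.
    """
    missing = [key for key in REQUIRED if row.get(key) in (None, "")]
    if missing:
        raise ValueError(f"row is missing required fields: {missing}")
    return Recording(
        user_id=str(row["user_id"]),
        recording_id=str(row["recording_id"]),
        language=str(row["language"]),
        duration=float(row["duration"]),
        rec_url_gcp=str(row["rec_url_gcp"]),
        transcription_url=str(row["transcription_url"]),
        metadata_url=str(row["metadata_url"]) if row.get("metadata_url") else None,
    )


# Hypothetical row: the IDs and URLs are placeholders, not real assets.
example = parse_row({
    "user_id": "u_001",
    "recording_id": "r_0001",
    "language": "hi",
    "duration": "12.4",
    "rec_url_gcp": "https://storage.example.com/audio/r_0001",
    "transcription_url": "https://storage.example.com/transcripts/r_0001",
    "metadata_url": "",
})
print(example.recording_id, example.duration, example.metadata_url)
```

Keeping `recording_id` as the join key makes it straightforward to pair each downloaded audio file with its transcript later on.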

2. Preprocess the dataset (deliverable a)

Download the audio and transcripts, pair them by recording_id, and prepare them for Whisper. Document exactly what you did. A solid pipeline typically includes:

Resampling all audio to 16 kHz mono (the rate Whisper expects).
Converting raw waveforms to log-Mel spectrograms with WhisperFeatureExtractor.
Tokenizing the Hindi transcripts with WhisperTokenizer configured for language="Hindi", task="transcribe".
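Filtering out unusable samples: empty/missing transcripts, clips longer than Whisper's 30-second input window (the feature extractor truncates longer audio, so the transcript would no longer match what the model hears), duration/transcript mismatches, or corrupt downloads.
Optionally normalizing the Devanagari text. Be careful: aggressive normalization can strip vowel signs (matras) and other combining marks that carry meaning.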
Building a clean train/validation split (e.g. hold out a small slice for monitoring).
Briefly using the metadata (noise level, accents) to comment on data quality.
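
The filtering and splitting steps above can be sketched as follows. This is an illustration under stated assumptions, not a reference pipeline: the `transcript` key, the 1-second lower bound, the 10% validation fraction, and the decision to split by speaker rather than by recording are all assumptions.

```python
import hashlib

MAX_SECONDS = 30.0   # Whisper's input window
MIN_SECONDS = 1.0    # assumed lower bound; tune for your data


def drop_reason(sample):
    """Return None if the sample is usable, else a short reason string."""
    text = (sample.get("transcript") or "").strip()
    duration = sample.get("duration")
    if not text:
        return "empty_transcript"
    if duration is None or duration <= 0:
        return "bad_duration"
    if duration < MIN_SECONDS:
        return "too_short"
    if duration > MAX_SECONDS:
        return "too_long"
    return None


def split_of(user_id, val_fraction=0.1):
    """Deterministically assign a speaker to 'train' or 'validation'.

    Hashing the speaker ID keeps all recordings from one speaker on the same
    side of the split, so validation voices are unseen during training.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "validation" if bucket < val_fraction else "train"


# Toy records for illustration only.
samples = [
    {"recording_id": "r1", "user_id": "u1", "duration": 8.0, "transcript": "नमस्ते"},
    {"recording_id": "r2", "user_id": "u1", "duration": 45.0, "transcript": "बहुत लंबा"},
    {"recording_id": "r3", "user_id": "u2", "duration": 6.5, "transcript": "  "},
]
for s in samples:
    reason = drop_reason(s)
    status = f"dropped ({reason})" if reason else split_of(s["user_id"])
    print(s["recording_id"], status)
```

Recording the drop reason for every sample, rather than silently discarding it, gives the numbers needed for the preprocessing write-up.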

Write up these decisions clearly — the preprocessing report is itself a graded deliverable.
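
For the feature-extraction and tokenization steps, a sketch along the lines of the HuggingFace fine-tuning guide listed in the reference links might look like this. The `transcript` column name and the helper names are assumptions; `transformers` and `datasets` are imported lazily so the helpers can be defined without those packages loaded.

```python
TARGET_SR = 16_000  # Whisper expects 16 kHz input


def load_processor(model_name="openai/whisper-small"):
    """Load Whisper's feature extractor and tokenizer, configured for Hindi."""
    from transformers import WhisperProcessor  # lazy import: heavy dependency
    return WhisperProcessor.from_pretrained(
        model_name, language="Hindi", task="transcribe"
    )


def prepare_example(example, processor):
    """Convert one record into log-Mel input features and label token IDs.

    example["audio"] follows the HuggingFace Audio feature layout:
    {"array": waveform, "sampling_rate": int}. The "transcript" key is an
    assumed column name.
    """
    audio = example["audio"]
    if audio["sampling_rate"] != TARGET_SR:
        raise ValueError(
            f"expected {TARGET_SR} Hz audio, got {audio['sampling_rate']}"
        )
    features = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    )
    return {
        "input_features": features.input_features[0],
        "labels": processor.tokenizer(example["transcript"]).input_ids,
    }


def prepare_dataset(dataset, processor):
    """Resample the audio column to 16 kHz, then map prepare_example over it."""
    from datasets import Audio  # lazy import
    dataset = dataset.cast_column("audio", Audio(sampling_rate=TARGET_SR))
    return dataset.map(
        lambda ex: prepare_example(ex, processor),
        remove_columns=dataset.column_names,
    )
```

A typical call would be `prepare_dataset(train_dataset, load_processor())`, where `train_dataset` is a HuggingFace `Dataset` with `audio` and `transcript` columns.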

3. Fine-tune Whisper-small (deliverable b)

Starting from openai/whisper-small:

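Load the model with WhisperForConditionalGeneration.from_pretrained("openai/whisper-small") and set the generation language and task to Hindi and transcribe (in recent Transformers releases, through model.generation_config.language and model.generation_config.task).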
Use a speech-to-text data collator that pads input features and masks label padding with -100 so it is ignored in the loss.
Train with Seq2SeqTrainer / Seq2SeqTrainingArguments: a small learning rate (≈1e-5, roughly 40x smaller than pretraining), warmup steps, fp16, gradient_checkpointing, predict_with_generate=True, and metric_for_best_model="wer" together with greater_is_better=False, since a lower WER is better.
Evaluate periodically and keep the best checkpoint (load_best_model_at_end=True).
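
Two of the points above can be made concrete with a short sketch. The first function is a pure-Python illustration of the label handling inside the collator: pad each label sequence to the batch length and fill the padding with -100. The collator in the HuggingFace guide does the equivalent with tensors (`tokenizer.pad` followed by `masked_fill` on the attention mask). The second part lists training arguments as a plain dictionary. All numeric values and the output path are illustrative placeholders, not tuned settings.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips targets with this value


def pad_and_mask_labels(label_seqs):
    """Right-pad label sequences to equal length, using -100 for the padding.

    Masking by position (not by token value) keeps a genuine end-of-text
    token in the loss even if the tokenizer reuses it as the pad token.
    """
    max_len = max(len(seq) for seq in label_seqs)
    return [list(seq) + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]


TRAINING_KWARGS = dict(
    output_dir="./whisper-small-hi",   # hypothetical path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=2000,                    # placeholder; size to your data
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",             # older releases call this evaluation_strategy
    eval_steps=500,
    save_steps=500,                    # must line up with eval_steps for best-model loading
    predict_with_generate=True,
    generation_max_length=225,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,           # lower WER is better
)


def build_training_args():
    """Create Seq2SeqTrainingArguments from the dictionary above."""
    from transformers import Seq2SeqTrainingArguments  # lazy import
    return Seq2SeqTrainingArguments(**TRAINING_KWARGS)


print(pad_and_mask_labels([[5, 6, 7], [8]]))
```

If GPU memory is tight, halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size unchanged.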

4. Evaluate and report WER (deliverable c)

Evaluate two models on the test split of the Hindi configuration (hi_in) of google/fleurs:

The pretrained whisper-small baseline (no fine-tuning).
Your fine-tuned model.

Compute Word Error Rate for both using the evaluate library's wer metric, with a consistent normalization scheme. Present results in a structured table, for example:

| Model | Test Set | WER (%) |
| --- | --- | --- |
| `whisper-small` (pretrained baseline) | FLEURS Hindi test | _your number_ |
| `whisper-small` (fine-tuned) | FLEURS Hindi test | _your number_ |

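A working fine-tune usually lowers WER relative to the baseline, though the gain may be modest because the training audio (noisy field recordings) differs in domain from FLEURS (read speech). Conclude with a short analysis explaining the improvement (or lack thereof) and any limitations.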
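
To make the metric concrete, here is a dependency-free sketch of corpus-level WER: total word-level edits (substitutions, deletions, insertions) divided by total reference words. The `wer` metric in the evaluate library reports the same quantity as a fraction, so this is meant for understanding and sanity checks rather than as a replacement. The Hindi strings are toy examples, not real model outputs.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """Total word-level edits divided by total reference words."""
    errors = 0
    n_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        n_words += len(ref_words)
    return errors / n_words


# Toy strings for illustration only; these are not real model outputs.
references = ["मैं घर जा रहा हूँ", "आज मौसम अच्छा है"]
outputs = {
    "system A (toy)": ["मैं घर जा रहा हूं", "आज मौसम अच्छा"],
    "system B (toy)": ["मैं घर जा रहा हूँ", "आज मौसम अच्छा हैं"],
}
for name, predictions in outputs.items():
    print(f"{name}\ttoy sentences\t{100 * corpus_wer(references, predictions):.2f}")
```

Apply the same normalization to references and predictions for both models before calling the metric; otherwise the two rows of the results table are not comparable.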

Alternate Tasks (Mini-Project Variations)

1. (Beginner) Baseline-only WER report. Skip fine-tuning entirely and focus on the evaluation half of the pipeline. Load the pretrained whisper-small model, run inference on the Hindi portion of the FLEURS test set, and compute Word Error Rate. Produce a clean results table and a short write-up comparing two text-normalization strategies (raw vs. Whisper-style normalization) and how each affects the WER number for Devanagari text. This teaches you how ASR models are loaded, how transcription inference works end-to-end, and why "the metric" is never a single fixed number — it depends on how you normalize. It is the natural first rung because it requires no GPU training, just inference, yet it forces you to handle real audio loading, batching, and metric computation.
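
The effect of normalization on Devanagari can be illustrated with two small normalizers built on Python's `unicodedata` module. Both functions are hypothetical helpers for this sketch. The second one is a simplified stand-in for aggressive, mark-stripping normalization and is not Whisper's actual normalizer.

```python
import unicodedata


def normalize_conservative(text):
    """NFC-normalize and replace punctuation (including the danda) with spaces.

    Vowel signs and other combining marks are kept.
    """
    text = unicodedata.normalize("NFC", text)
    chars = (" " if unicodedata.category(c).startswith("P") else c for c in text)
    return " ".join("".join(chars).split())


def normalize_mark_stripping(text):
    """Replace every mark, symbol, and punctuation character with a space.

    On Devanagari this also removes vowel signs (matras), which are Unicode
    marks, so distinct words can collapse into the same string.
    """
    text = unicodedata.normalize("NFKC", text)
    chars = (" " if unicodedata.category(c)[0] in "MSP" else c for c in text)
    return " ".join("".join(chars).split())


# Toy pair: the hypothesis confuses "कि" (that) with "की" (of) and drops the danda.
reference = "उसने कहा कि वह आएगा।"
hypothesis = "उसने कहा की वह आएगा"

for name, normalize in [("conservative", normalize_conservative),
                        ("mark-stripping", normalize_mark_stripping)]:
    ref_n, hyp_n = normalize(reference), normalize(hypothesis)
    print(f"{name}: ref={ref_n!r} hyp={hyp_n!r} identical={ref_n == hyp_n}")
```

With the conservative normalizer the कि / की confusion should still count as an error, while the mark-stripping variant maps both words to the bare consonant and hides it. That is the sense in which over-normalization can make a WER look better than the transcripts really are.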

2. (Beginner-Intermediate) Build a clean audio-preprocessing pipeline. Take the raw Josh Talks-style dataset schema (audio URL + transcript URL + metadata) and build a robust, reusable preprocessing script that downloads each pair, resamples to 16 kHz mono, computes durations, filters out corrupt or mismatched samples, and emits a tidy HuggingFace Dataset ready for training. Add a small data-quality dashboard (counts of dropped samples, duration histogram, noise-level breakdown from metadata). The deliverable is the pipeline plus a one-page "data report." This rung deepens the data-engineering muscle that real ASR work depends on — most of the effort in speech projects is data cleaning, not modeling.
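
A small sketch of the numbers such a data report could contain, using only the standard library. The `drop_reason` and `noise_level` keys are assumed field names (the source only says that metadata includes noise level), and the records are toy values.

```python
from collections import Counter


def data_report(samples, bin_seconds=5):
    """Summarize kept/dropped counts, a duration histogram, and noise levels."""
    dropped = Counter(s["drop_reason"] for s in samples if s.get("drop_reason"))
    kept = [s for s in samples if not s.get("drop_reason")]
    # Bucket durations by the lower edge of their bin, e.g. 12.0 s -> 10.
    histogram = Counter(int(s["duration"] // bin_seconds) * bin_seconds for s in kept)
    noise = Counter(s.get("noise_level", "unknown") for s in kept)
    return {
        "total": len(samples),
        "kept": len(kept),
        "kept_hours": sum(s["duration"] for s in kept) / 3600,
        "dropped_by_reason": dict(dropped),
        "duration_histogram": dict(sorted(histogram.items())),
        "noise_levels": dict(noise),
    }


# Toy records for illustration only.
samples = [
    {"duration": 4.2, "noise_level": "low"},
    {"duration": 12.0, "noise_level": "high"},
    {"duration": 13.5},
    {"duration": 41.0, "drop_reason": "too_long"},
    {"duration": 0.0, "drop_reason": "empty_transcript"},
]
report = data_report(samples)
print(f"kept {report['kept']} of {report['total']} samples")
for start, count in report["duration_histogram"].items():
    print(f"{start:>3}s+ {'#' * count}")
```

The text histogram is enough for a one-page report; a plotting library can replace it once the counts are trusted.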

3. (Intermediate-Advanced) Full fine-tune with an ablation study. Do the complete assignment — fine-tune whisper-small on the ~10 hours of Hindi data and report baseline vs. fine-tuned WER on FLEURS — but go further by running a small ablation. Compare at least two training configurations (e.g. different learning rates, with vs. without data filtering, or whisper-tiny vs. whisper-small), and additionally report Character Error Rate alongside WER. Present everything in a single comparison table and write a paragraph on the trade-offs you observed (training time, GPU memory, accuracy). This is the advanced ML rung: it mirrors how a researcher actually iterates, not just runs one job, and teaches you to reason about experiments rather than chase a single score.
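
For the ablation table, a sketch of a character-level error rate and a table formatter that leaves unmeasured cells as placeholders. Two assumptions to note: characters are counted as Unicode code points (so a Devanagari vowel sign counts separately from its consonant) and spaces are included. The configuration names are examples only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_cer(references, predictions):
    """Total character-level edits divided by total reference characters."""
    errors = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    n_chars = sum(len(r) for r in references)
    return errors / n_chars


def format_table(rows):
    """Render rows as a tab-separated table; missing scores stay as placeholders."""
    lines = ["Configuration\tWER (%)\tCER (%)"]
    for row in rows:
        scores = [
            f"{100 * row[key]:.2f}" if row.get(key) is not None else "_your number_"
            for key in ("wer", "cer")
        ]
        lines.append("\t".join([row["name"], *scores]))
    return "\n".join(lines)


# Example configurations; fill in "wer" and "cer" (as fractions) after each run.
rows = [
    {"name": "whisper-small, lr=1e-5, filtered data"},
    {"name": "whisper-small, lr=1e-5, unfiltered data"},
    {"name": "whisper-tiny, lr=1e-5, filtered data"},
]
print(format_table(rows))
```

CER is often more forgiving than WER for a single wrong vowel sign: the whole word counts as one word error, but only one of its code points counts as a character error.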

4. (Intermediate, Agentic AI twist) Voice-driven assistant agent. Wrap a Whisper model (pretrained or your fine-tuned Hindi checkpoint) as the "ears" of an agentic AI. Build an agent that accepts a spoken Hindi audio clip, transcribes it with Whisper, then routes the transcript to an LLM-based tool-using agent that can answer questions, look something up, or trigger a function (e.g. "set a reminder," "summarize this"). The deliverable is a small command-line or notebook agent loop demonstrating speech -> transcription -> reasoning -> action. This twist pushes you from pure ML into the Agentic AI track: you learn how speech models become one tool inside a larger agent pipeline, and you confront real issues like transcription errors propagating into agent decisions.
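
The speech -> transcription -> reasoning -> action loop could be skeletonized as below. This is a sketch under heavy assumptions: keyword matching stands in for the LLM's tool selection, the tool functions are toy placeholders, and every function name here is hypothetical. Only `make_whisper_transcriber` touches a real library, through the Transformers `pipeline` API, and it is imported lazily.

```python
def make_whisper_transcriber(model_name="openai/whisper-small"):
    """Return a function audio_path -> transcript backed by a Whisper pipeline."""
    from transformers import pipeline  # lazy import: heavy dependency
    asr = pipeline("automatic-speech-recognition", model=model_name)

    def transcribe(audio_path):
        result = asr(
            audio_path,
            generate_kwargs={"language": "hindi", "task": "transcribe"},
        )
        return result["text"].strip()

    return transcribe


def route(transcript, tools):
    """Pick the first tool whose trigger keyword occurs in the transcript.

    A deliberately crude stand-in for an LLM deciding which tool to call.
    """
    for keyword, tool in tools.items():
        if keyword in transcript:
            return keyword, tool
    return None, None


def run_agent(audio_path, transcribe, tools):
    """One pass of the loop: transcribe, choose a tool, act."""
    transcript = transcribe(audio_path)
    keyword, tool = route(transcript, tools)
    if tool is None:
        # Do not act on a transcript that maps to nothing: ask again instead.
        return {"transcript": transcript, "action": "clarify",
                "result": "No matching tool; ask the user to repeat."}
    return {"transcript": transcript, "action": keyword, "result": tool(transcript)}


# Demo with a fake transcriber so the loop runs without a model or audio file.
demo_tools = {
    "remind": lambda text: "reminder stored",
    "summarize": lambda text: f"summary of {len(text.split())} words",
}
print(run_agent("clip.wav", lambda path: "please remind me tomorrow", demo_tools))
print(run_agent("clip.wav", lambda path: "unintelligible", demo_tools))
```

The explicit "clarify" branch is one simple way to limit the damage when a transcription error would otherwise trigger the wrong action.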

5. (Advanced, MERN full-stack twist) Web transcription studio. Build a MERN (MongoDB, Express, React, Node) web application that exposes your transcription model as a service. Users upload or record an audio clip in a React front end; an Express/Node back end sends it to a Whisper inference endpoint (your fine-tuned Hindi model served via a small Python FastAPI microservice, called from Node), and the transcript is stored in MongoDB and displayed with an editable text area so users can correct it. Add a history view of past transcriptions and a simple WER display when a reference transcript is provided. This is the most ambitious rung and lives squarely in the full-stack track: you integrate an ML model into a production-style web app, handle file uploads and async jobs, and design a usable UI — the kind of end-to-end product engineering that turns a model into something people actually use.

Reference Links

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers (HuggingFace blog) — The canonical step-by-step guide. Walks through 16 kHz resampling, WhisperProcessor, the speech-to-text data collator, the compute_metrics WER function, and Seq2SeqTrainer. Its worked example is literally Hindi and reports a baseline-to-fine-tuned WER drop, making it the closest match to this assignment.
HuggingFace Audio Course — Fine-tuning the ASR model — A structured, course-style version of the same workflow with more explanation of each component; good if the blog moves too fast.
vasistalodagala/whisper-finetune (GitHub) — Ready-to-use scripts to fine-tune and evaluate Whisper on custom datasets and HuggingFace datasets such as google/fleurs, reporting both WER and CER per utterance. Excellent reference for the data-prep and evaluation scripts this task needs.
vasista22/whisper-hindi-small (HuggingFace model) — A published Whisper-small model fine-tuned on Hindi from multiple public corpora; useful as a sanity-check reference for the WER range you should expect.
Breaking language barriers: Fine-tuning Whisper for Hindi (Collabora) — A practical write-up specifically on Hindi, including a discussion of Whisper-normalization vs. Indic-normalization and how each affects WER on Devanagari text — directly relevant to deliverable (c).
yash072/Whisper-small-finetuned-hindi (HuggingFace model) — Another community Hindi fine-tune (on Common Voice), handy as a comparison point and for seeing realistic training hyperparameters.
Fine-tune Whisper models on Amazon SageMaker with LoRA (AWS blog) — Optional advanced reading on parameter-efficient fine-tuning (LoRA) for Whisper, relevant if you want to try the ablation variation cheaply.