Fine-Tuning Whisper-Small for Hindi Speech Recognition (Hindi ASR)

Source: Derived from Resources/Task Assignment | AI Researcher Intern- Speech & Audio | Josh Talks.pdf — a real take-home assessment for the AI Researcher Intern (Speech & Audio) role at Josh Talks.

Skills Required

Background a Student Needs

You should know basic Python and have at least an introductory understanding of deep learning and neural networks. Familiarity with the HuggingFace ecosystem (loading a pretrained model with .from_pretrained() and using a Trainer) is very helpful, as is a conceptual grasp of how speech recognition turns audio into text. You do not need to train a model from scratch, because Whisper is already pretrained, but you should understand what fine-tuning means, why a 16 kHz sampling rate matters (Whisper's feature extractor expects 16 kHz input), and how Word Error Rate measures transcription quality. You need access to a single GPU; a free Colab or Kaggle GPU is enough for whisper-small.

Task Summary

You are given roughly 10 hours of labeled Hindi speech (audio files plus ground-truth transcripts). Your job is to clean and prepare that data, fine-tune OpenAI's pretrained whisper-small model on it, and then measure how much the fine-tuning helped. You evaluate both the original pretrained model and your fine-tuned version on the Hindi portion of Google's FLEURS test set, and report the Word Error Rate of each in a clear comparison table.

The Task

1. Understand the dataset

You receive ~10 hours of Hindi ASR training data. Each record has the following schema:

| Field | Meaning |
|---|---|
| user_id | Anonymized identifier for the speaker. |
| recording_id | Unique identifier for each audio recording. |
| language | Language label (e.g. "hi" for Hindi). |
| duration | Length of the audio in seconds; useful for filtering and batching. |
| rec_url_gcp | URL to the raw audio file in cloud storage (e.g. Google Cloud Storage). This is the model input. |
| transcription_url | URL to the ground-truth transcript text. This is the training label. |
| metadata_url | URL to extra metadata (device type, noise level, accent, collection conditions). Optional for training but useful for analysis. |
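
The schema above can be mirrored in a small typed record, which makes later filtering explicit. The following is a minimal sketch, not the assignment's reference code: the class name `Recording`, the helper names, and the 1-second lower bound are assumptions chosen for illustration. The 30-second upper bound reflects the fixed 30-second window that Whisper processes at a time.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Recording:
    """One row of the training manifest, using the field names from the schema."""
    user_id: str
    recording_id: str
    language: str
    duration: float          # seconds
    rec_url_gcp: str         # audio (model input)
    transcription_url: str   # transcript (training label)
    metadata_url: str = ""   # optional, used for analysis only


def select_trainable(records: Iterable[Recording],
                     language: str = "hi",
                     min_s: float = 1.0,     # assumed threshold, tune for your data
                     max_s: float = 30.0) -> List[Recording]:
    """Keep records in the target language whose duration fits the window."""
    return [r for r in records
            if r.language == language and min_s <= r.duration <= max_s]


def total_hours(records: Iterable[Recording]) -> float:
    """Sum of durations, converted from seconds to hours."""
    return sum(r.duration for r in records) / 3600.0
```

Comparing `total_hours` before and after `select_trainable` gives a quick number for the preprocessing report: how much of the ~10 hours survived filtering.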

2. Preprocess the dataset (deliverable a)

Download the audio and transcripts, pair them by recording_id, and prepare them for Whisper. Document exactly what you did. A solid pipeline typically includes:

Write up these decisions clearly — the preprocessing report is itself a graded deliverable.
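
One possible shape for the pairing step is sketched below. It assumes the audio and transcripts have already been downloaded and indexed by recording_id; the function name, the dictionary layout, and the drop-reason labels are hypothetical. Resampling to 16 kHz mono is not shown and would be handled by an audio library of your choice.

```python
import unicodedata
from typing import Dict, List, Tuple


def pair_by_recording_id(
    audio_paths: Dict[str, str],   # recording_id -> local audio path
    transcripts: Dict[str, str],   # recording_id -> raw transcript text
) -> Tuple[List[dict], Dict[str, int]]:
    """Pair audio with transcripts and count every sample that is dropped."""
    paired: List[dict] = []
    dropped = {"missing_transcript": 0, "empty_transcript": 0, "missing_audio": 0}

    for rid in sorted(audio_paths):
        text = transcripts.get(rid)
        if text is None:
            dropped["missing_transcript"] += 1
            continue
        # NFC keeps Devanagari sequences in one canonical form;
        # split/join collapses stray whitespace and newlines.
        text = " ".join(unicodedata.normalize("NFC", text).split())
        if not text:
            dropped["empty_transcript"] += 1
            continue
        paired.append({"recording_id": rid, "audio": audio_paths[rid], "text": text})

    # Transcripts whose audio never arrived are counted but cannot be used.
    dropped["missing_audio"] = len(set(transcripts) - set(audio_paths))
    return paired, dropped
```

The `dropped` counts can be copied straight into the preprocessing write-up, so every excluded sample is accounted for.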

3. Fine-tune Whisper-small (deliverable b)

Starting from openai/whisper-small:
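
A minimal sketch of the starting point, assuming the HuggingFace `transformers` package is installed. The helper name `load_whisper_for_hindi` is hypothetical. Training hyperparameters are deliberately left out because the brief does not specify them.

```python
MODEL_ID = "openai/whisper-small"
LANGUAGE = "hindi"
TASK = "transcribe"


def load_whisper_for_hindi(model_id: str = MODEL_ID):
    """Load the pretrained checkpoint and a processor configured for Hindi."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    # The processor bundles the feature extractor (audio -> log-mel features)
    # and the tokenizer (text <-> token ids) for the chosen language and task.
    processor = WhisperProcessor.from_pretrained(model_id, language=LANGUAGE, task=TASK)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    return model, processor
```

Calling `load_whisper_for_hindi()` downloads the checkpoint on first use. The same function can load the untouched baseline for evaluation, which helps keep the two models comparable.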

4. Evaluate and report WER (deliverable c)

Evaluate two models on the Hindi split of the google/fleurs test set:

  1. The pretrained whisper-small baseline (no fine-tuning).
  2. Your fine-tuned model.

Compute Word Error Rate for both using the evaluate library's wer metric, with a consistent normalization scheme. Present results in a structured table, for example:

| Model | Test Set | WER (%) |
|---|---|---|
| whisper-small (pretrained baseline) | FLEURS Hindi test | _your number_ |
| whisper-small (fine-tuned) | FLEURS Hindi test | _your number_ |

A correct fine-tune should noticeably reduce WER versus the baseline. Conclude with a short analysis explaining the improvement (or lack thereof) and any limitations.
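
To make the metric concrete, here is a dependency-free sketch of Word Error Rate: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. It is meant for checking your understanding, not as a replacement for the evaluate library's wer metric named above. The function names are hypothetical, and pooling errors over the whole test set (rather than averaging per-utterance scores) is a choice made in this sketch.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = cur
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """WER in percent, pooled over all utterances."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    if words == 0:
        raise ValueError("references contain no words")
    return 100.0 * errors / words


refs = ["मैं घर जा रहा हूँ", "यह अच्छा है"]
hyps = ["मैं घर जा रहा हूँ", "यह बुरा है"]
print(corpus_wer(refs, hyps))  # 1 substitution over 8 reference words -> 12.5
```

Whatever normalization you choose must be applied to both references and hypotheses, and identically for both models, before the text reaches the metric.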

Alternate Tasks (Mini-Project Variations)

1. (Beginner) Baseline-only WER report. Skip fine-tuning entirely and focus on the evaluation half of the pipeline. Load the pretrained whisper-small model, run inference on the Hindi portion of the FLEURS test set, and compute Word Error Rate. Produce a clean results table and a short write-up comparing two text-normalization strategies (raw vs. Whisper-style normalization) and how each affects the WER number for Devanagari text. This teaches you how ASR models are loaded, how transcription inference works end-to-end, and why "the metric" is never a single fixed number — it depends on how you normalize. It is the natural first rung because it requires no GPU training, just inference, yet it forces you to handle real audio loading, batching, and metric computation.

2. (Beginner-Intermediate) Build a clean audio-preprocessing pipeline. Take the raw Josh Talks-style dataset schema (audio URL + transcript URL + metadata) and build a robust, reusable preprocessing script that downloads each pair, resamples to 16 kHz mono, computes durations, filters out corrupt or mismatched samples, and emits a tidy HuggingFace Dataset ready for training. Add a small data-quality dashboard (counts of dropped samples, duration histogram, noise-level breakdown from metadata). The deliverable is the pipeline plus a one-page "data report." This rung deepens the data-engineering muscle that real ASR work depends on — most of the effort in speech projects is data cleaning, not modeling.

3. (Intermediate-Advanced) Full fine-tune with an ablation study. Do the complete assignment — fine-tune whisper-small on the ~10 hours of Hindi data and report baseline vs. fine-tuned WER on FLEURS — but go further by running a small ablation. Compare at least two training configurations (e.g. different learning rates, with vs. without data filtering, or whisper-tiny vs. whisper-small), and additionally report Character Error Rate alongside WER. Present everything in a single comparison table and write a paragraph on the trade-offs you observed (training time, GPU memory, accuracy). This is the advanced ML rung: it mirrors how a researcher actually iterates, not just runs one job, and teaches you to reason about experiments rather than chase a single score.

4. (Intermediate, Agentic AI twist) Voice-driven assistant agent. Wrap a Whisper model (pretrained or your fine-tuned Hindi checkpoint) as the "ears" of an agentic AI. Build an agent that accepts a spoken Hindi audio clip, transcribes it with Whisper, then routes the transcript to an LLM-based tool-using agent that can answer questions, look something up, or trigger a function (e.g. "set a reminder," "summarize this"). The deliverable is a small command-line or notebook agent loop demonstrating speech -> transcription -> reasoning -> action. This twist pushes you from pure ML into the Agentic AI track: you learn how speech models become one tool inside a larger agent pipeline, and you confront real issues like transcription errors propagating into agent decisions.

5. (Advanced, MERN full-stack twist) Web transcription studio. Build a MERN (MongoDB, Express, React, Node) web application that exposes your transcription model as a service. Users upload or record an audio clip in a React front end; an Express/Node back end sends it to a Whisper inference endpoint (your fine-tuned Hindi model served via a small Python FastAPI microservice, called from Node), and the transcript is stored in MongoDB and displayed with an editable text area so users can correct it. Add a history view of past transcriptions and a simple WER display when a reference transcript is provided. This is the most ambitious rung and lives squarely in the full-stack track: you integrate an ML model into a production-style web app, handle file uploads and async jobs, and design a usable UI — the kind of end-to-end product engineering that turns a model into something people actually use.
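
For variation 1, the raw versus normalized comparison might start from something like the sketch below. Both functions are hypothetical stand-ins written for this example; `normalize_hindi` is not the normalizer shipped with Whisper. It removes punctuation and symbols (including the danda "।") but keeps combining marks, because Devanagari vowel signs are combining marks and stripping them would change the words.

```python
import unicodedata


def normalize_raw(text: str) -> str:
    """'Raw' scoring: only collapse whitespace, keep everything else."""
    return " ".join(text.split())


def normalize_hindi(text: str) -> str:
    """Assumed scheme: NFC, then replace punctuation and symbols with spaces."""
    text = unicodedata.normalize("NFC", text)
    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("P") or category.startswith("S"):
            cleaned.append(" ")   # punctuation / symbol -> word boundary
        else:
            cleaned.append(ch)    # letters, digits, combining marks stay
    return " ".join("".join(cleaned).split())


reference = "यह अच्छा है।"
hypothesis = "यह अच्छा है"

# Raw: the final word differs ("है।" vs "है"), so it counts as an error.
print(normalize_raw(reference).split() == normalize_raw(hypothesis).split())
# Normalized: the danda is removed and the two word sequences match.
print(normalize_hindi(reference).split() == normalize_hindi(hypothesis).split())
```

Scoring the same model output under both functions shows how much of the reported WER comes from punctuation alone, which is the point of the write-up in that variation.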

Reference Links