Source: Derived from Resources/Task Assignment | AI Researcher Intern- Speech & Audio | Josh Talks.pdf — a real take-home assessment for the AI Researcher Intern (Speech & Audio) role at Josh Talks.
- WhisperForConditionalGeneration, WhisperProcessor, WhisperFeatureExtractor, WhisperTokenizer, and Seq2SeqTrainer.
- cast_column with the Audio feature, .map() preprocessing, and pulling google/fleurs for evaluation.
- whisper-small specifically.
- evaluate library; understanding text normalization and why it changes the score for Hindi (Devanagari) text.
- fp16, and checkpointing long runs.

You should know basic Python and have at least an introductory understanding of deep learning and neural networks. Familiarity with the HuggingFace ecosystem (loading a pretrained model, running .from_pretrained(), and using a Trainer) is very helpful, as is a conceptual grasp of how speech recognition turns audio into text. You do not need to train a model from scratch (Whisper is pretrained), but you should understand what fine-tuning means, why a 16 kHz sampling rate matters, and how Word Error Rate measures transcription quality. Access to a single GPU (a free Colab or Kaggle GPU is enough for whisper-small) is required.
You are given roughly 10 hours of labeled Hindi speech (audio files plus ground-truth transcripts). Your job is to clean and prepare that data, fine-tune OpenAI's pretrained whisper-small model on it, and then measure how much the fine-tuning helped. You evaluate both the original pretrained model and your fine-tuned version on the Hindi portion of Google's FLEURS test set, and report the Word Error Rate of each in a clear comparison table.
You receive ~10 hours of Hindi ASR training data. Each record has the following schema:
| Field | Meaning |
|---|---|
| user_id | Anonymized identifier for the speaker. |
| recording_id | Unique identifier for each audio recording. |
| language | Language label (e.g. "hi" for Hindi). |
| duration | Length of the audio in seconds; useful for filtering and batching. |
| rec_url_gcp | URL to the raw audio file in cloud storage (e.g. Google Cloud Storage). This is the model input. |
| transcription_url | URL to the ground-truth transcript text. This is the training label. |
| metadata_url | URL to extra metadata (device type, noise level, accent, collection conditions). Optional for training but useful for analysis. |
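Records with this schema can be typed and filtered on `duration` before any audio is downloaded. The following is a minimal sketch, assuming the manifest has already been parsed into dictionaries (the assignment does not specify a file format). The `Record` class, the `filter_by_duration` helper, the placeholder URLs, and the 1 to 30 second bounds are illustrative assumptions; the upper bound is chosen because Whisper processes audio in 30-second windows.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Record:
    """One manifest row; field names follow the schema table."""
    user_id: str
    recording_id: str
    language: str
    duration: float  # seconds
    rec_url_gcp: str
    transcription_url: str
    metadata_url: str


def filter_by_duration(records: Iterable[Record],
                       min_s: float = 1.0,
                       max_s: float = 30.0) -> tuple[list[Record], list[Record]]:
    """Split records into (kept, dropped) using assumed duration bounds."""
    kept: list[Record] = []
    dropped: list[Record] = []
    for rec in records:
        (kept if min_s <= rec.duration <= max_s else dropped).append(rec)
    return kept, dropped


def _row(recording_id: str, duration: float) -> dict:
    """Build a made-up manifest row; the example.com URLs are placeholders."""
    base = f"https://example.com/{recording_id}"
    return {"user_id": "u1", "recording_id": recording_id, "language": "hi",
            "duration": duration, "rec_url_gcp": f"{base}.wav",
            "transcription_url": f"{base}.json",
            "metadata_url": f"{base}_meta.json"}


records = [Record(**_row("r1", 12.4)),   # kept
           Record(**_row("r2", 0.3)),    # too short
           Record(**_row("r3", 45.0))]   # longer than the assumed 30 s cap
kept, dropped = filter_by_duration(records)
kept_hours = sum(r.duration for r in kept) / 3600
print(f"kept={len(kept)} dropped={len(dropped)} kept_hours={kept_hours:.4f}")
```

Returning the dropped records alongside the kept ones makes it easy to report how many samples each filter removed, which the preprocessing write-up will need.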
Download the audio and transcripts, pair them by recording_id, and prepare them for Whisper. Document exactly what you did. A solid pipeline typically includes:
- WhisperFeatureExtractor.
- WhisperTokenizer configured for language="Hindi", task="transcribe".

Write up these decisions clearly; the preprocessing report is itself a graded deliverable.
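The feature-extraction and tokenization steps can be sketched as a single per-example mapping function. This is one possible shape, not the assignment's prescribed code: the `prepare_example` name and the `"transcription"` column name are assumptions, and the extractor and tokenizer are passed in as arguments so the function itself has no heavy dependencies.

```python
from typing import Any, Callable

TARGET_SAMPLING_RATE = 16_000  # the sampling rate Whisper expects


def prepare_example(example: dict[str, Any],
                    feature_extractor: Callable[..., Any],
                    tokenizer: Callable[..., Any],
                    text_column: str = "transcription") -> dict[str, Any]:
    """Turn one decoded audio/transcript pair into model inputs and labels.

    `example["audio"]` is assumed to follow the HuggingFace `Audio` feature
    layout: a dict with "array" and "sampling_rate" keys.
    """
    audio = example["audio"]
    if audio["sampling_rate"] != TARGET_SAMPLING_RATE:
        raise ValueError(
            f"expected {TARGET_SAMPLING_RATE} Hz audio, got {audio['sampling_rate']} Hz; "
            "resample first (e.g. with cast_column and the Audio feature)"
        )
    # The extractor returns a batch of size one; take its only element.
    features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    return {
        "input_features": features.input_features[0],
        "labels": tokenizer(example[text_column]).input_ids,
    }


# With the real components (needs `transformers` and `datasets`, and downloads
# the checkpoint on first use), usage might look like:
#   from datasets import Audio
#   from transformers import WhisperFeatureExtractor, WhisperTokenizer
#   fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
#   tok = WhisperTokenizer.from_pretrained(
#       "openai/whisper-small", language="Hindi", task="transcribe")
#   ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
#   ds = ds.map(lambda ex: prepare_example(ex, fe, tok),
#               remove_columns=ds.column_names)
```

Failing loudly on a wrong sampling rate is a deliberate choice here: audio at the wrong rate would otherwise be processed silently and hurt accuracy in a way that is hard to trace.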
Starting from openai/whisper-small:
- Load the model with WhisperForConditionalGeneration.from_pretrained("openai/whisper-small") and set the generation language/task to Hindi/transcribe.
- Replace label padding with -100 so it is ignored in the loss.
- Train with Seq2SeqTrainer / Seq2SeqTrainingArguments: a small learning rate (≈1e-5, roughly 40x smaller than pretraining), warmup steps, fp16, gradient_checkpointing, predict_with_generate=True, and metric_for_best_model="wer".

Evaluate two models on the Hindi split of the google/fleurs test set:
- The pretrained whisper-small baseline (no fine-tuning).
- Your fine-tuned whisper-small model.

Compute Word Error Rate for both using the evaluate library's wer metric, with a consistent normalization scheme. Present results in a structured table, for example:
| Model | Test Set | WER (%) |
|---|---|---|
| whisper-small (pretrained baseline) | FLEURS Hindi test | _your number_ |
| whisper-small (fine-tuned) | FLEURS Hindi test | _your number_ |
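To make concrete what goes into each WER cell and why the normalization scheme must be consistent, here is a dependency-free sketch. It is a hand-rolled word-level edit distance, not the evaluate library's implementation that the assignment asks for, and the `normalize` function (NFC, strip punctuation, collapse whitespace) is one assumed scheme among many. The two sentence pairs are made up for illustration.

```python
import unicodedata
from typing import Callable, Optional, Sequence


def normalize(text: str) -> str:
    """Assumed scheme: NFC, drop punctuation (incl. the danda), squeeze spaces.

    Only Unicode punctuation categories are removed, so Devanagari vowel
    signs and other combining marks are left intact.
    """
    text = unicodedata.normalize("NFC", text)
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch
                   for ch in text)
    return " ".join(text.split())


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                           # deletion
                            curr[j - 1] + 1,                       # insertion
                            prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str],
               normalizer: Optional[Callable[[str], str]] = None) -> float:
    """Corpus-level WER in percent: total errors over total reference words."""
    total_errors = total_words = 0
    for ref, hyp in zip(references, hypotheses):
        if normalizer is not None:
            ref, hyp = normalizer(ref), normalizer(hyp)
        errors, words = word_errors(ref, hyp)
        total_errors += errors
        total_words += words
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_errors / total_words


refs = ["मेरा नाम राहुल है।", "आज मौसम अच्छा है"]
hyps = ["मेरा नाम राहुल है", "आज मौसम बहुत अच्छा है"]
print(f"raw WER:        {corpus_wer(refs, hyps):.1f}")             # 25.0
print(f"normalized WER: {corpus_wer(refs, hyps, normalize):.1f}")  # 12.5
```

In the first pair the only difference is the trailing danda, which counts as a full word substitution on raw text and disappears after normalization. With the evaluate library, the equivalent call is `evaluate.load("wer").compute(predictions=..., references=...)`, which returns a fraction rather than a percentage.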
A correct fine-tune should noticeably reduce WER versus the baseline. Conclude with a short analysis explaining the improvement (or lack thereof) and any limitations.
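If the fine-tune does not reduce WER, the label-masking step from the fine-tuning stage is worth re-checking. The sketch below shows the idea in plain Python lists rather than tensors: pad the label sequences, then overwrite only the padded positions with -100. The helper names and toy token ids are hypothetical. Masking by position instead of by token value matters when the pad token shares an id with a real token such as end-of-text. The block also collects the hyperparameters named in the fine-tuning step into a dictionary; the `warmup_steps` value is an assumption, since the text only says "warmup steps".

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips targets equal to -100 by default


def pad_labels(label_ids: list[list[int]],
               pad_token_id: int) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad every sequence to the batch maximum; also return a 1/0 mask."""
    max_len = max(len(ids) for ids in label_ids)
    padded = [ids + [pad_token_id] * (max_len - len(ids)) for ids in label_ids]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in label_ids]
    return padded, attention_mask


def mask_padding(padded: list[list[int]],
                 attention_mask: list[list[int]]) -> list[list[int]]:
    """Replace padded positions (mask == 0) with IGNORE_INDEX."""
    return [[tok if keep else IGNORE_INDEX for tok, keep in zip(row, mask_row)]
            for row, mask_row in zip(padded, attention_mask)]


# Toy ids: 2 stands in for an end-of-text token that doubles as the pad token.
EOT = 2
batch = [[5, 6, 7, EOT], [5, EOT]]
padded, attention_mask = pad_labels(batch, pad_token_id=EOT)
labels = mask_padding(padded, attention_mask)
print(labels)  # [[5, 6, 7, 2], [5, 2, -100, -100]]

# Hyperparameters from the fine-tuning step, as keyword arguments that could be
# passed to Seq2SeqTrainingArguments(output_dir="<your-dir>", **TRAINING_KWARGS).
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "warmup_steps": 500,            # assumed value, not given in the assignment
    "fp16": True,
    "gradient_checkpointing": True,
    "predict_with_generate": True,
    "metric_for_best_model": "wer",
    "greater_is_better": False,     # lower WER is better
}
```

Note that the genuine end-of-text token in each sequence survives while the padding after it is masked. In a PyTorch collator the same effect is typically achieved with `masked_fill` on the padded labels using the tokenizer's attention mask.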
1. (Beginner) Baseline-only WER report. Skip fine-tuning entirely and focus on the evaluation half of the pipeline. Load the pretrained whisper-small model, run inference on the Hindi portion of the FLEURS test set, and compute Word Error Rate. Produce a clean results table and a short write-up comparing two text-normalization strategies (raw vs. Whisper-style normalization) and how each affects the WER number for Devanagari text. This teaches you how ASR models are loaded, how transcription inference works end-to-end, and why "the metric" is never a single fixed number — it depends on how you normalize. It is the natural first rung because it requires no GPU training, just inference, yet it forces you to handle real audio loading, batching, and metric computation.
2. (Beginner-Intermediate) Build a clean audio-preprocessing pipeline. Take the raw Josh Talks-style dataset schema (audio URL + transcript URL + metadata) and build a robust, reusable preprocessing script that downloads each pair, resamples to 16 kHz mono, computes durations, filters out corrupt or mismatched samples, and emits a tidy HuggingFace Dataset ready for training. Add a small data-quality dashboard (counts of dropped samples, duration histogram, noise-level breakdown from metadata). The deliverable is the pipeline plus a one-page "data report." This rung deepens the data-engineering muscle that real ASR work depends on — most of the effort in speech projects is data cleaning, not modeling.
3. (Intermediate-Advanced) Full fine-tune with an ablation study. Do the complete assignment — fine-tune whisper-small on the ~10 hours of Hindi data and report baseline vs. fine-tuned WER on FLEURS — but go further by running a small ablation. Compare at least two training configurations (e.g. different learning rates, with vs. without data filtering, or whisper-tiny vs. whisper-small), and additionally report Character Error Rate alongside WER. Present everything in a single comparison table and write a paragraph on the trade-offs you observed (training time, GPU memory, accuracy). This is the advanced ML rung: it mirrors how a researcher actually iterates, not just runs one job, and teaches you to reason about experiments rather than chase a single score.
4. (Intermediate, Agentic AI twist) Voice-driven assistant agent. Wrap a Whisper model (pretrained or your fine-tuned Hindi checkpoint) as the "ears" of an agentic AI. Build an agent that accepts a spoken Hindi audio clip, transcribes it with Whisper, then routes the transcript to an LLM-based tool-using agent that can answer questions, look something up, or trigger a function (e.g. "set a reminder," "summarize this"). The deliverable is a small command-line or notebook agent loop demonstrating speech -> transcription -> reasoning -> action. This twist pushes you from pure ML into the Agentic AI track: you learn how speech models become one tool inside a larger agent pipeline, and you confront real issues like transcription errors propagating into agent decisions.
5. (Advanced, MERN full-stack twist) Web transcription studio. Build a MERN (MongoDB, Express, React, Node) web application that exposes your transcription model as a service. Users upload or record an audio clip in a React front end; an Express/Node back end sends it to a Whisper inference endpoint (your fine-tuned Hindi model served via a small Python FastAPI microservice, called from Node), and the transcript is stored in MongoDB and displayed with an editable text area so users can correct it. Add a history view of past transcriptions and a simple WER display when a reference transcript is provided. This is the most ambitious rung and lives squarely in the full-stack track: you integrate an ML model into a production-style web app, handle file uploads and async jobs, and design a usable UI — the kind of end-to-end product engineering that turns a model into something people actually use.
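For the voice-driven agent rung (item 4), the speech -> transcription -> reasoning -> action loop can be sketched with plain callables. Everything here is a hypothetical stand-in: `fake_transcribe` replaces a Whisper call, `keyword_router` replaces the LLM's tool choice, and the tool names are invented for illustration.

```python
from typing import Callable

Tool = Callable[[str], str]


def run_agent(audio: bytes,
              transcribe: Callable[[bytes], str],
              choose_tool: Callable[[str], str],
              tools: dict[str, Tool]) -> dict[str, str]:
    """One pass of the loop: speech -> transcription -> reasoning -> action."""
    transcript = transcribe(audio)           # speech -> transcription
    tool_name = choose_tool(transcript)      # transcription -> reasoning
    if tool_name not in tools:               # unknown choice: degrade gracefully
        tool_name = "fallback"
    result = tools[tool_name](transcript)    # reasoning -> action
    return {"transcript": transcript, "tool": tool_name, "result": result}


def fake_transcribe(audio: bytes) -> str:
    """Stand-in for Whisper; always returns the same Hindi sentence."""
    return "कल सुबह आठ बजे मुझे याद दिलाना"


def keyword_router(transcript: str) -> str:
    """Stand-in for an LLM tool selector, using simple keyword matching."""
    if "याद" in transcript:
        return "set_reminder"
    if "सारांश" in transcript:
        return "summarize"
    return "unknown"


TOOLS: dict[str, Tool] = {
    "set_reminder": lambda text: f"reminder created from: {text}",
    "summarize": lambda text: f"summary requested for: {text}",
    "fallback": lambda text: "no matching tool; ask the user to repeat",
}

outcome = run_agent(b"", fake_transcribe, keyword_router, TOOLS)
print(outcome["tool"], "->", outcome["result"])
```

Keeping the transcript in the returned dictionary makes it possible to inspect cases where a transcription error led the router to the wrong tool, which is the failure mode this rung asks you to confront.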
- WhisperProcessor, the speech-to-text data collator, the compute_metrics WER function, and Seq2SeqTrainer. Its worked example is literally Hindi and reports a baseline-to-fine-tuned WER drop, making it the closest match to this assignment.
- google/fleurs, reporting both WER and CER per utterance. Excellent reference for the data-prep and evaluation scripts this task needs.