Product Data Explorer — A Live-Scraping Full-Stack Platform

Source: Derived from Resources/Product Data Explorer — Full-Stack Assignment.pdf — a real full-stack hiring take-home that asks candidates to build a production-minded product exploration platform powered by live, on-demand scraping of World of Books (worldofbooks.com). Submission is via a Google Form requiring a GitHub repo link plus a live deployed URL.

Skills Required

Background a Student Needs

A student should be comfortable building a full-stack TypeScript application end to end: a Next.js frontend that fetches from a REST API, and a NestJS backend that talks to a relational or document database. Beyond CRUD, the defining challenge here is live web scraping — pulling structured data (navigation headings, categories, products, reviews) from a real e-commerce site on demand, then persisting it cleanly with relationships, unique constraints, deduplication, and time-based caching. That means understanding headless browsers (Playwright via Crawlee), why long-running scrapes belong in a background queue rather than blocking an HTTP request, and how to scrape ethically (robots.txt, rate limits, backoff). Familiarity with deployment, Docker, and writing a clear README rounds out the profile, because the assignment is graded as much on engineering hygiene and a working live link as on raw features.
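
To make the point about background queues concrete, here is a minimal in-memory sketch of the idea: the HTTP handler only records a job and returns, while a separate worker loop performs the slow scrape and the client polls for status. The names ScrapeJobQueue, enqueue, getStatus, and runNext are hypothetical and not part of the brief; a real submission would more likely use a persistent queue and store job state in the database.

```typescript
// Hypothetical job states for illustration; the brief does not prescribe them.
type JobStatus = "queued" | "running" | "done" | "failed";

interface ScrapeJob {
  id: number;
  targetUrl: string;
  status: JobStatus;
  error?: string;
}

class ScrapeJobQueue {
  private jobs = new Map<number, ScrapeJob>();
  private pending: number[] = [];
  private nextId = 1;

  // Called from the HTTP handler: records the job and returns immediately,
  // so the request never waits for the browser-driven scrape.
  enqueue(targetUrl: string): ScrapeJob {
    const job: ScrapeJob = { id: this.nextId++, targetUrl, status: "queued" };
    this.jobs.set(job.id, job);
    this.pending.push(job.id);
    return job;
  }

  // Called by the client (via a status endpoint) to poll progress.
  getStatus(id: number): JobStatus | undefined {
    return this.jobs.get(id)?.status;
  }

  // Called by a worker loop outside any request; processes one job.
  // Returns false when there was nothing to do.
  async runNext(scrape: (url: string) => Promise<void>): Promise<boolean> {
    const id = this.pending.shift();
    if (id === undefined) return false;
    const job = this.jobs.get(id)!;
    job.status = "running";
    try {
      await scrape(job.targetUrl);
      job.status = "done";
    } catch (err) {
      job.status = "failed";
      job.error = err instanceof Error ? err.message : String(err);
    }
    return true;
  }
}
```

In this sketch a failed scrape marks the job as failed instead of surfacing an error to the original request, which has already returned with the job id.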

Task Summary

Build a production-minded "Product Data Explorer" that lets users drill down from high-level navigation headings → categories → product grids → individual product detail pages, where the underlying data is scraped live and on-demand from World of Books. The Next.js/TypeScript frontend consumes a NestJS REST API; the backend triggers Crawlee + Playwright scrapes, persists results to a real database with relationships and unique constraints, and caches/deduplicates aggressively so the source site is not overloaded. Deliverables include a deployed live app, a GitHub repo with tests and CI, API documentation, and a fallback seed script.
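
One way to read "caches/deduplicates aggressively" is sketched below: serve a stored result while it is younger than a TTL, and let concurrent requests for the same key share a single in-flight scrape. This is an in-memory illustration only. ScrapeCache, its get method, and the injectable clock are assumptions made for the example; the brief asks for the cache to be backed by the database.

```typescript
interface CacheEntry<T> {
  value: T;
  fetchedAt: number; // epoch milliseconds
}

class ScrapeCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  private inFlight = new Map<string, Promise<T>>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so the expiry logic can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async get(key: string, scrape: () => Promise<T>): Promise<T> {
    const cached = this.entries.get(key);
    if (cached && this.now() - cached.fetchedAt < this.ttlMs) {
      return cached.value; // still fresh: do not touch the source site
    }

    const running = this.inFlight.get(key);
    if (running) return running; // another caller is already scraping this key

    const promise = scrape()
      .then((value) => {
        this.entries.set(key, { value, fetchedAt: this.now() });
        return value;
      })
      .finally(() => {
        this.inFlight.delete(key);
      });
    this.inFlight.set(key, promise);
    return promise;
  }
}
```

A failed scrape is not cached in this sketch, so the next request for that key tries again.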

The Task

Goal: Build a product exploration platform that navigates from high-level headings → categories → products → product detail pages, powered by live, on-demand scraping. All scraping must target World of Books (https://www.worldofbooks.com/). You must use the technologies specified below.

Submission: A GitHub repo link (public, or private with access) and a deployed project link (live frontend + working backend), submitted through the provided Google Form. The deployment must be live at submission time.

Frontend (must have)

Backend (must have)

Scraping (World of Books)

Suggested database schema (entities)

Non-functional requirements

Deliverables

  1. GitHub repo with: frontend/ and backend/ folders; CI pipeline (GitHub Actions) for lint/test/build (recommended); README with architecture overview, design decisions, and deployment instructions; database schema + sample seed script; API documentation (Swagger or markdown); tests (unit + a couple of integration); Dockerfiles (bonus, preferred).
  2. Deployed project link(s): production frontend URL with a working backend, live at submission time.

Acceptance checklist (must pass)

Landing loads navigation headings (via backend from World of Books); drilldown loads categories/subcategories; product grid displays real scraped products; product detail page includes description, reviews/ratings, recommendations; DB persists all scraped objects reliably; on-demand scrape can refresh a product/category; frontend is responsive with an accessibility baseline; README + deploy links + API docs present; repo builds and runs with the provided instructions.

Evaluation rubric (weights)

Bonus (highly valued)

Product search + rich filters (price range, rating, author); intelligent caching / refresh strategy (DB-backed TTL, conditional scraping); SWR / React Query with optimistic UI; personalized recommendations or a simple content-based similarity engine; full Docker setup with docker-compose; comprehensive test coverage (unit + e2e); API versioning and OpenAPI/Swagger with examples; CI-based deploy.
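
As a rough illustration of the "simple content-based similarity engine" bonus, the sketch below ranks catalog entries by Jaccard similarity over title and author tokens. Jaccard overlap is a stand-in chosen for brevity, not something the brief specifies, and the BookSummary shape and function names are hypothetical.

```typescript
// Hypothetical minimal product shape for the example.
interface BookSummary {
  id: string;
  title: string;
  author: string;
}

// Lowercase the text and split it into a set of alphanumeric tokens.
function tokenize(text: string): Set<string> {
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
  return new Set(tokens);
}

// Jaccard similarity: size of the intersection divided by size of the union.
function jaccard(a: Set<string>, b: Set<string>): number {
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  const unionSize = a.size + b.size - shared;
  return unionSize === 0 ? 0 : shared / unionSize;
}

// Return up to `limit` other books, most similar first.
function recommend(target: BookSummary, catalog: BookSummary[], limit = 3): BookSummary[] {
  const targetTokens = tokenize(`${target.title} ${target.author}`);
  return catalog
    .filter((book) => book.id !== target.id)
    .map((book) => ({
      book,
      score: jaccard(targetTokens, tokenize(`${book.title} ${book.author}`)),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map((entry) => entry.book);
}
```

Token overlap is crude (it treats "the" and "of" as meaningful), so a real submission might weight tokens or compare descriptions and categories instead.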

Tips & constraints

Be kind to World of Books (delays, backoff, caching). Focus on core features before bonuses. Never commit secrets — use .env.example. Include a fallback seed script so reviewers can test even if scraping fails during review.
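
The "delays, backoff" advice can be sketched as a retry helper whose wait doubles after each failure, up to a cap. The function names, the 500 ms base delay, the 30 second cap, and the default of four attempts are all assumptions for illustration; the brief does not give specific numbers.

```typescript
// Delay before the retry that follows failed attempt number `attempt` (0-based).
// Doubles each time and is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Run `task`, retrying on failure with exponential backoff.
// `sleep` is injectable so tests do not have to wait in real time.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt will follow.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

A production version would probably add random jitter to the delay and retry only on errors that look transient, such as timeouts or rate-limit responses.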

Alternate Tasks (Mini-Project Variations)

  1. (Beginner) Static Books Catalog Explorer — no live scraping. Start from a provided JSON dataset (or scrape books.toscrape.com, a sandbox built explicitly for practice, just once and commit the output). Build a Next.js + TypeScript + Tailwind frontend with a home page of category headings, a product grid with client-side paging, and a product detail page. There is no backend queue and no live scraping; instead, students learn the shape of the problem — navigation → category → product → detail — and focus purely on clean component structure, responsive layout, skeleton loading states, and accessibility basics (alt text, keyboard nav). This isolates the frontend craft and the data model before any scraping complexity is introduced.
  2. (Beginner–Intermediate) Single-Page Scraper + REST API. Build a minimal NestJS backend with one scraping module that, given a category URL from books.toscrape.com, scrapes the product tiles (title, price, image, link) on request and returns them via a single REST endpoint with proper DTO validation and error handling. Persist results to a Postgres table or MongoDB collection with a unique constraint (a unique index, in MongoDB) on the source URL, and add a simple "cache with expiry" check so a repeated request within N minutes returns the stored copy instead of re-scraping. This variation teaches the core backend loop (scrape, validate, persist, cache) without the full multi-entity schema or background queues, and pairs naturally with variation 1 as a frontend.
  3. (Intermediate–Advanced) The Full Assignment, Faithful. Implement the original brief end to end against World of Books: the four-level drilldown (navigation → categories → products → detail), live on-demand Crawlee + Playwright scraping behind a NestJS API, the full relational schema (navigation, category, product, product_detail, review, scrape_job, view_history) with indexes and unique constraints, a background queue/worker so scrapes never block the request thread, DB-backed TTL caching with deduplication, retries with exponential backoff, a Next.js + React Query frontend with persisted browsing history, plus README, API docs, tests, CI, Dockerfiles, a fallback seed script, and a live deployment. This is the production-grade target and exercises every skill in the rubric simultaneously.
  4. (Advanced — MERN twist) Express/Mongo Product Explorer with a Worker Service. Re-architect the same platform on a MERN-flavored stack: replace NestJS with an Express + TypeScript API, use MongoDB with Mongoose schemas for navigation/category/product/review documents, and split scraping into a separate worker process driven by a BullMQ + Redis queue so the API container and the scraper container scale independently (wire it all together with docker-compose). The React frontend uses SWR with optimistic updates and a "Refresh data" button that enqueues an on-demand re-scrape and polls job status from a scrape_job collection. This variation keeps the domain identical but pushes students into queue-driven microservice thinking, MongoDB document modeling, and inter-service communication — the realities of a MERN job.
  5. (Advanced — Agentic AI twist) Agent-Driven Book Discovery & Enrichment. Layer an AI agent on top of the scraped catalog. After the scraper populates the product database, build a NestJS (or FastAPI) endpoint backed by an LLM agent with tools: a search_catalog tool over your DB, a scrape_detail tool that triggers an on-demand Crawlee scrape for a missing product, and an enrich tool that summarizes reviews and generates content-based "you might also like" recommendations using embeddings/vector similarity. The frontend becomes a conversational explorer: a user asks "find me cozy mystery novels under £10 with good reviews," and the agent plans, queries the cached data, scrapes on demand if coverage is thin, and returns a ranked, explained result set. This twist combines the scraping/caching backbone with tool-calling agent design, retrieval, and recommendation — bridging the full-stack and Agentic AI tracks.
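
Several of the variations above rely on a unique constraint on the source URL so that re-scraping updates a record rather than duplicating it. A minimal in-memory sketch of that upsert follows. ProductStore, ProductRecord, and normalizeSourceUrl are hypothetical names, and dropping the query string and fragment is an assumption that only holds when the site does not encode product identity in the query.

```typescript
// Hypothetical record shape; the real schema is up to the implementer.
interface ProductRecord {
  sourceUrl: string;
  title: string;
  price: number;
  lastScrapedAt: number; // epoch milliseconds
}

// Reduce a scraped URL to a stable key: trim whitespace, drop the query
// string and fragment, and remove trailing slashes.
function normalizeSourceUrl(raw: string): string {
  return raw.trim().replace(/[?#].*$/, "").replace(/\/+$/, "");
}

class ProductStore {
  // The Map key plays the role of the database's unique constraint.
  private bySourceUrl = new Map<string, ProductRecord>();

  // Insert a new record or overwrite the existing one for the same URL.
  upsert(
    input: { sourceUrl: string; title: string; price: number },
    now: number,
  ): "inserted" | "updated" {
    const key = normalizeSourceUrl(input.sourceUrl);
    const existed = this.bySourceUrl.has(key);
    this.bySourceUrl.set(key, {
      sourceUrl: key,
      title: input.title,
      price: input.price,
      lastScrapedAt: now,
    });
    return existed ? "updated" : "inserted";
  }

  get(sourceUrl: string): ProductRecord | undefined {
    return this.bySourceUrl.get(normalizeSourceUrl(sourceUrl));
  }

  count(): number {
    return this.bySourceUrl.size;
  }
}
```

With a real database the same effect usually comes from a unique index plus an upsert operation, and the stored lastScrapedAt value can double as the input to a TTL check.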

Reference Links