Product Data Explorer — A Live-Scraping Full-Stack Platform
Source: Derived from Resources/Product Data Explorer — Full-Stack Assignment.pdf — a real full-stack hiring take-home that asks candidates to build a production-minded product exploration platform powered by live, on-demand scraping of World of Books (worldofbooks.com). Submission is via a Google Form requiring a GitHub repo link plus a live deployed URL.
Skills Required
- Frontend: React with Next.js (App Router), TypeScript, Tailwind CSS, component-driven UI design.
- Data fetching on the client: SWR or React Query (caching, revalidation, optimistic UI updates).
- UX engineering: responsive layouts (desktop + mobile), skeleton/loading states, smooth transitions, and accessibility (semantic HTML, keyboard navigation, alt text, color contrast, WCAG AA basics).
- Backend: NestJS (Node + TypeScript) — modules, controllers, providers, dependency injection.
- API design: RESTful endpoint design, DTO validation (class-validator / class-transformer), structured error handling, logging, CORS configuration.
- Databases: PostgreSQL or MongoDB schema design, relationships, foreign keys, indexes, and unique constraints; an ORM/ODM (TypeORM, Prisma, or Mongoose).
- Web scraping: Crawlee + Playwright (headless browser automation), DOM selection, handling dynamic/JavaScript-rendered content, pagination crawling.
- Scraping discipline: respecting robots.txt and ToS, rate limiting, delays, retries with exponential backoff, deduplication, and caching with expiry (TTL).
- Concurrency & reliability: background queue/worker model (e.g., BullMQ + Redis), idempotent jobs, not blocking the request thread, resource cleanup of browser instances.
- Caching layer: DB-backed or Redis caching with explicit expiry to avoid re-scraping unchanged pages.
- Observability: logging, basic metrics, and error tracking.
- Testing & CI: unit tests, a couple of integration tests, GitHub Actions for lint/test/build.
- DevOps & deployment: Docker / docker-compose, environment-variable hygiene (.env.example, no committed secrets), and deploying to Vercel / Render / Railway / Fly.io / Heroku.
- Documentation: README with architecture overview and design decisions, plus API docs via Swagger/OpenAPI or markdown.
Background a Student Needs
A student should be comfortable building a full-stack TypeScript application end to end: a Next.js frontend that fetches from a REST API, and a NestJS backend that talks to a relational or document database. Beyond CRUD, the defining challenge here is live web scraping — pulling structured data (navigation headings, categories, products, reviews) from a real e-commerce site on demand, then persisting it cleanly with relationships, unique constraints, deduplication, and time-based caching. That means understanding headless browsers (Playwright via Crawlee), why long-running scrapes belong in a background queue rather than blocking an HTTP request, and how to scrape ethically (robots.txt, rate limits, backoff). Familiarity with deployment, Docker, and writing a clear README rounds out the profile, because the assignment is graded as much on engineering hygiene and a working live link as on raw features.
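To make the "background queue rather than blocking an HTTP request" idea concrete, here is a minimal in-memory sketch. It is not BullMQ and not part of the brief: the class name `InMemoryScrapeQueue`, the `ScrapeJob` shape, and the choice to key jobs by target URL are all assumptions made for illustration. The request handler would call `enqueue` and return the job right away, while a separate worker loop calls `processNext`.

```typescript
type JobStatus = "pending" | "running" | "done" | "failed";

interface ScrapeJob {
  id: number;
  targetUrl: string;
  status: JobStatus;
  errorLog?: string;
}

// Hypothetical in-memory stand-in for a real queue such as BullMQ + Redis.
class InMemoryScrapeQueue {
  private jobsByUrl = new Map<string, ScrapeJob>();
  private pending: ScrapeJob[] = [];
  private nextId = 1;

  // Called from the HTTP handler: returns immediately and never scrapes inline.
  enqueue(targetUrl: string): ScrapeJob {
    const existing = this.jobsByUrl.get(targetUrl);
    if (existing && (existing.status === "pending" || existing.status === "running")) {
      return existing; // idempotent: an active job for this URL is reused
    }
    const job: ScrapeJob = { id: this.nextId++, targetUrl, status: "pending" };
    this.jobsByUrl.set(targetUrl, job);
    this.pending.push(job);
    return job;
  }

  // Called from a worker loop, outside the request/response cycle.
  async processNext(scrape: (url: string) => Promise<void>): Promise<ScrapeJob | undefined> {
    const job = this.pending.shift();
    if (!job) return undefined;
    job.status = "running";
    try {
      await scrape(job.targetUrl);
      job.status = "done";
    } catch (err) {
      job.status = "failed";
      job.errorLog = err instanceof Error ? err.message : String(err);
    }
    return job;
  }
}
```

The frontend can then poll the job's status instead of holding a request open for the duration of a headless-browser scrape.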
Task Summary
Build a production-minded "Product Data Explorer" that lets users drill down from high-level navigation headings → categories → product grids → individual product detail pages, where the underlying data is scraped live and on-demand from World of Books. The Next.js/TypeScript frontend consumes a NestJS REST API; the backend triggers Crawlee + Playwright scrapes, persists results to a real database with relationships and unique constraints, and caches/deduplicates aggressively so the source site is not overloaded. Deliverables include a deployed live app, a GitHub repo with tests and CI, API documentation, and a fallback seed script.
The Task
Goal: Build a product exploration platform that navigates from high-level headings → categories → products → product detail pages, powered by live, on-demand scraping. All scraping must target World of Books (https://www.worldofbooks.com/). You must use the technologies specified below.
Submission: A GitHub repo link (public, or private with access) and a deployed project link (live frontend + working backend), submitted through the provided Google Form. The deployment must be live at submission time.
Frontend (must have)
- Tech: React (Next.js, App Router), TypeScript, Tailwind CSS.
- Core pages/components:
- Landing / Home showing navigation headings.
- Category drilldown pages.
- Product grid / results with paging / limit support.
- Product detail page (reviews, ratings, recommendations, metadata).
- About / Contact / README page within the site.
- UX: Responsive (desktop & mobile), accessible (WCAG AA basics), skeleton/loading states, and smooth transitions. Persist user navigation & browsing history both client-side and via the backend so it survives reloads. Use a client data-fetching strategy (SWR or React Query recommended).
- Deliverables: Live deployed frontend URL; README with local-run instructions, environment variables, and build steps.
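The client-side half of "persist user navigation & browsing history" can be sketched as below. This is one possible approach, not the brief's prescribed one: the storage key `pde:history`, the cap of 20 entries, and the `KeyValueStore` interface are assumptions introduced here so the logic can run outside a browser.

```typescript
// Minimal storage interface; the browser's localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const HISTORY_KEY = "pde:history"; // hypothetical key name
const MAX_ENTRIES = 20;            // assumed cap on stored paths

// Read the stored path list, tolerating missing or corrupted data.
function loadHistory(store: KeyValueStore): string[] {
  try {
    const raw = store.getItem(HISTORY_KEY);
    const parsed: unknown = raw ? JSON.parse(raw) : [];
    return Array.isArray(parsed)
      ? parsed.filter((p): p is string => typeof p === "string")
      : [];
  } catch {
    return []; // corrupted JSON: start with an empty history
  }
}

// Append a visited path, moving repeat visits to the end and trimming old entries.
function recordVisit(store: KeyValueStore, path: string): string[] {
  const history = loadHistory(store).filter((p) => p !== path);
  history.push(path);
  const trimmed = history.slice(-MAX_ENTRIES);
  store.setItem(HISTORY_KEY, JSON.stringify(trimmed));
  return trimmed;
}
```

In the browser, `window.localStorage` can be passed as the store. The same array could also be sent to a backend endpoint that writes `view_history.path_json`; the brief does not specify that endpoint, so its shape is left open here.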
Backend (must have)
- Tech: NestJS (Node + TypeScript).
- DB: A production-ready database — PostgreSQL, MongoDB, or another (justify the choice).
- Expose RESTful endpoints (an example API contract is implied by the entities below).
- On relevant calls, trigger a real-time scrape (Crawlee + Playwright) and store results, supporting both on-demand scrapes triggered by user actions and safe caching to avoid excessive scraping.
- Robust engineering: proper DTO validation, error handling, logging, and resource cleanup; concurrency handling, deduplication of scrape results, and idempotency where applicable; rate limiting / backoff for the external site and queueing of long-running scrapes.
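One way to handle the "concurrency handling, deduplication" requirement is to coalesce concurrent requests for the same target so that only one scrape runs at a time per key. The sketch below is an assumption about how that could look; `scrapeOnce` and the key format are hypothetical names, and a multi-instance deployment would need a shared lock (for example in Redis) rather than a process-local map.

```typescript
// Scrapes currently running in this process, keyed by target (e.g. "product:<sourceId>").
const inFlight = new Map<string, Promise<unknown>>();

// Run `scrape` for `key` unless one is already running; concurrent callers share the result.
function scrapeOnce<T>(key: string, scrape: () => Promise<T>): Promise<T> {
  const running = inFlight.get(key) as Promise<T> | undefined;
  if (running) return running;

  const started = scrape().then(
    (value) => {
      inFlight.delete(key); // allow a fresh scrape once this one settles
      return value;
    },
    (err) => {
      inFlight.delete(key);
      throw err;
    },
  );
  inFlight.set(key, started);
  return started;
}
```

Two users opening the same product page at the same moment would then trigger a single scrape, and both requests resolve from the same promise.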
Scraping (World of Books)
- Target site: https://www.worldofbooks.com/, using Crawlee + Playwright (or an equivalent headless framework).
- Extract and persist: navigation headings (e.g., "Books", "Categories", "Children's Books"); categories & subcategories; product tiles/cards (Title, Author, Price, Image, Product Link, Source ID); and product detail pages (full description, user reviews & ratings if present, related/recommended products, and extra metadata such as publisher, publication date, ISBN).
- Save all scraped data with relationships & unique constraints. Implement deduplication and caching with expiry so repeated scrapes don't overload the source. Provide a way to re-fetch updated product data on demand.
- Ethical scraping: respect robots.txt and terms of service; use rate limiting and delays; implement retries and exponential backoff; cache wherever possible.
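The "retries and exponential backoff" item can be illustrated with a small helper. This is a generic sketch, not Crawlee's built-in retry mechanism; the base delay of 500 ms, the 30 s cap, and the default of 4 attempts are assumed values, and the injectable `sleep` parameter exists only to make the helper easy to test.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run `task`, retrying on failure with exponentially growing pauses between attempts.
async function withRetries<T>(
  task: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // No pause after the final attempt; the error is rethrown below.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

With the assumed defaults the pauses are 500 ms, 1 s, and 2 s. Adding random jitter to each delay is a common refinement so that many failed jobs do not retry at the same instant.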
Suggested database schema (entities)
- navigation: id, title, slug, last_scraped_at
- category: id, navigation_id, parent_id, title, slug, product_count, last_scraped_at
- product: id, source_id, title, price, currency, image_url, source_url, last_scraped_at
- product_detail: product_id (FK), description, specs (json), ratings_avg, reviews_count
- review: id, product_id, author, rating, text, created_at
- scrape_job: id, target_url, target_type, status, started_at, finished_at, error_log
- view_history: id, user_id (optional), session_id, path_json, created_at
- Add indexes on source_id, last_scraped_at; unique constraints on source_url / source_id.
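The effect of a unique constraint on source_id is that re-scraping a product updates its row instead of creating a duplicate. The sketch below mimics that behaviour in memory; the `Product` interface follows the suggested product entity (with camelCase names chosen here), and `ProductStore` is a hypothetical stand-in for a real table. In PostgreSQL the same effect is usually achieved with a unique index plus `INSERT ... ON CONFLICT`, or with an ORM's upsert.

```typescript
interface Product {
  sourceId: string; // unique per product on the source site
  title: string;
  price: number;
  currency: string;
  imageUrl: string;
  sourceUrl: string;
  lastScrapedAt: Date;
}

// In-memory stand-in for a table with a unique constraint on source_id.
class ProductStore {
  private bySourceId = new Map<string, Product>();

  // Insert a new row, or overwrite the existing row with the same sourceId.
  upsert(product: Product): "inserted" | "updated" {
    const existed = this.bySourceId.has(product.sourceId);
    this.bySourceId.set(product.sourceId, product);
    return existed ? "updated" : "inserted";
  }

  get(sourceId: string): Product | undefined {
    return this.bySourceId.get(sourceId);
  }

  count(): number {
    return this.bySourceId.size;
  }
}
```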
Non-functional requirements
- Security: sanitize inputs, secure env vars, no committed secrets, proper CORS, minimal rate limiting.
- Performance & caching: DB or Redis caching layer with explicit expiry; avoid re-scraping unchanged pages.
- Observability: logging, basic metrics, error tracking.
- Reliability: queue/worker model for scrapes (don't block the request thread); idempotent jobs.
- Accessibility: semantic HTML, keyboard nav, alt on images, color contrast.
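The "explicit expiry" requirement reduces to a freshness check against last_scraped_at before any scrape is queued. A minimal sketch follows, assuming a 24-hour TTL and hypothetical function names (`isStale`, `decide`); the brief does not fix a TTL, so the value is a placeholder to tune per entity type.

```typescript
const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // assumed TTL of 24 hours

// True when the record has never been scraped or its last scrape is older than the TTL.
function isStale(
  lastScrapedAt: Date | null,
  now: Date = new Date(),
  ttlMs: number = DEFAULT_TTL_MS,
): boolean {
  if (lastScrapedAt === null) return true;
  return now.getTime() - lastScrapedAt.getTime() >= ttlMs;
}

type CacheDecision = "serve-cached" | "enqueue-scrape";

// Serve stored data while it is fresh; scrape when stale or when the user forces a refresh.
function decide(
  lastScrapedAt: Date | null,
  forceRefresh: boolean,
  now: Date = new Date(),
): CacheDecision {
  return (forceRefresh || isStale(lastScrapedAt, now)) ? "enqueue-scrape" : "serve-cached";
}
```

The `forceRefresh` flag corresponds to the on-demand re-fetch the brief asks for, while the TTL path keeps routine browsing from hitting the source site.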
Deliverables
- GitHub repo with: frontend/ and backend/ folders; CI pipeline (GitHub Actions) for lint/test/build (recommended); README with architecture overview, design decisions, and deployment instructions; database schema + sample seed script; API documentation (Swagger or markdown); tests (unit + a couple of integration); Dockerfiles (bonus, preferred).
- Deployed project link(s): production frontend URL, live at submission time.
Acceptance checklist (must pass)
- Landing loads navigation headings (via backend from World of Books).
- Drilldown loads categories/subcategories.
- Product grid displays real scraped products.
- Product detail page includes description, reviews/ratings, recommendations.
- DB persists all scraped objects reliably.
- On-demand scrape can refresh a product/category.
- Frontend is responsive with an accessibility baseline.
- README + deploy links + API docs present.
- Repo builds and runs with the provided instructions.
Evaluation rubric (weights)
- Correctness & completeness — 35%
- Architecture & engineering quality — 20%
- Scraping reliability & design — 15%
- UX & accessibility — 10%
- Docs & deploy — 10%
- Tests & CI — 10%
Bonus (highly valued)
Product search + rich filters (price range, rating, author); intelligent caching / refresh strategy (DB-backed TTL, conditional scraping); SWR / React Query with optimistic UI; personalized recommendations or a simple content-based similarity engine; full Docker setup with docker-compose; comprehensive test coverage (unit + e2e); API versioning and OpenAPI/Swagger with examples; CI-based deploy.
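As an illustration of the "rich filters (price range, rating, author)" bonus, the filter logic might look like the sketch below. The `ProductSummary` and `ProductFilter` shapes are assumptions made for this example; in a real backend the same conditions would normally be pushed into the database query rather than applied in memory.

```typescript
interface ProductSummary {
  title: string;
  author: string;
  price: number;
  ratingsAvg: number | null; // null when the product has no ratings
}

interface ProductFilter {
  minPrice?: number;
  maxPrice?: number;
  minRating?: number;
  author?: string; // case-insensitive substring match
}

// Keep only the products that satisfy every filter field that is set.
function applyFilter(items: ProductSummary[], filter: ProductFilter): ProductSummary[] {
  const author = filter.author?.trim().toLowerCase();
  return items.filter(
    (p) =>
      (filter.minPrice === undefined || p.price >= filter.minPrice) &&
      (filter.maxPrice === undefined || p.price <= filter.maxPrice) &&
      (filter.minRating === undefined ||
        (p.ratingsAvg !== null && p.ratingsAvg >= filter.minRating)) &&
      (!author || p.author.toLowerCase().includes(author)),
  );
}
```

Unrated products are excluded whenever a minimum rating is requested, which is a design choice worth stating in the API docs.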
Tips & constraints
Be kind to World of Books (delays, backoff, caching). Focus on core features before bonuses. Never commit secrets — use .env.example. Include a fallback seed script so reviewers can test even if scraping fails during review.
Alternate Tasks (Mini-Project Variations)
- (Beginner) Static Books Catalog Explorer: no live scraping. Start from a provided JSON dataset (or scrape books.toscrape.com, a sandbox built explicitly for practice, just once and commit the output). Build a Next.js + TypeScript + Tailwind frontend with a home page of category headings, a product grid with client-side paging, and a product detail page. There is no backend queue and no live scraping; instead, students learn the shape of the problem (navigation → category → product → detail) and focus purely on clean component structure, responsive layout, skeleton loading states, and accessibility basics (alt text, keyboard nav). This isolates the frontend craft and the data model before any scraping complexity is introduced.
- (Beginner–Intermediate) Single-Page Scraper + REST API. Build a minimal NestJS backend with one scraping module that, given a category URL from books.toscrape.com, scrapes the product tiles (title, price, image, link) once per request and returns them via a single REST endpoint with proper DTO validation and error handling. Persist results to a lightweight Postgres table or MongoDB collection with a unique constraint on the source URL, and add a simple "cache with expiry" check so a repeated request within N minutes returns the stored copy instead of re-scraping. This variation teaches the core backend loop (scrape, validate, persist, cache) without the full multi-entity schema or background queues, and pairs naturally with variation 1 as a frontend.
- (Intermediate–Advanced) The Full Assignment, Faithful. Implement the original brief end to end against World of Books: the four-level drilldown (navigation → categories → products → detail), live on-demand Crawlee + Playwright scraping behind a NestJS API, the full relational schema (navigation, category, product, product_detail, review, scrape_job, view_history) with indexes and unique constraints, a background queue/worker so scrapes never block the request thread, DB-backed TTL caching with deduplication, retries with exponential backoff, a Next.js + React Query frontend with persisted browsing history, plus README, API docs, tests, CI, Dockerfiles, a fallback seed script, and a live deployment. This is the production-grade target and exercises every skill in the rubric simultaneously.
- (Advanced, MERN twist) Express/Mongo Product Explorer with a Worker Service. Re-architect the same platform on a MERN-flavored stack: replace NestJS with an Express + TypeScript API, use MongoDB with Mongoose schemas for navigation/category/product/review documents, and split scraping into a separate worker process driven by a BullMQ + Redis queue so the API container and the scraper container scale independently (wire it all together with docker-compose). The React frontend uses SWR with optimistic updates and a "Refresh data" button that enqueues an on-demand re-scrape and polls job status from a scrape_job collection. This variation keeps the domain identical but pushes students into queue-driven microservice thinking, MongoDB document modeling, and inter-service communication, which are the realities of a MERN job.
- (Advanced, Agentic AI twist) Agent-Driven Book Discovery & Enrichment. Layer an AI agent on top of the scraped catalog. After the scraper populates the product database, build a NestJS (or FastAPI) endpoint backed by an LLM agent with tools: a search_catalog tool over your DB, a scrape_detail tool that triggers an on-demand Crawlee scrape for a missing product, and an enrich tool that summarizes reviews and generates content-based "you might also like" recommendations using embeddings/vector similarity. The frontend becomes a conversational explorer: a user asks "find me cozy mystery novels under £10 with good reviews," and the agent plans, queries the cached data, scrapes on demand if coverage is thin, and returns a ranked, explained result set. This twist combines the scraping/caching backbone with tool-calling agent design, retrieval, and recommendation, bridging the full-stack and Agentic AI tracks.
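The "embeddings/vector similarity" recommendation step in the Agentic AI variation can be sketched with plain cosine similarity. How the embedding vectors are produced is outside this example; the `Embedded` shape and the `mostSimilar` helper are hypothetical names, and a larger catalog would typically use a vector index instead of a linear scan.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Embedded {
  id: string;       // e.g. a product's source_id
  vector: number[]; // embedding from any model; assumed precomputed
}

// Return the ids of the k catalog items most similar to the target, best first.
function mostSimilar(target: Embedded, catalog: Embedded[], k: number): string[] {
  return catalog
    .filter((item) => item.id !== target.id)
    .map((item) => ({ id: item.id, score: cosineSimilarity(target.vector, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.id);
}
```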
Reference Links
- Crawlee (official site and Crawlee JS Quick Start): the exact scraping library named in the assignment; the quick start shows how PlaywrightCrawler, requestHandler, enqueueLinks(), and maxRequestsPerCrawl work for reliable, rate-limited crawls.
- apify/crawlee on GitHub: source, examples, and the PlaywrightCrawler internals; good for understanding queue management, proxy rotation, and headless/headful modes.
- oxylabs/crawlee-web-scraping-tutorial (GitHub): a working end-to-end Crawlee scraping example to model the backend scrape module on.
- Full-Stack Development with Next.js and Nest.js (Medium): walkthrough of wiring a Next.js frontend to a NestJS backend, matching the assignment's required stack.
- Playwright Web Scraping Tutorial (Oxylabs): fundamentals of scraping JavaScript-rendered pages with a headless browser, useful for handling dynamic product/detail pages.
- Web Scraping With Next.js (Bright Data): patterns for combining a Next.js frontend with scraped data sources.
- books.toscrape.com: a sandbox bookstore site built specifically for scraping practice; ideal stand-in for the beginner/intermediate variations so students avoid hammering a real commercial site while learning.