Automated Web Search & Scraping System
Playwright + DuckDuckGo API + LLM-driven extraction
A production-ready blueprint for a system that searches the web, visits result pages, extracts structured data with an LLM, and returns clean JSON. The design goals are speed, reliability, and auditability.
✨ Highlights
- Search-first: Query DuckDuckGo (API or HTML fallback) to discover relevant URLs.
- Headless browsing: Use Playwright to render JavaScript-heavy pages, handle cookies, and emulate devices.
- Smart extraction: An LLM converts messy HTML into structured JSON based on a task schema.
- Resilient pipeline: Timeouts, retries, backoff, and per-domain throttling.
- Pluggable: Swap search providers, add content normalizers, or integrate vector stores.
- Auditable: Persist raw HTML, normalized text, and final JSON with trace IDs.
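
To make the "Resilient pipeline" highlight concrete, here is a minimal sketch of retries with capped exponential backoff plus per-domain throttling, using only the Python standard library. The names `DomainThrottle`, `backoff_delays`, and `fetch_with_retries` are hypothetical and not part of this project's API, and the default intervals are illustrative assumptions. The `fetch` argument stands in for whatever callable actually loads a page (for example, a wrapper around a Playwright navigation with its own timeout).

```python
import random
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # domain -> time of the most recent request

    def wait(self, url):
        """Block until at least min_interval has passed since the last hit on this domain."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last_hit.get(domain)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_hit[domain] = now


def backoff_delays(retries, base=0.5, cap=8.0):
    """Return capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(retries)]


def fetch_with_retries(fetch, url, throttle, retries=3, base=0.5, cap=8.0,
                       sleep=time.sleep):
    """Call fetch(url); on any exception, retry up to `retries` times with jittered backoff."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        throttle.wait(url)  # respect the per-domain rate limit on every attempt
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error to the caller
            # Full jitter: sleep a random amount up to the current backoff delay.
            sleep(random.uniform(0, delays[attempt]))
```

A caller would create one `DomainThrottle` per pipeline run and pass it to every `fetch_with_retries` call, so that requests to the same host are spaced out even when they come from different search results. Catching every exception is a simplification; a real pipeline would likely retry only on timeouts and transient network errors.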