Curiosity

February 2025 – March 2025

Automated Web Search & Scraping System

Playwright + DuckDuckGo API + LLM-driven extraction

A production-ready blueprint for a system that searches the web, visits result pages, extracts structured data with an LLM, and returns clean JSON, with an emphasis on speed, reliability, and auditability.

✨ Highlights

Search-first: Query DuckDuckGo (API or HTML fallback) to discover relevant URLs.
Headless browsing: Use Playwright to render JavaScript-heavy pages, handle cookies, and emulate devices.
Smart extraction: An LLM converts messy HTML into structured JSON based on a task schema.
Resilient pipeline: Timeouts, retries, backoff, and per-domain throttling.
Pluggable: Swap search providers, add content normalizers, or integrate vector stores.
Auditable: Persist raw HTML, normalized text, and final JSON with trace IDs.
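
The "Resilient pipeline" highlight can be illustrated with a minimal sketch of retries, capped exponential backoff with jitter, and per-domain throttling. It is written in Python with the standard library only; the names `DomainThrottle`, `backoff_delays`, and `fetch_with_retries`, and all default values, are hypothetical and not taken from the project. The `fetch` callable stands in for whatever actually loads a page (for example, a Playwright navigation).

```python
import random
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # domain -> time of the last request

    def wait(self, url):
        """Block until the domain of `url` may be requested again."""
        domain = urlparse(url).netloc
        last = self._last_request.get(domain)
        now = self._clock()
        if last is not None:
            remaining = self.min_interval_s - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request[domain] = now


def backoff_delays(max_retries, base_s=0.5, cap_s=8.0):
    """Return capped exponential delays: base, 2*base, 4*base, ..."""
    return [min(cap_s, base_s * 2 ** i) for i in range(max_retries)]


def fetch_with_retries(url, fetch, throttle, max_retries=3, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with jittered backoff."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        throttle.wait(url)  # respect the per-domain rate limit on every try
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Full jitter: sleep a random amount up to the backoff delay.
            sleep(random.uniform(0, delays[attempt]))
```

A caller would create one `DomainThrottle` per crawl and pass it to every `fetch_with_retries` call, so that requests to the same host stay spaced out even when they come from different search results. Catching every exception is a simplification; a real pipeline would likely retry only on timeouts and transient HTTP errors.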

Technologies

LLM, Next.js, Express.js, WebSocket

Topics

Web Scraping, Chatbots
