This scraper uses AI to pull structured data from web pages based on natural language descriptions. Instead of writing selectors or worrying about DOM details, you describe the information you want and the scraper handles the parsing. It produces validated JSON and handles complex, dynamic pages by rendering them in a full browser before extraction.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
The Structured Extract Scraper combines browser automation with multiple LLM providers to intelligently extract structured data from websites. It’s ideal for analysts, developers, and automation workflows where writing custom parsing logic is either slow or limiting. By generating schemas automatically and validating outputs, it ensures extracted data remains consistent and reliable.
- Lets you define extraction goals in plain language instead of CSS selectors.
- Uses AI models to understand content and structure your results.
- Handles dynamic pages using full browser rendering.
- Validates extracted results using automatically generated schemas.
- Runs efficiently thanks to the Bun runtime.
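The steps above can be sketched as a single orchestration function. This is a minimal illustration under stated assumptions, not the project's actual code: `runExtraction` and the four helpers in `deps` (`renderPage`, `buildSchema`, `extractWithModel`, `validate`) are hypothetical names standing in for the browser, schema, LLM, and validation modules, whose real interfaces are not shown in this README.

```javascript
// Hypothetical orchestration of the extraction steps. Every helper in `deps`
// is an injected stand-in (an assumption), so the flow can be read and tested
// without a browser or an LLM API key.
async function runExtraction({ url, description, model }, deps) {
  // 1. Render the page in a real browser so JavaScript-driven content exists.
  const pageText = await deps.renderPage(url);

  // 2. Turn the plain-language description into an expected output shape.
  const schema = await deps.buildSchema(description);

  // 3. Ask the chosen model to fill that shape from the rendered content.
  const result = await deps.extractWithModel({ model, pageText, description, schema });

  // 4. Refuse to return data that does not match the generated schema.
  const problems = deps.validate(result, schema);
  if (problems.length > 0) {
    throw new Error(`Extraction failed validation: ${problems.join("; ")}`);
  }

  // Assemble a record with the output fields this README documents.
  return {
    description,
    schema,
    result,
    modelUsed: model,
    url,
    metadata: { timestamp: new Date().toISOString() },
  };
}
```

Passing the helpers in as arguments keeps the flow independent of any one LLM provider, which fits the multi-provider design described here.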
| Feature | Description |
|---|---|
| Multi-LLM Extraction | Supports OpenAI, Anthropic, and Google AI models for flexible parsing options. |
| Smart Schema Generation | Auto-creates Zod schemas based on your natural language instructions. |
| High-Performance Runtime | Runs on Bun for significantly faster execution times. |
| Browser Rendering | Uses Playwright to handle dynamic or JavaScript-heavy pages. |
| Structured JSON Output | Ensures validated, structured data ready for downstream use. |
| Standby Mode | Provides an HTTP server interface for repeated, programmatic extraction tasks. |
| Field Name | Field Description |
|---|---|
| description | Natural language description of what data should be extracted. |
| schema | AI-generated schema describing expected output structure. |
| result | Final extracted JSON object that conforms to the schema. |
| modelUsed | The AI model used for extraction. |
| url | Webpage URL from which data was extracted. |
| metadata | Additional internal extraction information. |
[
  {
    "description": "Extract product name, price, and rating from the page",
    "schema": {
      "productName": "string",
      "price": "string",
      "rating": "number"
    },
    "result": {
      "productName": "UltraClean Air Purifier",
      "price": "$129.99",
      "rating": 4.7
    },
    "modelUsed": "OpenAI-GPT",
    "url": "https://example.com/product/air-purifier",
    "metadata": {
      "timestamp": "2024-03-20T12:44:11Z"
    }
  }
]
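To make the relationship between `schema` and `result` concrete, here is a small sketch of checking a result against a flat schema map like the one in the sample. It uses plain `typeof` checks as a stand-in for the generated Zod schemas, and `validateAgainstSchema` is a hypothetical helper name rather than a function from this project.

```javascript
// Checks `result` against a flat schema map such as
// { productName: "string", rating: "number" }.
// Returns a list of human-readable problems; an empty list means valid.
// Assumption: plain typeof checks stand in for the real Zod validation.
function validateAgainstSchema(result, schema) {
  const problems = [];

  for (const [field, expectedType] of Object.entries(schema)) {
    if (!Object.hasOwn(result, field)) {
      problems.push(`missing field "${field}"`);
      continue;
    }
    const actualType = typeof result[field];
    if (actualType !== expectedType) {
      problems.push(`field "${field}" should be ${expectedType}, got ${actualType}`);
    }
  }

  // Flag fields the schema never asked for, so stray model output is visible.
  for (const field of Object.keys(result)) {
    if (!Object.hasOwn(schema, field)) {
      problems.push(`unexpected field "${field}"`);
    }
  }

  return problems;
}

const sampleSchema = { productName: "string", price: "string", rating: "number" };

// The sample result matches the schema, so no problems are reported.
console.log(validateAgainstSchema(
  { productName: "UltraClean Air Purifier", price: "$129.99", rating: 4.7 },
  sampleSchema,
));

// Here "price" is missing and "rating" is a string, so two problems are reported.
console.log(validateAgainstSchema(
  { productName: "UltraClean Air Purifier", rating: "4.7" },
  sampleSchema,
));
```

Returning a list of problems instead of throwing on the first one makes it easier to log everything a model got wrong in a single pass.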
Structured Extract/
├── src/
│   ├── main.js
│   ├── extractor/
│   │   ├── llm_client.js
│   │   ├── schema_builder.js
│   │   └── ai_parser.js
│   ├── browser/
│   │   ├── playwright_driver.js
│   │   └── page_renderer.js
│   ├── utils/
│   │   ├── validation.js
│   │   └── logger.js
│   └── config/
│       └── settings.example.json
├── server/
│   └── standby_server.js
├── data/
│   ├── sample_prompt.json
│   └── sample_output.json
├── package.json
└── README.md
- Data analysts extract specific product details, metadata, or structured information without building parsers.
- Developers integrate AI-powered extraction into automation pipelines or internal tools.
- Market researchers pull consistent datasets from multiple competing sites using natural language prompts.
- Content teams convert messy web pages into structured data for analysis or republishing.
- API builders use Standby Mode to turn the scraper into an extraction microservice.
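For the microservice use case, a request handler might look roughly like the sketch below. It is an assumption-laden illustration, not the contents of `server/standby_server.js`: the handler name, the `{ url, description }` request body, and the status codes are all hypothetical, and `extract` is an injected stand-in for the real extraction logic. It relies only on the standard `Request` and `Response` objects available in Bun and Node 18+.

```javascript
// Small helper that builds a JSON response with the given HTTP status.
function jsonResponse(data, status = 200) {
  return new Response(JSON.stringify(data), {
    status,
    headers: { "content-type": "application/json" },
  });
}

// Hypothetical handler for a standby-style extraction endpoint.
// `extract` is injected so the handler can be exercised without a browser or LLM.
async function handleExtractRequest(request, extract) {
  if (request.method !== "POST") {
    return jsonResponse({ error: "Use POST" }, 405);
  }

  let body;
  try {
    body = await request.json();
  } catch {
    return jsonResponse({ error: "Body must be valid JSON" }, 400);
  }

  const { url, description } = body ?? {};
  if (typeof url !== "string" || typeof description !== "string") {
    return jsonResponse({ error: 'Both "url" and "description" must be strings' }, 400);
  }

  try {
    const record = await extract({ url, description });
    return jsonResponse(record);
  } catch (error) {
    // Report extraction failures to the caller instead of crashing the server.
    const message = error instanceof Error ? error.message : String(error);
    return jsonResponse({ error: message }, 502);
  }
}
```

Under Bun, a handler of this shape could be passed as the `fetch` option of `Bun.serve`, for example `Bun.serve({ port: 3000, fetch: (req) => handleExtractRequest(req, extract) })`, where the port number is only an example.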
Do I need to write selectors?
No. Just describe what you want in plain language and the scraper handles the rest.
Which AI models are supported?
OpenAI, Anthropic, and Google AI models.
Does it work on JavaScript-heavy websites?
Yes, Playwright browser automation renders complex pages before extraction.
How is output validated?
Zod schemas are generated dynamically from your description, and each extracted result is validated against its schema to confirm it has the expected fields and types.
Primary Metric:
Completes extraction tasks within seconds thanks to Bun’s fast startup and execution performance.
Reliability Metric:
Keeps output structurally consistent by validating every extracted result against its AI-generated schema.
Efficiency Metric:
Lightweight runtime minimizes resource usage even during repeated extraction calls.
Quality Metric:
Produces clean, highly structured JSON outputs suitable for analytics, APIs, and automation pipelines.
