lice-ernier/Structured-Extract-Scraper

Structured Extract Scraper

This scraper uses AI to pull structured data from webpages based on natural language descriptions. Instead of writing selectors or handling DOM details, you describe the information you want and the scraper does the parsing. It produces validated JSON and renders dynamic, JavaScript-heavy pages in a full browser before extraction.

Created by Bitbash, built to showcase our approach to Scraping and Automation!

Introduction

The Structured Extract Scraper combines browser automation with multiple LLM providers to intelligently extract structured data from websites. It’s ideal for analysts, developers, and automation workflows where writing custom parsing logic is either slow or limiting. By generating schemas automatically and validating outputs, it ensures extracted data remains consistent and reliable.
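
The flow described above (render the page, generate a schema, extract, validate) can be sketched as a small pipeline. This is only an illustration under stated assumptions: `PipelineSteps`, `runExtraction`, and the three step functions are hypothetical names, not the project's actual API, and the steps are injected so the sketch runs without a browser or an LLM key.

```typescript
// All names below are hypothetical stand-ins, not the project's real API.
type Schema = Record<string, "string" | "number" | "boolean">;

interface PipelineSteps {
  // e.g. a Playwright-backed function returning the rendered page text
  renderPage: (url: string) => Promise<string>;
  // e.g. an LLM call that proposes fields for the description
  buildSchema: (description: string) => Promise<Schema>;
  // e.g. an LLM call that fills the schema from the page text
  extract: (
    pageText: string,
    schema: Schema,
    description: string
  ) => Promise<Record<string, unknown>>;
}

async function runExtraction(steps: PipelineSteps, url: string, description: string) {
  const pageText = await steps.renderPage(url);
  const schema = await steps.buildSchema(description);
  const result = await steps.extract(pageText, schema, description);

  // Reject results whose fields are missing or have the wrong primitive type.
  for (const field of Object.keys(schema)) {
    if (typeof result[field] !== schema[field]) {
      throw new Error(`Field "${field}" failed validation`);
    }
  }
  return { description, schema, result, url };
}

// Stub steps so the sketch can be exercised offline.
const stubSteps: PipelineSteps = {
  renderPage: async (): Promise<string> => "UltraClean Air Purifier, rated 4.7",
  buildSchema: async (): Promise<Schema> => ({ productName: "string", rating: "number" }),
  extract: async (): Promise<Record<string, unknown>> => ({
    productName: "UltraClean Air Purifier",
    rating: 4.7,
  }),
};
```

Injecting the steps keeps the orchestration separate from the browser and model code, which makes it straightforward to swap providers or test the flow with stubs.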

Why It’s Powerful

  • Lets you define extraction goals in plain language instead of CSS selectors.
  • Uses AI models to understand content and structure your results.
  • Handles dynamic pages using full browser rendering.
  • Validates extracted results using automatically generated schemas.
  • Runs efficiently thanks to the Bun runtime.

Features

| Feature | Description |
|---|---|
| Multi-LLM Extraction | Supports OpenAI, Anthropic, and Google AI models for flexible parsing options. |
| Smart Schema Generation | Auto-creates Zod schemas based on your natural language instructions. |
| High-Performance Runtime | Runs on Bun for significantly faster execution times. |
| Browser Rendering | Uses Playwright to handle dynamic or JavaScript-heavy pages. |
| Structured JSON Output | Ensures validated, structured data ready for downstream use. |
| Standby Mode | Provides an HTTP server interface for repeated, programmatic extraction tasks. |

What Data This Scraper Extracts

| Field Name | Field Description |
|---|---|
| description | Natural language description of what data should be extracted. |
| schema | AI-generated schema describing expected output structure. |
| result | Final extracted JSON object that conforms to the schema. |
| modelUsed | The AI model used for extraction. |
| url | Webpage URL from which data was extracted. |
| metadata | Additional internal extraction information. |
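
For downstream consumers, the fields above can be captured as a type plus a runtime guard. The sketch below is an assumption based only on the field table: `ExtractionRecord` and `isExtractionRecord` are hypothetical names, and the contents of `metadata` are left open because the README does not specify them.

```typescript
// Assumed shape, inferred from the field table; not taken from the project's source.
interface ExtractionRecord {
  description: string;
  schema: Record<string, unknown>;
  result: Record<string, unknown>;
  modelUsed: string;
  url: string;
  metadata: Record<string, unknown>; // contents unspecified in the README
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical guard: checks that an unknown value carries the six documented fields.
function isExtractionRecord(value: unknown): value is ExtractionRecord {
  if (!isPlainObject(value)) return false;
  return (
    typeof value.description === "string" &&
    isPlainObject(value.schema) &&
    isPlainObject(value.result) &&
    typeof value.modelUsed === "string" &&
    typeof value.url === "string" &&
    isPlainObject(value.metadata)
  );
}
```

A guard like this is useful when records are read back from disk or received over HTTP, where the static type alone gives no guarantee.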

Example Output

[
  {
    "description": "Extract product name, price, and rating from the page",
    "schema": {
      "productName": "string",
      "price": "string",
      "rating": "number"
    },
    "result": {
      "productName": "UltraClean Air Purifier",
      "price": "$129.99",
      "rating": 4.7
    },
    "modelUsed": "OpenAI-GPT",
    "url": "https://example.com/product/air-purifier",
    "metadata": {
      "timestamp": "2024-03-20T12:44:11Z"
    }
  }
]

Directory Structure Tree

Structured Extract/
├── src/
│   ├── main.js
│   ├── extractor/
│   │   ├── llm_client.js
│   │   ├── schema_builder.js
│   │   └── ai_parser.js
│   ├── browser/
│   │   ├── playwright_driver.js
│   │   └── page_renderer.js
│   ├── utils/
│   │   ├── validation.js
│   │   └── logger.js
│   └── config/
│       └── settings.example.json
├── server/
│   └── standby_server.js
├── data/
│   ├── sample_prompt.json
│   └── sample_output.json
├── package.json
└── README.md

Use Cases

  • Data analysts extract specific product details, metadata, or structured information without building parsers.
  • Developers integrate AI-powered extraction into automation pipelines or internal tools.
  • Market researchers pull consistent datasets from multiple competing sites using natural language prompts.
  • Content teams convert messy web pages into structured data for analysis or republishing.
  • API builders use Standby Mode to turn the scraper into an extraction microservice.
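
For the microservice use case, the request handling in front of the extractor might look like the sketch below. The README does not document the Standby Mode request format, so the `ExtractRequest` shape (`url` plus `description`), the status codes, and `parseExtractRequest` itself are all assumptions made for illustration.

```typescript
// Hypothetical request shape; the real Standby Mode contract is not documented here.
interface ExtractRequest {
  url: string;
  description: string;
}

type ParseOutcome =
  | { ok: true; request: ExtractRequest }
  | { ok: false; statusCode: number; message: string };

// Validates a raw JSON body before any browser or LLM work is started.
function parseExtractRequest(rawBody: string): ParseOutcome {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { ok: false, statusCode: 400, message: "Body is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, statusCode: 400, message: "Body must be a JSON object" };
  }
  const { url, description } = parsed as Record<string, unknown>;
  if (typeof url !== "string" || !/^https?:\/\//.test(url)) {
    return { ok: false, statusCode: 422, message: "url must be an http(s) URL" };
  }
  if (typeof description !== "string" || description.trim() === "") {
    return { ok: false, statusCode: 422, message: "description must be a non-empty string" };
  }
  return { ok: true, request: { url, description } };
}
```

Rejecting malformed requests up front avoids launching a browser session or spending model tokens on input that cannot succeed.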

FAQs

Do I need to write selectors?
No. Describe what you want in plain language and the scraper handles the rest.

Which AI models are supported?
OpenAI, Anthropic, and Google AI models.
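
One way a multi-provider setup could choose which provider to call is to check which API keys are configured. The sketch below is an assumption, not the project's selection logic: `pickProvider` is a hypothetical helper, and the environment variable names are guesses that should be checked against the project's settings file.

```typescript
type Provider = "openai" | "anthropic" | "google";

// Assumed environment variable names; the project may use different ones.
const API_KEY_VARS: Record<Provider, string> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_API_KEY",
};

// Returns the preferred provider if its key is set, otherwise the first configured one.
function pickProvider(
  env: Record<string, string | undefined>,
  preferred?: Provider
): Provider {
  const order: Provider[] = ["openai", "anthropic", "google"];
  if (preferred) {
    if (!env[API_KEY_VARS[preferred]]) {
      throw new Error(`No API key configured for ${preferred}`);
    }
    return preferred;
  }
  for (const candidate of order) {
    if (env[API_KEY_VARS[candidate]]) return candidate;
  }
  throw new Error("No LLM provider is configured");
}
```

Passing the environment in as an argument, rather than reading it globally, keeps the helper easy to test.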

Does it work on JavaScript-heavy websites?
Yes, Playwright browser automation renders complex pages before extraction.

How is output validated?
Zod schemas are generated dynamically and used to confirm that each extracted result matches the expected structure and field types.
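
To make the validation step concrete, here is a dependency-free sketch that checks a result against a simple field-to-type descriptor like the one in the Example Output. It is a stand-in for illustration only: the project is described as using Zod, and `validateResult` and `SchemaDescriptor` are hypothetical names, not part of Zod or of this scraper.

```typescript
type FieldType = "string" | "number" | "boolean";
type SchemaDescriptor = Record<string, FieldType>;

// Returns a list of human-readable problems; an empty list means the result conforms.
function validateResult(
  schema: SchemaDescriptor,
  result: Record<string, unknown>
): string[] {
  const problems: string[] = [];
  for (const field of Object.keys(schema)) {
    if (!(field in result)) {
      problems.push(`${field}: missing`);
      continue;
    }
    const actual = typeof result[field];
    if (actual !== schema[field]) {
      problems.push(`${field}: expected ${schema[field]}, got ${actual}`);
    }
  }
  return problems;
}

const productSchema: SchemaDescriptor = {
  productName: "string",
  price: "string",
  rating: "number",
};

// The rating arrives as a string here, so validation flags it.
const problems = validateResult(productSchema, {
  productName: "UltraClean Air Purifier",
  price: "$129.99",
  rating: "4.7",
});
console.log(problems); // ["rating: expected number, got string"]
```

Note that this kind of check confirms structure and types only; it cannot tell whether an extracted value is factually what appears on the page.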


Performance Benchmarks and Results

Primary Metric:
Completes extraction tasks within seconds thanks to Bun’s fast startup and execution performance.

Reliability Metric:
Validates every extracted result against its AI-generated schema, which catches missing fields and type mismatches.

Efficiency Metric:
Lightweight runtime minimizes resource usage even during repeated extraction calls.

Quality Metric:
Produces clean, highly structured JSON outputs suitable for analytics, APIs, and automation pipelines.


