AI Web Scraper

This project is a web application that allows users to scrape content from a website and then use an AI model to extract specific information from that content.

Features

  • Scrape websites using a provided URL.
  • View the cleaned DOM content of the scraped page.
  • Use natural language to specify what information to extract.
  • Leverages Ollama with a local large language model to parse the content.

How It Works

  1. Scraping: The application uses Selenium to fetch the HTML content of the provided URL.
  2. Cleaning: BeautifulSoup is used to parse the HTML, remove <script> and <style> tags, and extract the text content from the body.
  3. Parsing: The user provides a description of the desired information. This description, along with the cleaned text, is sent to a large language model powered by Ollama and LangChain.
  4. Extraction: The AI model processes the text and extracts the information that matches the user's description.
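The scraping and cleaning steps (1–2) can be sketched as follows. This is a minimal illustration rather than the project's actual scrape.py: it assumes BeautifulSoup is installed, and the Selenium fetch is reduced to a raw-HTML input so the cleaning logic stands on its own.

```python
from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> str:
    """Strip <script>/<style> tags and return the body's text content."""
    soup = BeautifulSoup(raw_html, "html.parser")
    body = soup.body if soup.body is not None else soup
    # Remove non-content tags before extracting text.
    for tag in body(["script", "style"]):
        tag.decompose()
    # Collapse whitespace: one non-empty line per block of text.
    return "\n".join(
        line.strip()
        for line in body.get_text(separator="\n").splitlines()
        if line.strip()
    )

html = "<html><body><h1>Title</h1><script>var x=1;</script><p>Hello</p></body></html>"
print(clean_html(html))  # → "Title\nHello"
```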
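Step 3 typically also involves splitting long page text into chunks that fit the model's context window and wrapping each chunk in an instruction prompt. The project's parse.py is not reproduced here; the sketch below shows one plausible shape, with the actual Ollama/LangChain call left as a comment. The function names, prompt wording, and chunk size are all assumptions.

```python
def split_dom_content(dom_content: str, max_chars: int = 6000) -> list[str]:
    """Split cleaned page text into fixed-size chunks for the LLM."""
    return [dom_content[i:i + max_chars]
            for i in range(0, len(dom_content), max_chars)]

PROMPT_TEMPLATE = (
    "You are extracting information from scraped web text.\n"
    "Text: {dom_content}\n"
    "Extract only the information matching this description: {parse_description}\n"
    "Return an empty string if nothing matches."
)

def build_prompts(dom_content: str, parse_description: str) -> list[str]:
    """One fully formatted prompt per chunk of page text."""
    return [
        PROMPT_TEMPLATE.format(dom_content=chunk, parse_description=parse_description)
        for chunk in split_dom_content(dom_content)
    ]

# Each prompt would then be sent to the local model, e.g. via LangChain's
# OllamaLLM wrapper: OllamaLLM(model="llama3.2:1b").invoke(prompt)
```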

Setup and Installation

Prerequisites

  • Python 3.7+
  • A running instance of Ollama with the llama3.2:1b model pulled (`ollama pull llama3.2:1b`).
  • Google Chrome and chromedriver.exe (matching your Chrome version) in the root of the project directory.
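Because chromedriver.exe is expected in the project root, the scraper presumably points Selenium at that local path rather than relying on the system PATH. A minimal sketch (Selenium 4 API; the helper name is invented for illustration):

```python
import os

def chromedriver_path(project_root: str = ".") -> str:
    """Expected location of the bundled driver inside the project root."""
    return os.path.join(project_root, "chromedriver.exe")

# Hypothetical wiring inside scrape.py (not executed here):
# from selenium import webdriver
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service(chromedriver_path()))
```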

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd <repository-directory>
  2. Create a virtual environment (optional but recommended):

    python -m venv ai
    source ai/bin/activate  # On Windows, use `ai\Scripts\activate`
  3. Install dependencies:

    pip install -r requirements.txt
  4. Download ChromeDriver: Download the version of chromedriver.exe that corresponds to your Google Chrome version and place it in the root directory of the project.

How to Run

  1. Make sure your Ollama instance is running.

  2. Run the Streamlit application:

    streamlit run main.py
  3. Open your web browser and navigate to the URL provided by Streamlit.

Project Structure

.
├── ai/                   # Virtual environment
├── chromedriver.exe      # Selenium WebDriver for Chrome
├── main.py               # Main Streamlit application file
├── parse.py              # Handles parsing with Ollama and LangChain
├── requirements.txt      # Project dependencies
├── scrape.py             # Handles website scraping and cleaning
└── README.md             # This file
