AI Web Scraper

This project is a web application that allows users to scrape content from a website and then use an AI model to extract specific information from that content.

Features

  • Scrape websites using a provided URL.
  • View the cleaned DOM content of the scraped page.
  • Use natural language to specify what information to extract.
  • Leverages Ollama with a local large language model to parse the content.

How It Works

  1. Scraping: The application uses Selenium to fetch the HTML content of the provided URL.
  2. Cleaning: BeautifulSoup is used to parse the HTML, remove <script> and <style> tags, and extract the text content from the body.
  3. Parsing: The user provides a description of the desired information. This description, along with the cleaned text, is sent to a large language model powered by Ollama and LangChain.
  4. Extraction: The AI model processes the text and extracts the information that matches the user's description.
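The scraping and cleaning steps (1–2) can be sketched as follows. This is a minimal illustration rather than the project's actual scrape.py: it assumes BeautifulSoup is installed, and the Selenium fetch is reduced to a raw-HTML input so the cleaning logic stands on its own.

```python
from bs4 import BeautifulSoup

def clean_html(raw_html: str) -> str:
    """Strip <script>/<style> tags and return the body's text content."""
    soup = BeautifulSoup(raw_html, "html.parser")
    body = soup.body if soup.body is not None else soup
    # Remove non-content tags before extracting text.
    for tag in body(["script", "style"]):
        tag.decompose()
    # Collapse whitespace: one non-empty line per block of text.
    return "\n".join(
        line.strip()
        for line in body.get_text(separator="\n").splitlines()
        if line.strip()
    )

html = "<html><body><h1>Title</h1><script>var x=1;</script><p>Hello</p></body></html>"
print(clean_html(html))  # → "Title\nHello"
```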
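Step 3 typically also involves splitting long page text into chunks that fit the model's context window and wrapping each chunk in an instruction prompt. The project's parse.py is not reproduced here; the sketch below shows one plausible shape, with the actual Ollama/LangChain call left as a comment. The function names, prompt wording, and chunk size are all assumptions.

```python
def split_dom_content(dom_content: str, max_chars: int = 6000) -> list[str]:
    """Split cleaned page text into fixed-size chunks for the LLM."""
    return [dom_content[i:i + max_chars]
            for i in range(0, len(dom_content), max_chars)]

PROMPT_TEMPLATE = (
    "You are extracting information from scraped web text.\n"
    "Text: {dom_content}\n"
    "Extract only the information matching this description: {parse_description}\n"
    "Return an empty string if nothing matches."
)

def build_prompts(dom_content: str, parse_description: str) -> list[str]:
    """One fully formatted prompt per chunk of page text."""
    return [
        PROMPT_TEMPLATE.format(dom_content=chunk, parse_description=parse_description)
        for chunk in split_dom_content(dom_content)
    ]

# Each prompt would then be sent to the local model, e.g. via LangChain's
# OllamaLLM wrapper: OllamaLLM(model="llama3.2:1b").invoke(prompt)
```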

Setup and Installation

Prerequisites

  • Python 3.7+
  • A running instance of Ollama with the llama3.2:1b model pulled (`ollama pull llama3.2:1b`).
  • Google Chrome and chromedriver.exe (matching your Chrome version) in the root of the project directory.
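Because chromedriver.exe is expected in the project root, the scraper presumably points Selenium at that local path rather than relying on the system PATH. A minimal sketch (Selenium 4 API; the helper name is invented for illustration):

```python
import os

def chromedriver_path(project_root: str = ".") -> str:
    """Expected location of the bundled driver inside the project root."""
    return os.path.join(project_root, "chromedriver.exe")

# Hypothetical wiring inside scrape.py (not executed here):
# from selenium import webdriver
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service(chromedriver_path()))
```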

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd <repository-directory>
  2. Create a virtual environment (optional but recommended):

    python -m venv ai
    source ai/bin/activate  # On Windows, use `ai\Scripts\activate`
  3. Install dependencies:

    pip install -r requirements.txt
  4. Download ChromeDriver: Download the version of chromedriver.exe that corresponds to your Google Chrome version and place it in the root directory of the project.

How to Run

  1. Make sure your Ollama instance is running.

  2. Run the Streamlit application:

    streamlit run main.py
  3. Open your web browser and navigate to the URL provided by Streamlit.

Project Structure

.
├── ai/                   # Virtual environment
├── chromedriver.exe      # Selenium WebDriver for Chrome
├── main.py               # Main Streamlit application file
├── parse.py              # Handles parsing with Ollama and LangChain
├── requirements.txt      # Project dependencies
├── scrape.py             # Handles website scraping and cleaning
└── README.md             # This file
