This project is a web application that allows users to scrape content from a website and then use an AI model to extract specific information from that content.
- Scrape websites using a provided URL.
- View the cleaned DOM content of the scraped page.
- Use natural language to specify what information to extract.
- Leverage Ollama with a local large language model to parse the content.
- Scraping: The application uses Selenium to fetch the HTML content of the provided URL.
- Cleaning: BeautifulSoup is used to parse the HTML, remove `<script>` and `<style>` tags, and extract the text content from the body.
- Parsing: The user provides a description of the desired information. This description, along with the cleaned text, is sent to a large language model powered by Ollama and LangChain.
- Extraction: The AI model processes the text and extracts the information that matches the user's description.
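The cleaning step described above can be sketched as follows. This is a minimal illustration, not necessarily the exact code in `scrape.py`; the function name `clean_body_content` is an assumption.

```python
from bs4 import BeautifulSoup


def clean_body_content(html: str) -> str:
    """Strip <script>/<style> tags and return the visible body text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove tags whose contents should never reach the language model.
    for tag in soup(["script", "style"]):
        tag.decompose()
    body = soup.body or soup
    text = body.get_text(separator="\n")
    # Drop the blank lines left behind by removed tags.
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```

The resulting plain text is what gets displayed in the app and passed to the model.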
- Python 3.7+
- A running instance of Ollama with the `llama3.2:1b` model pulled.
- Google Chrome and `chromedriver.exe` (matching your Chrome version) in the root of the project directory.
- Clone the repository:

  ```shell
  git clone <repository-url>
  cd <repository-directory>
  ```
- Create a virtual environment (optional but recommended):

  ```shell
  python -m venv ai
  source ai/bin/activate  # On Windows, use ai\Scripts\activate
  ```
- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- Download ChromeDriver: download the version of `chromedriver.exe` that corresponds to your Google Chrome version and place it in the root directory of the project.
- Make sure your Ollama instance is running.
- Run the Streamlit application:

  ```shell
  streamlit run main.py
  ```
- Open your web browser and navigate to the URL provided by Streamlit.
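Pages longer than the model's context window have to be split before parsing. Below is a minimal sketch of one common approach, fixed-size character chunks; the function name and the 6000-character default are illustrative assumptions, not necessarily what `parse.py` does.

```python
def split_content(text, max_chars=6000):
    """Split cleaned page text into chunks small enough for the model's context window."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Each chunk would then be sent to the model together with the user's extraction prompt, and the per-chunk results combined.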
```
.
├── ai/                # Virtual environment
├── chromedriver.exe   # Selenium WebDriver for Chrome
├── main.py            # Main Streamlit application file
├── parse.py           # Handles parsing with Ollama and LangChain
├── requirements.txt   # Project dependencies
├── scrape.py          # Handles website scraping and cleaning
└── README.md          # This file
```