|
1 |
| -## 💻 Web Scrapping AI Agent |
2 |
| -This Streamlit app allows you to scrape a website using OpenAI API and the scrapegraphai library. Simply provide your OpenAI API key, enter the URL of the website you want to scrape, and specify what you want the AI agent to extract from the website. |
3 | 2 |
|
4 |
| -### Features |
5 |
| -- Scrape any website by providing the URL |
6 |
| -- Utilize OpenAI's LLMs (GPT-3.5-turbo or GPT-4) for intelligent scraping |
7 |
| -- Customize the scraping task by specifying what you want the AI agent to extract |
| 3 | +--- |
8 | 4 |
|
9 |
| -### How to get Started? |
| 5 | +# 💻 Web Scraping AI Agent |
10 | 6 |
|
11 |
| -1. Clone the GitHub repository |
| 7 | +This **Apify Actor** scrapes websites using **OpenAI's API** and the `scrapegraphai` library. Provide a URL and describe the data you want extracted, and the Actor returns the result. It was originally a Streamlit app and has been refactored to run directly on the **Apify platform**. 🚀
12 | 8 |
|
13 |
| -```bash |
14 |
| -git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git |
15 |
| -cd awesome-llm-apps/advanced_tools_frameworks/web_scrapping_ai_agent |
16 |
| -``` |
17 |
| -2. Install the required dependencies: |
| 9 | +## 💡 Why Apify Actors Are Powerful |
18 | 10 |
|
19 |
| -```bash |
20 |
| -pip install -r requirements.txt |
21 |
| -``` |
22 |
| -3. Get your OpenAI API Key |
| 11 | +Apify Actors are fully managed, cloud-based containers designed for tasks like web scraping, automation, and data extraction, so you can run scraping jobs at scale without maintaining your own infrastructure. [Learn more about Apify Actors in the whitepaper](https://whitepaper.actor/). 📖
23 | 12 |
|
24 |
| -- Sign up for an [OpenAI account](https://platform.openai.com/) (or the LLM provider of your choice) and obtain your API key. |
| 13 | +--- |
25 | 14 |
|
26 |
| -4. Run the Streamlit App |
27 |
| -```bash |
28 |
| -streamlit run ai_scrapper.py |
29 |
| -``` |
| 15 | +## 🌟 Features |
30 | 16 |
|
31 |
| -### How it Works? |
| 17 | +- **Scrape any website** by providing the URL. 🌍 |
| 18 | +- **Leverage OpenAI's LLMs** (GPT-3.5-turbo or GPT-4) for intelligent data extraction. 🤖💬 |
| 19 | +- **Run as an Apify Actor** on the Apify platform for seamless deployment and scaling. ⚡ |
| 20 | +- **Customize your scraping task** by providing specific user prompts. ✍️ |
32 | 21 |
|
33 |
| -- The app prompts you to enter your OpenAI API key, which is used to authenticate and access the OpenAI language models. |
34 |
| -- You can select the desired language model (GPT-3.5-turbo or GPT-4) for the scraping task. |
35 |
| -- Enter the URL of the website you want to scrape in the provided text input field. |
36 |
| -- Specify what you want the AI agent to extract from the website by entering a user prompt. |
37 |
| -- The app creates a SmartScraperGraph object using the provided URL, user prompt, and OpenAI configuration. |
38 |
| -- The SmartScraperGraph object scrapes the website and extracts the requested information using the specified language model. |
39 |
| -- The scraped results are displayed in the app for you to view |
| 22 | +--- |
| 23 | + |
| 24 | +## 🔧 How to Get Started? |
| 25 | + |
| 26 | +### 🅰️ Run as an Apify Actor |
| 27 | + |
| 28 | +See the full guide in the [Apify Academy](https://docs.apify.com/academy/getting-started/actors). 📚
| 29 | + |
| 30 | +This project is already set up as an **Apify Actor**, allowing you to easily deploy it on the Apify platform. |
| 31 | + |
| 32 | +1. **Initialize the Apify Actor** (already done in the repository): |
| 33 | + |
| 34 | + ```bash |
| 35 | + apify init |
| 36 | + ``` |
| 37 | + |
| 38 | + This creates `.actor/actor.json` with the configuration and the necessary Dockerfile. 🛠️ |
| 39 | + |
| 40 | +2. **Refactor the Code** (already done in the repository):
| 41 | +
| 42 | + The scraping logic now lives in a single function that takes the inputs (URL, prompt, model choice) and returns the extracted output. Because it no longer depends on a UI, the Actor entrypoint can call it directly. 🔄
| 43 | + |
| 44 | +3. **Build the Actor**: |
| 45 | + |
| 46 | + [Learn more about building an Actor in the Apify Docs](https://docs.apify.com/academy/getting-started/creating-actors#build-an-actor). 🏗️ |
| 47 | + |
| 48 | +4. **Run the Actor**: |
| 49 | + |
| 50 | + [Learn how to run Actors in the Apify Console](https://docs.apify.com/academy/getting-started/creating-actors#run-the-actor). 📚
| 51 | + This starts a run on the Apify platform, and the run's log details the results.
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +### 🅱️ Run Locally |
| 56 | + |
| 57 | +1. **Clone the GitHub Repository:** |
| 58 | + |
| 59 | + ```bash |
| 60 | + git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git |
| 61 | + cd awesome-llm-apps/advanced_tools_frameworks/web_scrapping_ai_agent |
| 62 | + ``` |
| 63 | + |
| 64 | +2. **Install Dependencies:** |
| 65 | + |
| 66 | + ```bash |
| 67 | + pip install -r requirements.txt |
| 68 | + ``` |
| 69 | + |
| 70 | +3. **Get Your OpenAI API Key:** |
| 71 | + - Sign up for an [OpenAI account](https://platform.openai.com/) (or another LLM provider) and obtain your API key. 🔑 |
| 72 | + |
| 73 | +4. **Run the App:** |
| 74 | + |
| 75 | + Streamlit has been removed, so the app now consists only of the core scraping logic. Run it directly as a Python script:
| 76 | + |
| 77 | + ```bash |
| 78 | + python run_ai_scraper.py |
| 79 | + ``` |
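The README does not show the contents of `run_ai_scraper.py`, so the following is only a guess at what a small command-line wrapper around the scraping logic could look like. The flag names (`--url`, `--prompt`, `--model`), the `OPENAI_API_KEY` environment variable, and the `build_parser` and `build_graph_config` helpers are assumptions made for illustration. The config dictionary follows the `{"llm": {...}}` shape that `SmartScraperGraph` accepts, though the exact model-name format may differ between `scrapegraphai` versions.

```python
import argparse
import os
from typing import Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI; the real script may take its inputs differently."""
    parser = argparse.ArgumentParser(description="Scrape a page with an LLM.")
    parser.add_argument("--url", required=True, help="Page to scrape")
    parser.add_argument("--prompt", required=True, help="What to extract")
    parser.add_argument(
        "--model",
        default="gpt-3.5-turbo",
        choices=["gpt-3.5-turbo", "gpt-4"],
        help="OpenAI model to use",
    )
    return parser


def build_graph_config(api_key: str, model: str) -> Dict[str, Dict[str, str]]:
    """Build the config dict that is passed to SmartScraperGraph."""
    if not api_key:
        raise ValueError("Set OPENAI_API_KEY before running the scraper")
    return {"llm": {"api_key": api_key, "model": model}}


def main(argv: Optional[List[str]] = None) -> None:
    args = build_parser().parse_args(argv)
    config = build_graph_config(os.environ.get("OPENAI_API_KEY", ""), args.model)

    # Imported here so the helpers above work without scrapegraphai installed.
    from scrapegraphai.graphs import SmartScraperGraph

    graph = SmartScraperGraph(prompt=args.prompt, source=args.url, config=config)
    print(graph.run())
```

With the usual `if __name__ == "__main__": main()` guard added, a script like this could be invoked as `python run_ai_scraper.py --url https://example.com --prompt "List the headings"`. Reading the key from the environment keeps it out of shell history and source control.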
| 80 | + |
| 81 | +--- |
| 82 | + |
| 83 | +## 📦 Dockerized for Apify 🐳 |
| 84 | + |
| 85 | +This project includes a **Dockerfile** optimized for Apify deployment, inspired by the official Apify templates: |
| 86 | + |
| 87 | +- **Multi-stage build** to keep the Docker image as small as possible. |
| 88 | +- Installs **Rust** for building the `minify-html` dependency of `scrapegraphai`. |
| 89 | +- **Playwright update** for enhanced web scraping capabilities. |
| 90 | +- **Streamlit removal** to reduce unnecessary dependencies and keep the Actor lightweight.
| 91 | +- **Concurrency Handling**: The Actor's entrypoint function is designed to process multiple requests in parallel while managing their inputs and outputs.
| 92 | + |
| 93 | +--- |
| 94 | + |
| 95 | +## 🔍 How It Works |
| 96 | + |
| 97 | +1. When you start the Actor, you pass the necessary inputs to the **entrypoint function**. 🚪
| 98 | +2. The **entrypoint function** maps the Apify input parameters to the scraping logic function and handles concurrency. 🔄
| 99 | +3. The **scraping logic function** uses the inputs (URL, OpenAI model, user prompt) and returns the processed data. 🧠
| 100 | +4. The Actor returns the result, which can be saved to an Apify dataset or passed on for further processing. 📈
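The four steps above can be sketched in Python as follows. This is a minimal sketch, not the repository's code: the input field names (`url`, `prompt`, `model`), the default model, and the helpers `parse_actor_input` and `run_scrape` are assumptions. The scraping logic is passed in as a plain callable, so it could be a thin wrapper around `SmartScraperGraph(...).run()`. The `Actor.get_input()` and `Actor.push_data()` calls come from the Apify SDK for Python.

```python
import asyncio
from typing import Any, Callable, Dict, Optional

# Signature of the scraping logic function: (url, prompt, model) -> data.
ScrapeFn = Callable[[str, str, str], Any]


def parse_actor_input(actor_input: Optional[Dict[str, Any]]) -> Dict[str, str]:
    """Validate raw Actor input and fill in defaults (field names assumed)."""
    data = actor_input or {}
    url = str(data.get("url") or "").strip()
    prompt = str(data.get("prompt") or "").strip()
    if not url or not prompt:
        raise ValueError("Actor input needs non-empty 'url' and 'prompt' fields")
    model = data.get("model") or "gpt-3.5-turbo"
    return {"url": url, "prompt": prompt, "model": model}


def run_scrape(params: Dict[str, str], scrape: ScrapeFn) -> Dict[str, Any]:
    """Call the scraping logic and wrap its result in a dataset-ready record."""
    result = scrape(params["url"], params["prompt"], params["model"])
    return {**params, "result": result}


async def main(scrape: ScrapeFn) -> None:
    """Actor entrypoint: read input, scrape, store the result."""
    from apify import Actor  # lazy import: the helpers above need no SDK

    async with Actor:
        params = parse_actor_input(await Actor.get_input())
        # The scraper blocks, so run it in a worker thread off the event loop.
        record = await asyncio.to_thread(run_scrape, params, scrape)
        await Actor.push_data(record)
```

Keeping `parse_actor_input` and `run_scrape` free of Apify imports means they can be unit-tested with a stub scraper, which matches the separation of input handling from scraping logic that the README describes.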
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## 📖 Learn More |
| 105 | + |
| 106 | +- Want to understand why **Apify Actors** are the ideal solution for scalable web scraping? Check out the [Apify Whitepaper](https://whitepaper.actor/) for more insights. 📜 |
| 107 | + |
| 108 | +--- |
| 109 | + |
| 110 | +This **Web Scraping AI Agent** is perfect for AI-powered data extraction, whether you're conducting research, automating workflows, or gathering business intelligence. With Apify’s platform, you can deploy and scale the app with ease. 🚀 |
| 111 | + |
| 112 | +--- |
| 113 | + |
| 114 | +### Additional Notes for Apify Actor Development |
| 115 | + |
| 116 | +- **Code Refactor**: The code has been refactored to separate the logic into a function that accepts inputs and returns outputs, facilitating better scalability and reusability in the Apify Actor environment. 🛠️ |
| 117 | +- **Concurrency**: The entrypoint function handles multiple requests concurrently, allowing the Actor to scale efficiently. 🔄
| 118 | +- **No Streamlit**: Streamlit has been removed because the app now focuses on the backend scraping logic, which makes it suitable for deployment as an Apify Actor. ❌
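The README does not show how the entrypoint achieves concurrency. One common approach for a blocking scraper is to run each request in a worker thread and cap the number in flight with a semaphore. The sketch below assumes that approach; `scrape_many`, its `max_concurrency` parameter, and the shape of the returned records are hypothetical names for illustration, not part of the repository or of the Apify SDK.

```python
import asyncio
from typing import Any, Callable, Dict, List


async def scrape_many(
    requests: List[Dict[str, str]],
    scrape: Callable[[Dict[str, str]], Any],
    max_concurrency: int = 3,
) -> List[Dict[str, Any]]:
    """Run a blocking scrape function over several requests in parallel."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(request: Dict[str, str]) -> Dict[str, Any]:
        async with semaphore:
            try:
                # A worker thread keeps the event loop free while scraping blocks.
                result = await asyncio.to_thread(scrape, request)
                return {"request": request, "result": result, "error": None}
            except Exception as exc:  # one failure must not cancel the others
                return {"request": request, "result": None, "error": str(exc)}

    # gather() returns results in the same order as the input requests.
    return await asyncio.gather(*(run_one(r) for r in requests))
```

Capping concurrency matters here because each request triggers both a page load and an LLM call, so an unbounded fan-out could hit provider rate limits. Capturing errors per request lets a run report partial results instead of failing as a whole.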