Commit 8abdc49

docs: update README for Apify Actor deployment and enhance presentation with emojis
1 parent e57d00d commit 8abdc49

File tree

  • advanced_tools_frameworks/web_scrapping_ai_agent

1 file changed: 105 additions & 30 deletions

# 💻 Web Scraping AI Agent

This **Apify Actor** enables intelligent web scraping using the **OpenAI API** and the `scrapegraphai` library. With this app, you can scrape any website by simply providing the URL and specifying the data you want extracted. You can run it directly on the **Apify platform** for hassle-free scaling and management. 🚀

## 💡 Why Apify Actors Are Powerful

Apify Actors provide an easy and efficient way to run your web scraping tasks at scale. They are fully managed, cloud-based containers designed for tasks like web scraping, automation, and data extraction. [Learn more about Apify Actors in the whitepaper](https://whitepaper.actor/). 📖

---

## 🌟 Features

- **Scrape any website** by providing the URL. 🌍
- **Leverage OpenAI's LLMs** (GPT-3.5-turbo or GPT-4) for intelligent data extraction. 🤖💬
- **Run as an Apify Actor** on the Apify platform for seamless deployment and scaling. ⚡
- **Customize your scraping task** by providing specific user prompts. ✍️

---

## 🔧 How to Get Started?

### 🅰️ Run as an Apify Actor

See the full guide in the [Apify Academy](https://docs.apify.com/academy/getting-started/actors). 📚

This project is already set up as an **Apify Actor**, allowing you to easily deploy it on the Apify platform.

1. **Initialize the Apify Actor** (already done in the repository):

   ```bash
   apify init
   ```

   This creates `.actor/actor.json` with the configuration and the necessary Dockerfile. 🛠️

2. **Refactor the Code** (already done):

   The code has been refactored to separate the logic into a function that handles input and output independently, improving maintainability and scalability. This function takes the inputs (URL, prompt, model choice) and returns the extracted output, allowing for better handling in the Apify environment. 🔄

3. **Build the Actor**:

   [Learn more about building an Actor in the Apify Docs](https://docs.apify.com/academy/getting-started/creating-actors#build-an-actor). 🏗️

4. **Run the Actor**:

   [Learn how to run Actors in the Apify Console](https://docs.apify.com/academy/getting-started/creating-actors#run-the-actor). 📚
   This will trigger the process on the Apify platform, and you'll receive logs detailing the results.
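
For reference, deploying and running the Actor from your terminal with the Apify CLI typically looks like the sketch below. This is a generic illustration rather than a command sequence taken from this repository, and the input fields (URL, prompt, model) are assumed to be defined in the Actor's input schema:

```bash
# Authenticate the Apify CLI with your account (one-time setup)
apify login

# Upload the project to the Apify platform and build the Actor there
apify push

# Start a run of the Actor on the platform
# (set the run input, e.g. URL / prompt / model, in the Apify Console)
apify call
```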

---

### 🅱️ Run Locally

1. **Clone the GitHub Repository:**

   ```bash
   git clone https://github.com/Shubhamsaboo/awesome-llm-apps.git
   cd awesome-llm-apps/advanced_tools_frameworks/web_scrapping_ai_agent
   ```

2. **Install Dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Get Your OpenAI API Key:**

   - Sign up for an [OpenAI account](https://platform.openai.com/) (or another LLM provider) and obtain your API key. 🔑

4. **Run the App:**

   The app has been refactored to remove Streamlit and now focuses on the core scraping logic. You can run it directly as a Python script:

   ```bash
   python run_ai_scraper.py
   ```
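
The snippet below is a minimal sketch of what the refactored scraping function behind `run_ai_scraper.py` could look like, assuming it wraps `scrapegraphai`'s `SmartScraperGraph` (the class the original Streamlit version used) and reads the key from an `OPENAI_API_KEY` environment variable. The function and module names are illustrative rather than taken from the repository. Export your key first (for example `export OPENAI_API_KEY="sk-..."`) before running it:

```python
import os

from scrapegraphai.graphs import SmartScraperGraph


def scrape_website(url: str, user_prompt: str, model: str = "gpt-3.5-turbo") -> dict:
    """Scrape `url` and extract the data described in `user_prompt` with the chosen OpenAI model."""
    graph_config = {
        "llm": {
            # Read the key from the environment instead of hard-coding it.
            "api_key": os.environ["OPENAI_API_KEY"],
            "model": model,
        },
    }

    # SmartScraperGraph fetches the page and asks the LLM to extract the requested data.
    scraper = SmartScraperGraph(prompt=user_prompt, source=url, config=graph_config)
    return scraper.run()


if __name__ == "__main__":
    print(scrape_website("https://example.com", "List the main headings on this page"))
```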

---

## 📦 Dockerized for Apify 🐳

This project includes a **Dockerfile** optimized for Apify deployment, inspired by the official Apify templates:

- **Multi-stage build** to keep the Docker image as small as possible.
- **Rust toolchain install** for building the `minify-html` dependency of `scrapegraphai`.
- **Playwright update** for enhanced web scraping capabilities.
- **Streamlit removal** to reduce unnecessary dependencies and improve the Actor's performance.
- **Concurrency handling**: the Actor's entrypoint function is optimized to process multiple requests in parallel while managing inputs and outputs efficiently.
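
If you want to sanity-check the container locally before deploying, a standard Docker workflow such as the one below should work. The image name is just a placeholder, and the container is assumed to read `OPENAI_API_KEY` from its environment:

```bash
# Build the image from the repository's Dockerfile
docker build -t web-scraping-ai-agent .

# Run the container, passing the OpenAI key in via the environment
docker run --rm -e OPENAI_API_KEY="sk-..." web-scraping-ai-agent
```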

---

## 🔍 How It Works

1. When you start the Actor, you pass the necessary inputs to the **entrypoint function**. 🚪
2. The **entrypoint function** maps the Apify input parameters to the scraping logic function, handling concurrency efficiently. 🔄
3. The **scraping logic function** uses the inputs (URL, OpenAI model, user prompt) and returns the processed data. 🧠
4. The result is returned by the Actor and can be saved to an Apify dataset or passed on as output for further processing. 📈
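
As an illustration of this flow, here is a minimal sketch of an entrypoint built on the Apify Python SDK. It reuses the hypothetical `scrape_website` helper from the local-run sketch above, and the input field names (`urls`, `prompt`, `model`) are assumptions rather than the Actor's actual input schema:

```python
import asyncio

from apify import Actor

# Hypothetical helper from the scraping sketch above (URL + prompt + model -> extracted data).
from ai_scraper import scrape_website


async def main() -> None:
    async with Actor:
        # Read the run input supplied via the Apify platform (or local storage when run locally).
        actor_input = await Actor.get_input() or {}
        urls = actor_input.get("urls", [])
        prompt = actor_input.get("prompt", "")
        model = actor_input.get("model", "gpt-3.5-turbo")

        # The scraping call is blocking, so run one worker thread per URL to handle requests concurrently.
        results = await asyncio.gather(
            *(asyncio.to_thread(scrape_website, url, prompt, model) for url in urls)
        )

        # Store one record per URL in the Actor's default dataset.
        await Actor.push_data([{"url": u, "result": r} for u, r in zip(urls, results)])


if __name__ == "__main__":
    asyncio.run(main())
```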

---

## 📖 Learn More

- Want to understand why **Apify Actors** are the ideal solution for scalable web scraping? Check out the [Apify Whitepaper](https://whitepaper.actor/) for more insights. 📜

---

This **Web Scraping AI Agent** is perfect for AI-powered data extraction, whether you're conducting research, automating workflows, or gathering business intelligence. With Apify's platform, you can deploy and scale the app with ease. 🚀

---

### Additional Notes for Apify Actor Development

- **Code refactor**: The logic has been separated into a function that accepts inputs and returns outputs, improving scalability and reusability in the Apify Actor environment. 🛠️
- **Concurrency**: The entrypoint function handles multiple requests concurrently, allowing the Actor to scale efficiently. 🔄
- **No Streamlit**: Streamlit has been removed, since the primary focus is on the backend scraping logic, which suits Apify Actor deployment. ❌
