Conversation

MatousMarik
Collaborator

@MatousMarik MatousMarik commented Mar 13, 2025

Polished PR

  • initialize Apify with `apify init`:
    • add .actor/actor.json
    • add .gitignore
  • add Apify requirement
  • convert the app to a function
  • add main entry point
    • run scraper in a separate thread
    • collect inputs from Apify platform
    • save and log results
  • add input_schema
  • add minimalistic Dockerfile
  • update README

- by running apify-cli command: `apify init`
- update .gitignore
- also run scraper in a separate thread
    - SmartScraperGraph calls asyncio.run internally, so it can't be run inside another asyncio.run call
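
The threading workaround above can be illustrated with a minimal sketch. It uses only the standard library; `run_scraper_blocking` is a hypothetical stand-in for a library call that invokes `asyncio.run` internally (as SmartScraperGraph is described to do), not the actual code from this PR.

```python
import asyncio


def run_scraper_blocking(url: str) -> dict:
    """Hypothetical stand-in for a call that uses asyncio.run internally."""

    async def _fetch() -> dict:
        await asyncio.sleep(0)  # placeholder for real async work
        return {"url": url, "status": "ok"}

    # asyncio.run raises RuntimeError if this thread already runs an event loop.
    return asyncio.run(_fetch())


async def main() -> dict:
    # Calling run_scraper_blocking(...) directly here would fail, because
    # main() itself runs inside asyncio.run. A worker thread has no running
    # loop, so the nested asyncio.run is allowed there.
    return await asyncio.to_thread(run_scraper_blocking, "https://example.com")


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.to_thread` requires Python 3.9 or newer; on older versions, `loop.run_in_executor` serves the same purpose.
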
Inspired by our Apify templates, e.g.:
- https://github.com/apify/actor-templates/tree/master/templates/python-playwright
- https://github.com/apify/actor-templates/tree/master/templates/python-start

Features:
- build `minify-html` with Rust
- optimize the Dockerfile with a multi-stage build and a virtual environment to reduce image size
@MatousMarik MatousMarik added the enhancement New feature or request label Mar 13, 2025
@MatousMarik MatousMarik self-assigned this Mar 13, 2025
@MatousMarik
Collaborator Author

Actorification – Web Scraping AI Agent as an Apify Actor 🚀

Hey Shubham,

We love the improvements in this Web Scraping AI Agent! 🚀 This PR fully optimizes the app for deployment on Apify, enabling seamless scaling, automation, and concurrency handling—all while keeping the local version intact. Apify’s Actor model makes it easy to run web scraping tasks in a serverless, managed environment. (Learn more in the Actor Whitepaper).

🔥 What’s in this PR?

This PR enhances the Web Scraping AI Agent by:

  • Refactoring the app to remove Streamlit and focus purely on backend scraping logic.
  • Optimizing for Apify with an entrypoint function that handles input, processes tasks concurrently, and returns structured output.
  • Adding .actor/actor.json for Apify configuration.
  • Implementing a minimal Dockerfile to streamline deployment.
  • Enhancing concurrency support, allowing multiple scraping jobs to run efficiently in parallel.
  • Updating README with Apify deployment steps and removing unnecessary Streamlit references.
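
As a rough illustration of what such an entrypoint might look like, here is a minimal sketch. It is not the code from this PR: the input field names `url` and `prompt` and the helpers `scrape` and `parse_input` are hypothetical, and the Apify SDK for Python (`apify` package) is assumed to be installed when `main` runs.

```python
import asyncio
from typing import Any, Optional


def scrape(url: str, prompt: str) -> dict[str, Any]:
    """Hypothetical stand-in for the refactored scraping function."""
    return {"url": url, "prompt": prompt, "result": f"scraped {url}"}


def parse_input(actor_input: Optional[dict[str, Any]]) -> tuple[str, str]:
    """Validate the (hypothetical) input fields and return them."""
    data = actor_input or {}
    url = data.get("url")
    prompt = data.get("prompt")
    if not url or not prompt:
        raise ValueError("Input must contain non-empty 'url' and 'prompt' fields.")
    return url, prompt


async def main() -> None:
    # Imported lazily so the helpers above work without the Apify SDK installed.
    from apify import Actor

    async with Actor:
        url, prompt = parse_input(await Actor.get_input())
        # Run the blocking scraper off the event-loop thread.
        output = await asyncio.to_thread(scrape, url, prompt)
        Actor.log.info(f"Scraped {url}")
        # Store the structured result in the run's default dataset.
        await Actor.push_data(output)
```

On the platform, the Actor would be started with `asyncio.run(main())`; keeping validation and scraping in plain functions lets them be exercised locally without Apify.
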

🚀 How to Deploy on Apify?

This project is already structured as an Apify Actor, making deployment incredibly simple:

  1. Why Apify? If you need a scalable, serverless solution for web scraping, Apify is perfect. It provides managed cloud infrastructure, built-in scheduling, API integration, and seamless automation.
  2. Create an Apify Account (if you haven’t yet) – Sign up here.
  3. Fork this Repository and push it to your GitHub.
  4. Connect GitHub to Apify – In the Apify Console, go to Actors → Create New → Import from GitHub.
  5. Build and Run the Actor – Apify will handle everything, and you’ll get structured output for your scraping tasks.

📖 Learn more about Actor Development in the Apify Docs.

🏗️ Key Enhancements

  • No more Streamlit – The app is now a fully backend-focused AI scraper, ideal for Apify automation.
  • Improved modularity – Scraping logic is now neatly separated into a function that accepts inputs and returns structured outputs.
  • Concurrent execution – The Apify entrypoint is designed to handle multiple requests efficiently.
  • Lightweight Dockerfile – Optimized for fast Apify deployment.
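
One way concurrent execution could be arranged is sketched below, assuming a blocking scraper function. The names `scrape` and `scrape_all` and the `limit` parameter are illustrative assumptions, not taken from this PR.

```python
import asyncio


def scrape(url: str) -> dict:
    """Hypothetical blocking scraper; a real one would call the scraping library."""
    if not url.startswith("http"):
        raise ValueError(f"unsupported URL: {url}")
    return {"url": url, "ok": True}


async def scrape_all(urls: list[str], limit: int = 3) -> list[dict]:
    """Scrape several URLs in worker threads, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url: str) -> dict:
        async with semaphore:
            try:
                return await asyncio.to_thread(scrape, url)
            except Exception as exc:
                # Report the failure for this URL without cancelling the others.
                return {"url": url, "ok": False, "error": str(exc)}

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


if __name__ == "__main__":
    print(asyncio.run(scrape_all(["https://a.example", "ftp://b.example"])))
```

Capping concurrency with a semaphore keeps memory and rate-limit pressure bounded, and catching errors per URL means one failing job does not discard the results of the rest.
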

🎨 How It Looks After Deployment

Here's how the Apify Actor Console will look once the Web Scraping AI Agent is deployed and running:

1️⃣ Actor Readme in Apify

image

2️⃣ Input Configuration for Web Scraping

image

3️⃣ Logs & Results After Execution

image


🔗 Related PR: Alternative version of this PR


This PR makes it easier than ever to deploy, run, and scale this AI-powered web scraping tool on Apify. Looking forward to your thoughts! 🚀

@MatousMarik
Collaborator Author

@tomasjindra
What do you think about this new message? I won't create a new PR yet; it would look the same...
