A simple FastAPI service that fetches web pages and converts them to clean markdown. The service can be run either locally or deployed to Modal's serverless platform.

EndlessHoper/modalcrawl4ai

URL to Markdown Scraping Service

A simple FastAPI service that fetches web pages and converts them to clean markdown. The service can be run either locally or deployed to Modal's serverless platform.

Features

  • Converts web pages to clean, readable markdown
  • Handles JavaScript-rendered content
  • Configurable viewport settings
  • Retries on failure
  • Available as both a local FastAPI service and a Modal serverless deployment
  • Secure API key authentication for Modal deployment

Setup

Local Development

  1. Make sure you have Python 3.8+ installed
  2. Install dependencies:
    pip install -r requirements.txt
  3. Install browser dependencies for Playwright:
    playwright install chromium
    playwright install-deps chromium

Modal Deployment

  1. Install Modal:
    pip install modal
  2. Set up a Modal account and authenticate:
    modal setup
  3. Configure your API key:
    • Create a secret in the Modal dashboard named "MYAPI" with the key "APIKEY"
    • In your modalscraper.py, add the secret to your function:
      @app.function(
          image=image,
          secrets=[modal.Secret.from_name("MYAPI")]
      )
      @modal.asgi_app()
      def fastapi_app():
          return web_app
    • Access the API key in your code using:
      import os
      api_key = os.environ["APIKEY"]

Running the Service

Local Development

Start the service locally:

python scraper.py

The service will be available at http://localhost:8000. You can access the API documentation at http://localhost:8000/docs.

Modal Deployment

Deploy to Modal's serverless platform:

modal deploy modalscraper.py

After successful deployment, Modal will provide you with a URL where your service is accessible.

API Usage

The service exposes a single endpoint:

GET /scrape?url=<webpage_url>
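Because the target page is passed as a query parameter, a target URL that itself contains characters such as `?`, `&`, or spaces generally needs to be percent-encoded, or its own parameters may be read as parameters of /scrape. A minimal sketch using only the standard library; `build_scrape_url` is a hypothetical helper name, not something provided by this repository:

```python
from urllib.parse import urlencode


def build_scrape_url(base_url: str, target_url: str) -> str:
    """Return the full /scrape request URL with target_url percent-encoded."""
    # urlencode escapes reserved characters ('?', '&', '=', spaces) in the value,
    # so the target URL's own query string stays inside the single "url" parameter.
    query = urlencode({"url": target_url})
    return f"{base_url.rstrip('/')}/scrape?{query}"


if __name__ == "__main__":
    print(build_scrape_url(
        "http://localhost:8000",
        "https://example.com/search?q=markdown&page=2",
    ))
```

HTTP clients that accept a separate parameter mapping, such as the `params=` argument of `requests.get`, perform this encoding automatically.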

Authentication

When using the Modal deployment, include your API key in the request header:

curl -H "Authorization: Bearer YOUR_API_KEY" "https://your-modal-url/scrape?url=https://example.com"

For local development, authentication is disabled by default.

Example Requests

Using Python with Environment Variables

First, install the required packages:

pip install requests python-dotenv

Create a .env file:

API_KEY=your-modal-api-key

Then use this Python script:

import requests
import os
from dotenv import load_dotenv

# Load API key from environment
load_dotenv()
API_KEY = os.getenv("API_KEY")

# Configuration
BASE_URL = "https://example-app.modal.run"
TEST_URL = "https://example.com"

# Set up headers with authentication
headers = {
    "Authorization": f"Bearer {API_KEY}"
}

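# Make the request (the timeout value is an arbitrary choice; adjust as needed)
response = requests.get(
    f"{BASE_URL}/scrape",
    headers=headers,
    params={"url": TEST_URL},
    timeout=60,
)

# Raise an exception on 4xx/5xx responses instead of failing later on a missing key
response.raise_for_status()

# Print the markdown content
result = response.json()
print(result["markdown"])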

Using Command Line

Local development:

# Using curl (bash/cmd)
curl "http://localhost:8000/scrape?url=https://example.com"

# Using PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/scrape?url=https://example.com"

Modal deployment:

# Using curl (bash/cmd)
curl -H "Authorization: Bearer YOUR_API_KEY" "https://your-modal-url/scrape?url=https://example.com"

# Using PowerShell
$headers = @{
    "Authorization" = "Bearer YOUR_API_KEY"
}
Invoke-RestMethod -Uri "https://your-modal-url/scrape?url=https://example.com" -Headers $headers

The response will be JSON containing the markdown conversion of the webpage:

{
    "markdown": "# Example Domain\n\nThis domain is..."
}
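On the client side, the response body can be parsed and checked before use. The sketch below assumes only the response shape shown above (a JSON object with a "markdown" field); `extract_markdown` is a hypothetical helper name, and the exact shape of error bodies is not documented here, so any body without that field is simply rejected:

```python
import json


def extract_markdown(body: str) -> str:
    """Parse a /scrape response body and return its "markdown" field."""
    payload = json.loads(body)
    # Anything other than an object carrying "markdown" is treated as a failure.
    if not isinstance(payload, dict) or "markdown" not in payload:
        raise ValueError("response JSON has no 'markdown' field")
    return payload["markdown"]


if __name__ == "__main__":
    sample = '{"markdown": "# Example Domain\\n\\nThis domain is..."}'
    print(extract_markdown(sample))
```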

Project Structure

  • scraper.py - Local FastAPI service implementation
  • modalscraper.py - Modal serverless implementation
  • requirements.txt - Python dependencies

Error Handling

The service returns one of the following status codes:

  • 200 OK - Successfully scraped and converted page
  • 400 Bad Request - Failed to extract content after multiple attempts
  • 401 Unauthorized - Invalid or missing API key. This can happen if:
    • The Authorization header is missing
    • The header value is malformed (expected "Bearer YOUR_API_KEY")
    • The provided API key doesn't match the one configured in Modal
  • 500 Internal Server Error - Server-side errors, including:
    • API key not configured on server
    • Unexpected errors during processing
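The authentication outcomes listed above could be produced by a check along the following lines. This is a sketch under stated assumptions, not the code in modalscraper.py: `check_api_key` is a hypothetical function, it reads the `APIKEY` environment variable described in the setup section, and it returns a plain status code rather than raising a framework exception so that it runs with the standard library alone.

```python
import hmac
import os
from typing import Optional


def check_api_key(authorization: Optional[str]) -> int:
    """Return the HTTP status for an Authorization header (200 means accepted)."""
    expected = os.environ.get("APIKEY")
    if not expected:
        return 500  # API key not configured on the server
    if not authorization:
        return 401  # Authorization header is missing
    scheme, _, token = authorization.partition(" ")
    if scheme != "Bearer" or not token:
        return 401  # header is not of the form "Bearer <key>"
    # Constant-time comparison avoids leaking key contents through timing.
    if not hmac.compare_digest(token.encode(), expected.encode()):
        return 401  # key does not match the configured one
    return 200


if __name__ == "__main__":
    os.environ["APIKEY"] = "demo-key"  # illustrative value only
    print(check_api_key("Bearer demo-key"))
    print(check_api_key(None))
```

In a FastAPI application the same logic would typically live in a dependency that raises `HTTPException` with the corresponding status code.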

Contributing

Feel free to open issues or submit pull requests for improvements.
