A comprehensive Python-based scraper for extracting and analyzing Meta's community standards policies from their transparency website. Available both as standalone scripts and as Azure Functions for serverless deployment.
- Multiple Output Formats: JSON, plain text, and structured data
- Azure Functions Support: Serverless HTTP API endpoints
- Comprehensive Coverage: All 27+ Meta community standards sections
- Structured Data: Organized headings, paragraphs, lists, and links
- Error Handling: Robust error handling and retry mechanisms (a retry sketch follows this list)
- Local & Cloud: Run locally or deploy to Azure
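A minimal sketch of the kind of retry logic behind the error-handling feature, assuming the standard `requests` library; the function name and back-off values are illustrative, not the exact implementation:

```python
import time

import requests


def fetch_with_retries(url: str, retries: int = 3, delay: float = 2.0) -> requests.Response:
    """Fetch a URL, retrying on transient network errors."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # Treat HTTP 4xx/5xx as failures too
            return response
        except requests.RequestException as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # Back off a little more each attempt
    raise last_error
```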
- `meta_scraper_json.py` - Original script with JSON output
- `meta_scraper_updated.py` - Enhanced version with better parsing
- `simple_meta_scraper.py` - Lightweight version
- `function_app.py` - Azure Functions version (recommended)
- `function_app.py` - Main Azure Function with HTTP endpoints
- `host.json` - Azure Functions host configuration
- `local.settings.json` - Local development settings
- `deploy.ps1` - PowerShell deployment script
- `AZURE_FUNCTION_README.md` - Detailed Azure Functions documentation
- `THIRD_PARTY_LICENSES.md` - License information for dependencies
- `test_azure_function.py` - Test script for Azure Functions
- `test_setup.py` - Environment testing
- `debug_requests.py` - Network debugging utilities
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Run locally:

   ```bash
   func start
   ```

3. Test the endpoints:

   ```bash
   # Get all sections
   curl "http://localhost:7071/api/meta_scraper"

   # Get specific sections
   curl "http://localhost:7071/api/meta_scraper?sections=Spam,Misinformation"

   # Get single section
   curl "http://localhost:7071/api/meta_scraper_single?section=Spam"
   ```
To run the standalone scraper:

```bash
python meta_scraper_json.py
```
Scrape multiple sections with flexible options.
Parameters:
- `sections` (optional): Comma-separated section names
- `include_main` (optional): Include main page (default: true)
- `format` (optional): "json" or "summary" (default: "json")
Examples:
```
GET /api/meta_scraper
GET /api/meta_scraper?sections=Spam,Misinformation&format=summary
GET /api/meta_scraper?include_main=false
```
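The same calls can be made from Python. A minimal sketch, assuming the function app is running on the local host and port from the quick start:

```python
import requests

BASE_URL = "http://localhost:7071/api"  # Local Functions host from the quick start

# Fetch two sections in summary format
response = requests.get(
    f"{BASE_URL}/meta_scraper",
    params={"sections": "Spam,Misinformation", "format": "summary"},
)
response.raise_for_status()
print(response.json())
```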
Scrape a single section.
Parameters:
- `section` (required): Section name to scrape
- `url` (optional): Custom URL to scrape
Examples:
```
GET /api/meta_scraper_single?section=Spam
GET /api/meta_scraper_single?section=Custom&url=https://example.com/policy
```
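The single-section endpoint works the same way, under the same local-host assumption:

```python
import requests

BASE_URL = "http://localhost:7071/api"  # Local Functions host from the quick start

# Fetch one named section
response = requests.get(f"{BASE_URL}/meta_scraper_single", params={"section": "Spam"})
response.raise_for_status()
print(response.json())
```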
- Coordinating Harm and Promoting Crime
- Dangerous Organisations and Individuals
- Fraud, Scams and Deceptive Practices
- Restricted Goods and Services
- Violence and Incitement
- Adult Sexual Exploitation
- Bullying and Harassment
- Child Sexual Exploitation, Abuse and Nudity
- Human Exploitation
- Suicide, Self-Injury and Eating Disorders
- Adult Nudity and Sexual Activity
- Adult Sexual Solicitation and Sexually Explicit Language
- Hateful Conduct
- Privacy Violations
- Violent and Graphic Content
- Account Integrity
- Authentic Identity Representation
- Cybersecurity
- Inauthentic Behavior
- Memorialisation
- Misinformation
- Spam
- Third-Party Intellectual Property Infringement
- Using Meta Intellectual Property and Licences
- Additional Protection of Minors
- Locally Illegal Content, Products or Services
- User Requests
- Azure CLI
- Azure Functions Core Tools
- Azure subscription
```powershell
# Login to Azure
az login

# Run deployment script
.\deploy.ps1
```
```bash
# Create resource group
az group create --name rg-meta-scraper --location "East US"

# Create storage account
az storage account create --name metascraperstorage --location "East US" --resource-group rg-meta-scraper --sku Standard_LRS

# Create function app
az functionapp create --resource-group rg-meta-scraper --consumption-plan-location "East US" --runtime python --runtime-version 3.11 --functions-version 4 --name meta-scraper-function --storage-account metascraperstorage --os-type Linux

# Deploy
func azure functionapp publish meta-scraper-function
```
```json
{
  "scraping_session": {
    "timestamp": "2025-07-30T10:30:00",
    "total_sections": 5,
    "successful_sections": 4,
    "failed_sections": 1,
    "success_rate": 80.0
  },
  "data": {
    "sections": {
      "Spam": {
        "metadata": {
          "section_name": "Spam",
          "url": "https://...",
          "scraped_at": "2025-07-30T10:30:00",
          "status": "success"
        },
        "content": {
          "title": "Spam",
          "raw_text": "...",
          "structured_content": {
            "headings": [...],
            "paragraphs": [...],
            "lists": [...],
            "links": [...]
          }
        },
        "statistics": {
          "character_count": 8500,
          "word_count": 1200,
          "paragraph_count": 25,
          "heading_count": 5
        }
      }
    }
  }
}
```
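The structure can be walked with a few lines of Python. This sketch uses the field names from the example above and a hypothetical `response.json` file holding a saved response:

```python
import json

# Hypothetical file containing a saved API response
with open("response.json") as f:
    payload = json.load(f)

session = payload["scraping_session"]
print(f"Success rate: {session['success_rate']}% "
      f"({session['successful_sections']}/{session['total_sections']})")

for name, section in payload["data"]["sections"].items():
    if section["metadata"]["status"] == "success":
        stats = section["statistics"]
        print(f"{name}: {stats['word_count']} words, {stats['heading_count']} headings")
```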
Run the test suite:
```bash
# Test Azure Functions locally
python test_azure_function.py

# Test environment setup
python test_setup.py
```
This project uses several open-source libraries. See `THIRD_PARTY_LICENSES.md` for complete license information and attributions.
- Beautiful Soup (MIT License) - HTML parsing
- Requests (Apache 2.0) - HTTP requests
- Azure Functions (MIT License) - Serverless framework
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Rate Limiting: Be respectful of Meta's servers
- Terms of Service: Ensure compliance with Meta's terms
- Data Usage: Review Meta's data usage policies
- Timeout: Azure Functions have a 10-minute timeout limit
- 403 Errors: Update the User-Agent string (see the snippet after this list)
- Timeout Issues: Reduce number of sections per request
- Memory Issues: Use summary format for large datasets
- Cold Start: Azure Functions may have initial delays
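For the 403 case, sending a browser-like User-Agent header usually helps. A minimal sketch with `requests`; the header value is only an example, substitute any current browser UA string:

```python
import requests

# Example browser-like header; any current browser UA string works
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(
    "https://transparency.meta.com/policies/community-standards/",
    headers=headers,
    timeout=30,
)
print(response.status_code)  # Expect 200 once the header is accepted
```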
- Check Azure Function logs
- Verify target URLs are accessible
- Monitor for rate limiting
- Review error messages in responses
When deployed to Azure:
- Use Application Insights for monitoring
- Track success rates and performance
- Set up alerts for failures
- Monitor function execution times
⭐ If this project helps you, please consider giving it a star!
- Download all files from this repository to a folder on your computer
- Remember where you saved them!
- Open Command Prompt (Windows) or Terminal (Mac/Linux)
- Navigate to the folder where you saved the files
- Type: `pip install -r requirements.txt`
- Press Enter and wait for it to finish
- In the same Command Prompt/Terminal, type: `python meta_scraper_updated.py`
- Press Enter
- Wait for it to finish (takes about 2-3 minutes)
- You'll see progress updates as it downloads each section
After the scraper finishes, you'll have a new folder called `meta_standards_output` with these files:
- `summary.txt` - Your main report! Shows what was downloaded successfully
- `meta_standards_summary.json` - Technical version (you can ignore this)
- `00_main_page.txt` - Meta's main Community Standards page
- `01_Coordinating_Harm_and_Promoting_Crime.txt` - First policy section
- `02_Dangerous_Organisations_and_Individuals.txt` - Second policy section
- ...and so on through `27_User_Requests.txt`
1. Start with `summary.txt` - This tells you:
   - How many sections were downloaded successfully
   - Which sections (if any) failed to download
   - Success rate percentage

2. Browse the numbered files - Each contains:
   - The full text of one policy section
   - The official title from Meta's website
   - The web address where it came from
   - How many characters of content were captured
When you open `summary.txt`, you'll see something like:
```text
META COMMUNITY STANDARDS SCRAPING SUMMARY
============================================================
Total sections attempted: 27
Successfully scraped: 27
Failed: 0
Success rate: 100.0%

SUCCESSFUL SECTIONS:
------------------------------
✓ Coordinating Harm and Promoting Crime (13,344 chars)
✓ Dangerous Organisations and Individuals (24,009 chars)
...
```
- Total sections attempted: How many policy sections the tool tried to download
- Successfully scraped: How many actually worked
- Success rate: Percentage that worked (100% is perfect!)
- Character count: Shows how much content was captured (larger counts generally indicate a more complete capture)
The tool downloads all 27 sections of Meta's Community Standards:
- Coordinating Harm and Promoting Crime - Rules about planning illegal activities
- Dangerous Organisations and Individuals - Policies on terrorists, criminals, etc.
- Fraud, Scams and Deceptive Practices - Rules against scams and fake schemes
- Restricted Goods and Services - What you can't sell on Meta platforms
- Violence and Incitement - Rules about violent content and threats
- Adult Sexual Exploitation - Policies protecting adults from sexual exploitation
- Bullying and Harassment - Rules against bullying and harassment
- Child Sexual Exploitation, Abuse and Nudity - Strong protections for children
- Human Exploitation - Rules against human trafficking and exploitation
- Suicide, Self-Injury and Eating Disorders - Mental health protection policies
- Adult Nudity and Sexual Activity - Rules about adult content
- Adult Sexual Solicitation and Sexually Explicit Language - Sexual conduct rules
- Hateful Conduct - Policies against hate speech and discrimination
- Privacy Violations - Rules protecting people's privacy
- Violent and Graphic Content - Policies on disturbing visual content
- Account Integrity - Rules about authentic accounts
- Authentic Identity Representation - Requirements for real identity
- Cybersecurity - Protection against hacking and cyber threats
- Inauthentic Behavior - Rules against fake engagement and manipulation
- Memorialisation - Policies for accounts of deceased users
- Misinformation - Rules against false information
- Spam - Policies against unwanted content and messages
- Third-Party Intellectual Property Infringement - Copyright protection rules
- Using Meta Intellectual Property and Licences - Rules about using Meta's content
- Additional Protection of Minors - Extra safety measures for young users
- Locally Illegal Content, Products or Services - Country-specific legal requirements
- User Requests - How Meta handles user reports and requests
Problem: Python isn't installed properly
Solution:
- Reinstall Python from python.org
- Make sure to check "Add Python to PATH" during installation
- Restart your computer
Problem: Python package manager isn't working
Solution: Try `python -m pip install -r requirements.txt` instead
Problem: Meta might have changed their website
Solution: This is normal - the tool will get most sections even if a few fail
Problem: The tool seems slow
Solution: This is normal! It waits 2 seconds between downloads to be respectful, so a full run takes 2-3 minutes
Problem: Meta's website might be temporarily down or changed
Solution: Try running the tool again later
- Uses the `requests` library for web downloads
- Uses `BeautifulSoup` for reading website content
- Waits 2 seconds between downloads to be respectful to Meta's servers
- Saves content in both JSON (for programs) and TXT (for humans) formats
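Putting those pieces together, the core loop looks roughly like this. This is a sketch, not the exact code in `meta_scraper_updated.py`; the section dictionary, URL, and file naming are illustrative:

```python
import json
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Illustrative subset; the real script covers all 27 sections
SECTIONS = {
    "Spam": "https://transparency.meta.com/policies/community-standards/spam/",
}

output_dir = Path("meta_standards_output")
output_dir.mkdir(exist_ok=True)

for index, (name, url) in enumerate(SECTIONS.items(), start=1):
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n", strip=True)

    # TXT for humans, JSON for programs
    stem = f"{index:02d}_{name.replace(' ', '_')}"
    (output_dir / f"{stem}.txt").write_text(text, encoding="utf-8")
    (output_dir / f"{stem}.json").write_text(
        json.dumps({"section": name, "url": url, "text": text}, indent=2),
        encoding="utf-8",
    )

    time.sleep(2)  # Be respectful to Meta's servers
```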
You can modify `meta_scraper_updated.py` to:
- Change which sections to download
- Adjust the delay time between downloads
- Modify the output format
- Add new sections if Meta creates them
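For example, assuming the script defines module-level constants near the top (the names here are hypothetical; the actual variables in the script may differ):

```python
# Hypothetical constants in meta_scraper_updated.py
REQUEST_DELAY_SECONDS = 2            # Increase to be even gentler on Meta's servers
OUTPUT_DIR = "meta_standards_output"

# Drop or add entries to control which sections are downloaded
SECTION_NAMES = [
    "Spam",
    "Misinformation",
    # "Hateful Conduct",  # Commented out to skip this section
]
```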
- `meta_scraper_updated.py` - The main tool (this is what you run)
- `requirements.txt` - List of components Python needs to install
- `README.md` - This instruction file
- Other files - Various versions and test files (you can ignore these)
✅ This tool is designed for:
- Educational research
- Policy analysis
- Academic study
- Personal reference
⚠️ Important notes:
- Only downloads publicly available information
- Respects Meta's servers with reasonable delays
- Downloads from Meta's official transparency pages
- Does not bypass any security or access controls
- Don't run this tool excessively (once per day maximum)
- Respect the content and don't redistribute without permission
- Check Meta's terms of service for any restrictions
- Use the downloaded content ethically and legally
If you encounter problems:
- Check the troubleshooting section above first
- Make sure you have a stable internet connection
- Try running the tool again (sometimes temporary network issues occur)
- Check that you're using Python 3.8 or newer
The tool is designed to be robust and handle most common issues automatically.
Last Updated: July 2025
Tool Version: 2.0 (Individual file output with 100% success rate)
Tested On: Windows 10/11, macOS, Linux