A comprehensive Python-based scraper for extracting and analyzing Meta's community standards policies from their transparency website. Available both as standalone scripts and as Azure Functions for serverless deployment.
- Multiple Output Formats: JSON, plain text, and structured data
- Azure Functions Support: Serverless HTTP API endpoints
- Comprehensive Coverage: All 27+ Meta community standards sections
- Structured Data: Organized headings, paragraphs, lists, and links
- Error Handling: Robust error handling and retry mechanisms (a retry sketch follows this list)
- Local & Cloud: Run locally or deploy to Azure
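A minimal sketch of the kind of retry logic behind the error-handling feature, assuming the standard `requests` library; the function name and back-off values are illustrative, not the exact implementation:

```python
import time

import requests


def fetch_with_retries(url: str, retries: int = 3, delay: float = 2.0) -> requests.Response:
    """Fetch a URL, retrying on transient network errors."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # Treat HTTP 4xx/5xx as failures too
            return response
        except requests.RequestException as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # Back off a little more each attempt
    raise last_error
```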
- `meta_scraper_json.py` - Original script with JSON output
- `meta_scraper_updated.py` - Enhanced version with better parsing
- `simple_meta_scraper.py` - Lightweight version
- `function_app.py` - Azure Functions version (recommended)
- `function_app.py` - Main Azure Function with HTTP endpoints
- `host.json` - Azure Functions host configuration
- `local.settings.json` - Local development settings
- `deploy.ps1` - PowerShell deployment script
- `AZURE_FUNCTION_README.md` - Detailed Azure Functions documentation
- `THIRD_PARTY_LICENSES.md` - License information for dependencies
- `test_azure_function.py` - Test script for Azure Functions
- `test_setup.py` - Environment testing
- `debug_requests.py` - Network debugging utilities
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Run locally:

   ```bash
   func start
   ```

3. Test the endpoints:

   ```bash
   # Get all sections
   curl "http://localhost:7071/api/meta_scraper"

   # Get specific sections
   curl "http://localhost:7071/api/meta_scraper?sections=Spam,Misinformation"

   # Get single section
   curl "http://localhost:7071/api/meta_scraper_single?section=Spam"
   ```
To run the standalone scraper:

```bash
python meta_scraper_json.py
```
Scrape multiple sections with flexible options.
Parameters:
- `sections` (optional): Comma-separated section names
- `include_main` (optional): Include main page (default: true)
- `format` (optional): "json" or "summary" (default: "json")
Examples:
```
GET /api/meta_scraper
GET /api/meta_scraper?sections=Spam,Misinformation&format=summary
GET /api/meta_scraper?include_main=false
```
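The same calls can be made from Python. A minimal sketch, assuming the function app is running on the local host and port from the quick start:

```python
import requests

BASE_URL = "http://localhost:7071/api"  # Local Functions host from the quick start

# Fetch two sections in summary format
response = requests.get(
    f"{BASE_URL}/meta_scraper",
    params={"sections": "Spam,Misinformation", "format": "summary"},
)
response.raise_for_status()
print(response.json())
```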
Scrape a single section.
Parameters:
- `section` (required): Section name to scrape
- `url` (optional): Custom URL to scrape
Examples:
```
GET /api/meta_scraper_single?section=Spam
GET /api/meta_scraper_single?section=Custom&url=https://example.com/policy
```
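The single-section endpoint works the same way, under the same local-host assumption:

```python
import requests

BASE_URL = "http://localhost:7071/api"  # Local Functions host from the quick start

# Fetch one named section
response = requests.get(f"{BASE_URL}/meta_scraper_single", params={"section": "Spam"})
response.raise_for_status()
print(response.json())
```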
- Coordinating Harm and Promoting Crime
- Dangerous Organisations and Individuals
- Fraud, Scams and Deceptive Practices
- Restricted Goods and Services
- Violence and Incitement
- Adult Sexual Exploitation
- Bullying and Harassment
- Child Sexual Exploitation, Abuse and Nudity
- Human Exploitation
- Suicide, Self-Injury and Eating Disorders
- Adult Nudity and Sexual Activity
- Adult Sexual Solicitation and Sexually Explicit Language
- Hateful Conduct
- Privacy Violations
- Violent and Graphic Content
- Account Integrity
- Authentic Identity Representation
- Cybersecurity
- Inauthentic Behavior
- Memorialisation
- Misinformation
- Spam
- Third-Party Intellectual Property Infringement
- Using Meta Intellectual Property and Licences
- Additional Protection of Minors
- Locally Illegal Content, Products or Services
- User Requests
- Azure CLI
- Azure Functions Core Tools
- Azure subscription
```powershell
# Login to Azure
az login

# Run deployment script
.\deploy.ps1
```
```bash
# Create resource group
az group create --name rg-meta-scraper --location "East US"

# Create storage account
az storage account create --name metascraperstorage --location "East US" --resource-group rg-meta-scraper --sku Standard_LRS

# Create function app
az functionapp create --resource-group rg-meta-scraper --consumption-plan-location "East US" --runtime python --runtime-version 3.11 --functions-version 4 --name meta-scraper-function --storage-account metascraperstorage --os-type Linux

# Deploy
func azure functionapp publish meta-scraper-function
```
```json
{
  "scraping_session": {
    "timestamp": "2025-07-30T10:30:00",
    "total_sections": 5,
    "successful_sections": 4,
    "failed_sections": 1,
    "success_rate": 80.0
  },
  "data": {
    "sections": {
      "Spam": {
        "metadata": {
          "section_name": "Spam",
          "url": "https://...",
          "scraped_at": "2025-07-30T10:30:00",
          "status": "success"
        },
        "content": {
          "title": "Spam",
          "raw_text": "...",
          "structured_content": {
            "headings": [...],
            "paragraphs": [...],
            "lists": [...],
            "links": [...]
          }
        },
        "statistics": {
          "character_count": 8500,
          "word_count": 1200,
          "paragraph_count": 25,
          "heading_count": 5
        }
      }
    }
  }
}
```
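The structure can be walked with a few lines of Python. This sketch uses the field names from the example above and a hypothetical `response.json` file holding a saved response:

```python
import json

# Hypothetical file containing a saved API response
with open("response.json") as f:
    payload = json.load(f)

session = payload["scraping_session"]
print(f"Success rate: {session['success_rate']}% "
      f"({session['successful_sections']}/{session['total_sections']})")

for name, section in payload["data"]["sections"].items():
    if section["metadata"]["status"] == "success":
        stats = section["statistics"]
        print(f"{name}: {stats['word_count']} words, {stats['heading_count']} headings")
```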
Run the test suite:
```bash
# Test Azure Functions locally
python test_azure_function.py

# Test environment setup
python test_setup.py
```
This project uses several open-source libraries. See `THIRD_PARTY_LICENSES.md` for complete license information and attributions.
- Beautiful Soup (MIT License) - HTML parsing
- Requests (Apache 2.0) - HTTP requests
- Azure Functions (MIT License) - Serverless framework
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Rate Limiting: Be respectful of Meta's servers
- Terms of Service: Ensure compliance with Meta's terms
- Data Usage: Review Meta's data usage policies
- Timeout: Azure Functions have a 10-minute timeout limit
- 403 Errors: Update the User-Agent string (see the snippet after this list)
- Timeout Issues: Reduce number of sections per request
- Memory Issues: Use summary format for large datasets
- Cold Start: Azure Functions may have initial delays
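For the 403 case, sending a browser-like User-Agent header usually helps. A minimal sketch with `requests`; the header value is only an example, substitute any current browser UA string:

```python
import requests

# Example browser-like header; any current browser UA string works
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

response = requests.get(
    "https://transparency.meta.com/policies/community-standards/",
    headers=headers,
    timeout=30,
)
print(response.status_code)  # Expect 200 once the header is accepted
```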
- Check Azure Function logs
- Verify target URLs are accessible
- Monitor for rate limiting
- Review error messages in responses
When deployed to Azure:
- Use Application Insights for monitoring
- Track success rates and performance
- Set up alerts for failures
- Monitor function execution times
⭐ If this project helps you, please consider giving it a star!
- Download all files from this repository to a folder on your computer
- Remember where you saved them!
- Open Command Prompt (Windows) or Terminal (Mac/Linux)
- Navigate to the folder where you saved the files
- Type: `pip install -r requirements.txt`
- Press Enter and wait for it to finish
- In the same Command Prompt/Terminal, type: `python meta_scraper_updated.py`
- Press Enter
- Wait for it to finish (takes about 2-3 minutes)
- You'll see progress updates as it downloads each section
After the scraper finishes, you'll have a new folder called `meta_standards_output` with these files:
- `summary.txt` - Your main report! Shows what was downloaded successfully
- `meta_standards_summary.json` - Technical version (you can ignore this)
- `00_main_page.txt` - Meta's main Community Standards page
- `01_Coordinating_Harm_and_Promoting_Crime.txt` - First policy section
- `02_Dangerous_Organisations_and_Individuals.txt` - Second policy section
- ...and so on through `27_User_Requests.txt`
1. Start with `summary.txt` - This tells you:
   - How many sections were downloaded successfully
   - Which sections (if any) failed to download
   - Success rate percentage

2. Browse the numbered files - Each contains:
   - The full text of one policy section
   - The official title from Meta's website
   - The web address where it came from
   - How many characters of content were captured
When you open `summary.txt`, you'll see something like:
```text
META COMMUNITY STANDARDS SCRAPING SUMMARY
============================================================
Total sections attempted: 27
Successfully scraped: 27
Failed: 0
Success rate: 100.0%

SUCCESSFUL SECTIONS:
------------------------------
✓ Coordinating Harm and Promoting Crime (13,344 chars)
✓ Dangerous Organisations and Individuals (24,009 chars)
...
```
- Total sections attempted: How many policy sections the tool tried to download
- Successfully scraped: How many actually worked
- Success rate: Percentage that worked (100% is perfect!)
- Character count: Shows how much content was captured (larger counts generally indicate a more complete capture)
The tool downloads all 27 sections of Meta's Community Standards:
- Coordinating Harm and Promoting Crime - Rules about planning illegal activities
- Dangerous Organisations and Individuals - Policies on terrorists, criminals, etc.
- Fraud, Scams and Deceptive Practices - Rules against scams and fake schemes
- Restricted Goods and Services - What you can't sell on Meta platforms
- Violence and Incitement - Rules about violent content and threats
- Adult Sexual Exploitation - Policies protecting adults from sexual exploitation
- Bullying and Harassment - Rules against bullying and harassment
- Child Sexual Exploitation, Abuse and Nudity - Strong protections for children
- Human Exploitation - Rules against human trafficking and exploitation
- Suicide, Self-Injury and Eating Disorders - Mental health protection policies
- Adult Nudity and Sexual Activity - Rules about adult content
- Adult Sexual Solicitation and Sexually Explicit Language - Sexual conduct rules
- Hateful Conduct - Policies against hate speech and discrimination
- Privacy Violations - Rules protecting people's privacy
- Violent and Graphic Content - Policies on disturbing visual content
- Account Integrity - Rules about authentic accounts
- Authentic Identity Representation - Requirements for real identity
- Cybersecurity - Protection against hacking and cyber threats
- Inauthentic Behavior - Rules against fake engagement and manipulation
- Memorialisation - Policies for accounts of deceased users
- Misinformation - Rules against false information
- Spam - Policies against unwanted content and messages
- Third-Party Intellectual Property Infringement - Copyright protection rules
- Using Meta Intellectual Property and Licences - Rules about using Meta's content
- Additional Protection of Minors - Extra safety measures for young users
- Locally Illegal Content, Products or Services - Country-specific legal requirements
- User Requests - How Meta handles user reports and requests
Problem: Python isn't installed properly
Solution:
- Reinstall Python from python.org
- Make sure to check "Add Python to PATH" during installation
- Restart your computer
Problem: Python package manager isn't working
Solution: Try `python -m pip install -r requirements.txt` instead
Problem: Meta might have changed their website
Solution: This is normal - the tool will get most sections even if a few fail
Problem: The tool seems slow
Solution: This is normal! It waits 2 seconds between downloads to be respectful, so a full run takes 2-3 minutes
Problem: Meta's website might be temporarily down or changed
Solution: Try running the tool again later
- Uses the `requests` library for web downloads
- Uses `BeautifulSoup` for reading website content
- Waits 2 seconds between downloads to be respectful to Meta's servers
- Saves content in both JSON (for programs) and TXT (for humans) formats
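Putting those pieces together, the core loop looks roughly like this. This is a sketch, not the exact code in `meta_scraper_updated.py`; the section dictionary, URL, and file naming are illustrative:

```python
import json
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Illustrative subset; the real script covers all 27 sections
SECTIONS = {
    "Spam": "https://transparency.meta.com/policies/community-standards/spam/",
}

output_dir = Path("meta_standards_output")
output_dir.mkdir(exist_ok=True)

for index, (name, url) in enumerate(SECTIONS.items(), start=1):
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n", strip=True)

    # TXT for humans, JSON for programs
    stem = f"{index:02d}_{name.replace(' ', '_')}"
    (output_dir / f"{stem}.txt").write_text(text, encoding="utf-8")
    (output_dir / f"{stem}.json").write_text(
        json.dumps({"section": name, "url": url, "text": text}, indent=2),
        encoding="utf-8",
    )

    time.sleep(2)  # Be respectful to Meta's servers
```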
You can modify `meta_scraper_updated.py` to:
- Change which sections to download
- Adjust the delay time between downloads
- Modify the output format
- Add new sections if Meta creates them
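For example, assuming the script defines module-level constants near the top (the names here are hypothetical; the actual variables in the script may differ):

```python
# Hypothetical constants in meta_scraper_updated.py
REQUEST_DELAY_SECONDS = 2            # Increase to be even gentler on Meta's servers
OUTPUT_DIR = "meta_standards_output"

# Drop or add entries to control which sections are downloaded
SECTION_NAMES = [
    "Spam",
    "Misinformation",
    # "Hateful Conduct",  # Commented out to skip this section
]
```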
- `meta_scraper_updated.py` - The main tool (this is what you run)
- `requirements.txt` - List of components Python needs to install
- `README.md` - This instruction file
- Other files - Various versions and test files (you can ignore these)
✅ This tool is designed for:
- Educational research
- Policy analysis
- Academic study
- Personal reference
⚠️ Important notes:
- Only downloads publicly available information
- Respects Meta's servers with reasonable delays
- Downloads from Meta's official transparency pages
- Does not bypass any security or access controls
- Don't run this tool excessively (once per day maximum)
- Respect the content and don't redistribute without permission
- Check Meta's terms of service for any restrictions
- Use the downloaded content ethically and legally
If you encounter problems:
- Check the troubleshooting section above first
- Make sure you have a stable internet connection
- Try running the tool again (sometimes temporary network issues occur)
- Check that you're using Python 3.8 or newer
The tool is designed to be robust and handle most common issues automatically.
Last Updated: July 2025
Tool Version: 2.0 (Individual file output with 100% success rate)
Tested On: Windows 10/11, macOS, Linux