This guide explains how to populate and maintain the Law7 database with legal documents from official Russian government sources.
The Law7 data pipeline uses only official Russian government sources for legal documents:
| Source | URL | Purpose | Documents |
|---|---|---|---|
| pravo.gov.ru | http://pravo.gov.ru/ | Official Russian legal publication portal (primary API source) | Federal laws, presidential decrees, government resolutions |
| kremlin.ru | http://kremlin.ru/ | Presidential administration website | Constitution, presidential decrees (KONST_RF) |
| government.ru | http://government.ru/ | Government website | Government resolutions, procedure codes (APK_RF, GPK_RF, UPK_RF) |
pravo.gov.ru (Primary)
- Official portal for legal publications
- Provides REST API for document access
- Contains federal laws, decrees, resolutions from 2011-present
- Document metadata includes: type, number, date, signing authority
- API endpoint:
http://publication.pravo.gov.ru/api
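The endpoint above can be exercised with a short client sketch. Only the base URL comes from this guide; the `/Documents` path and the `index`/`pageSize` parameter names below are illustrative assumptions, not documented API.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "http://publication.pravo.gov.ru/api"

def build_url(page: int, page_size: int = 30) -> str:
    # The /Documents path and parameter names are assumptions;
    # page_size is capped at 30 to match the API batch limit noted later.
    query = urllib.parse.urlencode({"index": page, "pageSize": min(page_size, 30)})
    return f"{API_BASE}/Documents?{query}"

def fetch_documents(page: int = 1) -> dict:
    """Fetch one page of document metadata from the portal API."""
    with urllib.request.urlopen(build_url(page), timeout=30) as resp:
        return json.load(resp)
```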
kremlin.ru
- Source for the Russian Constitution
- Presidential decrees and orders
- Used for importing KONST_RF code
government.ru
- Government resolutions and orders
- Procedural codes (arbitration, civil procedure, criminal procedure)
- Used for importing APK_RF, GPK_RF, UPK_RF codes
All documents are:
- Sourced from official government websites only
- Hashed for verification
- Timestamped with download date
- Cross-referenced with permanent URLs
- Updated as amendments are published
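The verification steps above can be sketched as a small helper; the dict keys here are illustrative, not the project's actual schema.

```python
import hashlib
from datetime import datetime, timezone

def integrity_record(content: bytes, permanent_url: str) -> dict:
    """Hash, timestamp, and cross-reference a downloaded document,
    as described above (hypothetical field names)."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),    # hash for verification
        "downloaded_at": datetime.now(timezone.utc).isoformat(),  # download timestamp
        "permanent_url": permanent_url,                   # cross-reference to source
    }
```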
For AI assistants via MCP: You only need these 3 steps to get started.
# 0. Setup MCP settings for AI
# 1. Start Docker services (PostgreSQL, Qdrant, Redis)
cd docker && docker-compose up -d
# 2. Import base legal codes (~6 hours, ~6,700 articles across 23 codes)
poetry run python scripts/import/import_base_code.py --all
# 3. Build and start the MCP server
npm run build
npm start

What you get after these 3 steps:
- ✅ 23 consolidated Russian legal codes (Civil, Labor, Criminal, etc.)
- ✅ `get-code-structure` - Browse code structure and articles
- ✅ `get-article-version` - Get specific article text
- ✅ `get-statistics` - Database statistics
Optional: For full semantic search across 157k+ amendment documents:
# Step 4: Fetch amendment documents from API (~6 hours for 2022-2026)
poetry run python scripts/sync/initial_sync.py --start-date 2022-01-01
# Step 5: Generate embeddings for semantic search (~2-3 hours)
poetry run python scripts/sync/content_sync.py --recreate-collection

After Steps 4-5, you also get:
- ✅ `query-laws` - Semantic search across all documents
The data pipeline consists of three main stages:
Quick Start Order (recommended for first-time users):
Step 1: Import Base Codes (import_base_code.py) → Import 23 legal codes (~6 hours, ~6,700 articles)
Step 2: Document Sync (initial_sync.py) → Fetch amendment documents from API (partial: ~6h, full: 100+h)
Step 3: Content + Embeddings (content_sync.py) → Extract text and generate vectors (~2-3 hours)
Note: Step 1 (Import Base Codes) is independent and can be run alone to get the foundational legal codes (~6,700 articles across 23 codes like Civil, Labor, Criminal). Steps 2-3 add amendment document coverage (partial 2022-2026: ~157k documents in ~6h, full 2011-present: ~1.6M documents in 100+ hours).
# Start services
cd docker && docker-compose up -d
# Verify services are running
docker-compose ps
# Should show: postgres (5433), qdrant (6333), redis (6380)

Quick Start: Run this first to get the 23 core legal codes that most users need (Civil Code, Labor Code, Criminal Code, etc.). This is completely independent from the document sync steps below.
# List all available codes
poetry run python scripts/import/import_base_code.py --list
# Import all codes
poetry run python scripts/import/import_base_code.py --all
# Import specific code
poetry run python scripts/import/import_base_code.py --code GK_RF
# Show all available options
poetry run python scripts/import/import_base_code.py --help

Available options:
| Option | Purpose |
|---|---|
| `--code CODE` | Import specific code (e.g., GK_RF, TK_RF, UK_RF) |
| `--all` | Import all 23 codes |
| `--source {auto,kremlin,pravo,government}` | Source to use (default: auto) |
| `--list` | List all available codes |
| `--verbose` | Enable verbose logging |
Independence Note: This step is completely independent from Steps 2-3. It can be run anytime, even in parallel.
Imports the foundational Russian legal codes from official sources.
Available codes:
| Code | Name (Russian) | Name (English) | kremlin.ru | pravo.gov.ru | government.ru |
|---|---|---|---|---|---|
| KONST_RF | Конституция Российской Федерации | Constitution | ✅ | ✅ | - |
| GK_RF | Гражданский кодекс | Civil Code Part 1 | ✅ | ✅ | - |
| GK_RF_2 | Гражданский кодекс ч.2 | Civil Code Part 2 | ✅ | ✅ | - |
| GK_RF_3 | Гражданский кодекс ч.3 | Civil Code Part 3 | ✅ | ✅ | - |
| GK_RF_4 | Гражданский кодекс ч.4 | Civil Code Part 4 | ✅ | ✅ | - |
| UK_RF | Уголовный кодекс | Criminal Code | ✅ | ✅ | - |
| TK_RF | Трудовой кодекс | Labor Code | ✅ | ✅ | - |
| NK_RF | Налоговый кодекс | Tax Code Part 1 | ✅ | ✅ | ✅ |
| NK_RF_2 | Налоговый кодекс ч.2 | Tax Code Part 2 | ✅ | ✅ | ✅ |
| KoAP_RF | Кодекс об административных правонарушениях | Administrative Code | ✅ | ✅ | ✅ |
| SK_RF | Семейный кодекс | Family Code | ✅ | ✅ | - |
| ZhK_RF | Жилищный кодекс | Housing Code | ✅ | ✅ | - |
| ZK_RF | Земельный кодекс | Land Code | ✅ | ✅ | ✅ |
| APK_RF | Арбитражный процессуальный кодекс | Arbitration Procedure Code | ✅ | - | ✅ |
| GPK_RF | Гражданский процессуальный кодекс | Civil Procedure Code | ✅ | - | ✅ |
| UPK_RF | Уголовно-процессуальный кодекс | Criminal Procedure Code | ✅ | - | ✅ |
| BK_RF | Бюджетный кодекс | Budget Code | - | - | ✅ |
| GRK_RF | Градостроительный кодекс | Urban Planning Code | - | - | ✅ |
| UIK_RF | Уголовно-исполнительный кодекс | Criminal Executive Code | - | - | ✅ |
| VZK_RF | Воздушный кодекс | Air Code | - | - | ✅ |
| VDK_RF | Водный кодекс | Water Code | - | - | ✅ |
| LK_RF | Лесной кодекс | Forest Code | - | - | ✅ |
| KAS_RF | Кодекс административного судопроизводства | Administrative Procedure Code | ✅ | - | - |
Import System Features:
- Automatic source fallback (kremlin → pravo → government)
- Context-based article number validation for fractional articles
- Quality checking to detect source formatting errors
- Hybrid validation using surrounding articles and known ranges
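The automatic source fallback above can be sketched as follows; the fetcher signature is a hypothetical stand-in for the project's real downloaders.

```python
SOURCE_ORDER = ["kremlin", "pravo", "government"]  # fallback order from above

def fetch_with_fallback(code_id: str, fetchers: dict):
    """Try each source in order and return (source, text) from the first
    one that succeeds; `fetchers` maps source name to a download callable."""
    last_error = None
    for source in SOURCE_ORDER:
        fetcher = fetchers.get(source)
        if fetcher is None:
            continue  # code not available from this source (see the table below)
        try:
            return source, fetcher(code_id)
        except Exception as exc:
            last_error = exc  # remember the failure and try the next source
    raise RuntimeError(f"All sources failed for {code_id}") from last_error
```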
Current Status:
- ✅ All 23 code identifiers registered in database
- ✅ 22 codes imported with 6,232 articles total
- ⚠️ KAS_RF (Administrative Procedure Code) not yet imported
- ✅ Metadata stored in `consolidated_codes` table
- ✅ Article validation with context-aware correction
Article counts by code:
| Code | Articles | Code | Articles | Code | Articles |
|---|---|---|---|---|---|
| GK_RF_2 | 667 | NK_RF_2 | 522 | UK_RF | 524 |
| GK_RF | 554 | APK_RF | 402 | TK_RF | 505 |
| GPK_RF | 471 | BK_RF | 110 | ZhK_RF | 227 |
| GK_RF_4 | 322 | ZK_RF | 181 | KoAP_RF | 224 |
| GK_RF_3 | 102 | UIK_RF | 222 | VZK_RF | 163 |
| KONST_RF | 137 | SK_RF | 73 | VDK_RF | 79 |
| NK_RF | 250 | UPK_RF | 184 | GRK_RF | 100 |
| LK_RF | 213 | | | | |
Fetches document metadata from pravo.gov.ru and stores in PostgreSQL.
poetry run python scripts/sync/initial_sync.py --start-date YYYY-MM-DD --end-date YYYY-MM-DD

Options:
- `--start-date` - First document date to fetch (default: 2020-01-01)
- `--end-date` - Last document date to fetch (default: today)
- `--block` - Filter by publication block (e.g., `president`, `government`, `all`)
- `--batch-size` - Documents per batch (default: 30, API max: 30)
- `--daily` - Run daily sync (yesterday to today)
Examples:
# Sync all documents from 2022 onwards
poetry run python scripts/sync/initial_sync.py --start-date 2022-01-01
# Sync only federal government documents
poetry run python scripts/sync/initial_sync.py --block government
# Daily sync (for cron/scheduler)
poetry run python scripts/sync/initial_sync.py --daily

Current Status:
- ✅ 1,134,985 documents synced (2019-2026)
- ✅ 47,871 documents with full text content (4.22% coverage)
- ⚠️ Date range filtering has issues - see TODO.md
- 📊 Document distribution by year:
- 2026: 10,998 (partial year)
- 2025: 177,080
- 2024: 176,091
- 2023: 178,453
- 2022: 178,685
- 2021: 163,509
- 2020: 162,315
- 2019: 87,854 (partial)
Extracts document text from API metadata and generates embeddings for semantic search.
# Parse content AND generate embeddings in one pass
poetry run python scripts/sync/content_sync.py --recreate-collection

Current Status:
- ✅ 47,871 documents with full text (4.22% of total)
- ⚠️ Embeddings: Only 265 vectors in Qdrant (test data)
- ⚠️ Full content sync needed for semantic search
Estimated time for full sync (1.1M documents):
- Content parsing: 4-6 hours
- Embeddings (RTX 3060): 12-20 hours
- Total: ~16-26 hours
# Test with 100 documents first
poetry run python scripts/sync/content_sync.py --limit 100 --recreate-collection
# Only parse content (skip embeddings) - for testing
poetry run python scripts/sync/content_sync.py --skip-embeddings
# Only generate embeddings from existing content (resume mode)
poetry run python scripts/sync/content_sync.py --skip-content --recreate-collection

All Options:
| Option | Purpose |
|---|---|
| `--limit N` | Process only N documents (testing) |
| `--skip-content` | Skip parsing, use existing content |
| `--skip-embeddings` | Skip embedding generation |
| `--recreate-collection` | Clear Qdrant and start fresh |
- Fetches documents from PostgreSQL (by publish_date DESC)
- Parses content from API metadata (title, name, complexName)
- Stores content in `document_content` table
- Generates embeddings using deepvk/USER2-base (GPU-accelerated)
- Stores vectors in Qdrant for semantic search
- Automatic cleanup - memory management, GPU cache clearing
The script includes automatic memory cleanup:
- Clears embedding cache after each document
- Forces garbage collection every 50 documents
- Clears CUDA cache (if using GPU)
- Logs memory usage throughout
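The cleanup behaviour above roughly corresponds to a loop like this sketch; `embed` and `cache` are hypothetical stand-ins for the project's embedder and its cache.

```python
import gc

GC_EVERY = 50  # forced garbage collection interval, matching the note above

def process_documents(documents, embed, cache):
    """Embed documents with the periodic cleanup described above."""
    for i, doc in enumerate(documents, start=1):
        embed(doc)
        cache.clear()          # clear embedding cache after each document
        if i % GC_EVERY == 0:
            gc.collect()       # force garbage collection every 50 documents
            try:
                import torch   # clear CUDA cache when running on GPU
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
            except ImportError:
                pass           # CPU-only environment
```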
Expected memory usage:
- Model loading: ~500MB (CPU) / ~800MB (GPU)
- Per document: ~10-50MB spikes during processing
- Peaks around 2-3GB with RTX 3060
# Watch the logs
poetry run python scripts/sync/content_sync.py --recreate-collection
# You'll see output like:
# [DOC 1/156000] Постановление Правительства... (0 chars, type: xxx)
# [DOC 2/156000] Указ Президента... (0 chars, type: xxx)
# ...
# [MEMORY] After 0 documents: 1234.5 MB
# Batch complete. Total chunks: 1234

# Check content in database
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
COUNT(*) as total,
COUNT(full_text) as with_content
FROM document_content;
"
# Check Qdrant collection
curl http://localhost:6333/collections/law_chunks

# Build and test MCP server
npm run build
npx @modelcontextprotocol/inspector node dist/index.js

Test queries in MCP Inspector:
- List all codes:
  {"name": "get-code-structure"}
- Get specific code structure:
  {"name": "get-code-structure", "arguments": {"code_id": "TK_RF", "include_articles": true, "article_limit": 10}}
- Get specific article:
  {"name": "get-article-version", "arguments": {"code_id": "TK_RF", "article_number": "80"}}
- Search laws:
  {"name": "query-laws", "arguments": {"query": "трудовой договор", "limit": 5}}
- Get statistics:
  {"name": "get-statistics"}
# Document counts by year
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
EXTRACT(YEAR FROM publish_date) as year,
COUNT(*) as count
FROM documents
GROUP BY EXTRACT(YEAR FROM publish_date)
ORDER BY year DESC;
"
# Content coverage
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
COUNT(*) as total_documents,
COUNT(dc.full_text) as with_content,
COUNT(dc.full_text) * 100.0 / COUNT(*) as coverage_percent
FROM documents d
LEFT JOIN document_content dc ON d.id = dc.document_id;
"
# Code article counts
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
code_id,
COUNT(*) as articles,
COUNT(*) FILTER (WHERE is_current = true) as current,
COUNT(*) FILTER (WHERE is_repealed = true) as repealed
FROM code_article_versions
GROUP BY code_id
ORDER BY code_id;
"

Fix: Already fixed - `documents.name` is now TEXT type
Solutions:
- Reduce `EMBEDDING_BATCH_SIZE` in `.env` (default: 32)
- Close other GPU applications
- Use CPU: set `EMBEDDING_DEVICE=cpu` in `.env`
Fix: Ensure Docker services are running:
cd docker && docker-compose up -d

Check: API metadata structure - some documents only have titles
Solution: The parser uses title || name || complexName fallback
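The fallback chain can be written as a standalone sketch (field names as given in this guide):

```python
def extract_text(meta: dict) -> str:
    """Apply the title || name || complexName fallback described above."""
    for field in ("title", "name", "complexName"):
        value = meta.get(field)
        if value:                 # skip missing or empty fields
            return value
    return ""                     # document carries no usable text
```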
Set up a cron job to sync new documents daily:
# Add to crontab (crontab -e)
0 2 * * * cd /path/to/law7 && poetry run python scripts/sync/initial_sync.py --daily

If you update the embedding model:
poetry run python scripts/sync/content_sync.py --recreate-collection

The project includes comprehensive backup/restore scripts for PostgreSQL and Qdrant:
cd docker
./backup.sh # Create full backup (PostgreSQL + Qdrant)
./restore.sh law7_backup_YYYYMMDD_HHMMSS # Restore from backup
./check-backup.sh backups/law7_backup_*.tar.gz # Verify backup integrity

See docs/BACKUP_RESTORE.md for complete backup/restore documentation.
Quick manual backup (PostgreSQL only):
# Backup PostgreSQL
docker exec law7-postgres pg_dump -U law7 law7 > backup.sql
# Restore
docker exec -i law7-postgres psql -U law7 law7 < backup.sql

The data pipeline includes scrapers for official ministry letters and interpretations from Russian government agencies.
| Agency | Short Name | Source | Status |
|---|---|---|---|
| Ministry of Finance | Минфин | minfin.gov.ru | ✅ Complete |
| Federal Tax Service | ФНС | nalog.gov.ru | ✅ Complete |
| Federal Labor Service | Роструд | rostrud.gov.ru | ✅ Complete |
# Import from specific agency (last 5 years)
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency minfin
# Import from Rostrud
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --since 2020-01-01
# Import without date filter
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --all
# Import from all agencies
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py
# Test with limit
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --limit 10

Options:
- `--agency {minfin,fns,rostrud}` - Specific agency to import
- `--since YYYY-MM-DD` - Only import letters after this date
- `--all` - Import all letters without date filter
- `--limit N` - Limit to N letters (testing)
- `--source {answers,general_documents}` - Minfin source (answers or general documents)
Features:
- Batch inserts (500 records per batch)
- Upsert with unique constraint to avoid duplicates
- Rate limiting (10s pause every 100 documents)
- Resume capability for failed documents
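The batching and rate-limiting features above can be sketched as follows. Only the table name `official_interpretations` and the numeric limits come from this guide; the SQL column list and the unique constraint are assumptions.

```python
import time

BATCH_SIZE = 500      # batch inserts of 500 records per batch
PAUSE_EVERY = 100     # rate limiting: pause every 100 documents
PAUSE_SECONDS = 10

# Illustrative upsert: column names and the (agency, letter_number)
# unique constraint are assumptions, not the project's actual schema.
UPSERT_SQL = """
INSERT INTO official_interpretations (agency, letter_number, letter_date, text)
VALUES (%s, %s, %s, %s)
ON CONFLICT (agency, letter_number) DO UPDATE SET text = EXCLUDED.text
"""

def batched(records, size=BATCH_SIZE):
    """Yield fixed-size slices for executemany-style batch inserts."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def polite_iter(docs, pause_every=PAUSE_EVERY, pause_s=PAUSE_SECONDS, sleep=time.sleep):
    """Yield documents, sleeping after every `pause_every` items."""
    for i, doc in enumerate(docs, start=1):
        yield doc
        if i % pause_every == 0:
            sleep(pause_s)
```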
Current Status:
- ✅ Minfin: ~30,000 general documents + 11 Q&A topics
- ✅ FNS: Search API integration with Actual-only filter
- ✅ Rostrud: Documents listing with pagination (PAGEN_1)
- ✅ Database table: `official_interpretations`
# Sync Constitutional Court decisions
poetry run python scripts/sync/court_sync.py --start-date 2022-01-01 --end-date 2024-12-31

Status: Selenium WebDriver implementation with Russian IP requirement
The SUDRF scraper (sudrf_scraper.py) fetches court decisions from the Russian
State Automated System "Justice" (ГАС РФ "Правосудие").
Implementation:
- Selenium WebDriver with Chrome (via webdriver-manager)
- Official search portal: https://sudrf.ru/index.php?id=300&searchtype=sp
- Anti-detection Chrome options (headless, no-sandbox, disable-blink-features)
- Form submission pattern for search queries
- Result parsing with BeautifulSoup
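The anti-detection setup above can be sketched like this; the exact flag spellings (e.g. `--headless=new`) are assumptions beyond the option names listed in the implementation notes, and selenium plus webdriver-manager must be installed.

```python
# Chrome flags mirroring the anti-detection options listed above.
ANTI_DETECTION_FLAGS = [
    "--headless=new",
    "--no-sandbox",
    "--disable-blink-features=AutomationControlled",
]

def build_chrome_options():
    """Attach the flags to a Selenium Options object (import deferred so
    the module loads even where selenium is not installed)."""
    from selenium.webdriver.chrome.options import Options
    opts = Options()
    for flag in ANTI_DETECTION_FLAGS:
        opts.add_argument(flag)
    return opts
```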
Current Limitations:
- Russian IP Required 🇷🇺
  - SUDRF blocks access from outside Russia
  - Error: "недоступна" (unavailable)
  - Solution: Use Russian VPS (Yandex Cloud, Selectel)
- Strong Anti-Bot Protection
  - Browser fingerprinting detects automation
  - Even Selenium with anti-detection measures may be blocked
  - May require CAPTCHA solving
- Geographic Restrictions
  - Blocks non-Russian IP addresses at infrastructure level
  - Cannot bypass with headers alone
Production Recommendations:
| Approach | Description | Complexity |
|---|---|---|
| Russian Server | Run scraper from Russian VPS/proxy | Medium |
| Commercial API | parser-api.com/sudrf, api-assist.com/api/sudrf | Low |
| Regional Portals | Use individual court websites (less restricted) | Medium |
| Official Access | Institutional API access from SUDRF | High |
Testing (from Russian IP only):
# Test SUDRF scraper (requires Russian IP)
poetry run python scripts/tests/test_sudrf_scraper.py
# Expected output: Found N decisions
# If blocked: "недоступна - ошибка 403"

Code Reference:
Database Table: court_decisions
- Use GPU for embeddings - 10x faster than CPU
- Batch size: 30 is API max, don't increase
- Embedding batch size: 32 works well for RTX 3060 12GB
- Skip documents >100KB - Automatically skipped to prevent timeouts
- Memory monitoring - Script logs usage every 50 docs
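The >100KB skip mentioned above can be sketched as a simple size filter; the exact byte threshold and the helper name are illustrative assumptions.

```python
MAX_DOC_BYTES = 100_000  # ~100KB cutoff; exact threshold is an assumption

def filter_oversized(docs):
    """Split documents into (processable, skipped) by UTF-8 size,
    mirroring the automatic skip that prevents embedding timeouts."""
    keep, skipped = [], []
    for doc in docs:
        target = skipped if len(doc.encode("utf-8")) > MAX_DOC_BYTES else keep
        target.append(doc)
    return keep, skipped
```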
| Script | Purpose |
|---|---|
| `scripts/sync/initial_sync.py` | Fetch document metadata |
| `scripts/sync/content_sync.py` | Parse content + generate embeddings |
| `scripts/sync/fetch_amendment_content.py` | Fetch detailed amendment text |
| `scripts/sync/court_sync.py` | Fetch court decisions from pravo.gov.ru |
| `scripts/import/import_base_code.py` | Import base legal codes |
| `scripts/country_modules/russia/import/import_ministry_letters.py` | Import ministry letters (Minfin, FNS, Rostrud) |
| `scripts/crawler/pravo_api_client.py` | API client for pravo.gov.ru |
| `scripts/country_modules/russia/scrapers/ministry_scraper.py` | Scraper for ministry letters (Phase 7C) |
| `scripts/country_modules/russia/scrapers/sudrf_scraper.py` | Scraper for SUDRF general jurisdiction courts |
| `scripts/parser/html_parser.py` | Parse content from API metadata |
| `scripts/parser/court_decision_parser.py` | Parse court decisions and extract legal citations |
| `scripts/indexer/embeddings.py` | Generate embeddings with deepvk/USER2-base |
| `scripts/indexer/qdrant_indexer.py` | Store embeddings in Qdrant |
| `docker/backup.sh` | Create database backup (PostgreSQL + Qdrant) |
| `docker/restore.sh` | Restore database from backup |
| `docker/check-backup.sh` | Verify backup integrity |