
Law7 Data Pipeline Guide

This guide explains how to populate and maintain the Law7 database with legal documents from official Russian government sources.

Official Data Sources

The Law7 data pipeline uses only official Russian government sources for legal documents:

| Source | URL | Purpose | Documents |
|---|---|---|---|
| pravo.gov.ru | http://pravo.gov.ru/ | Official Russian legal publication portal (primary API source) | Federal laws, presidential decrees, government resolutions |
| kremlin.ru | http://kremlin.ru/ | Presidential administration website | Constitution, presidential decrees (KONST_RF) |
| government.ru | http://government.ru/ | Government website | Government resolutions, procedure codes (APK_RF, GPK_RF, UPK_RF) |

Source Details

pravo.gov.ru (Primary)

  • Official portal for legal publications
  • Provides REST API for document access
  • Contains federal laws, decrees, resolutions from 2011-present
  • Document metadata includes: type, number, date, signing authority
  • API endpoint: http://publication.pravo.gov.ru/api

kremlin.ru

  • Source for the Russian Constitution
  • Presidential decrees and orders
  • Used for importing KONST_RF code

government.ru

  • Government resolutions and orders
  • Procedural codes (arbitration, civil procedure, criminal procedure)
  • Used for importing APK_RF, GPK_RF, UPK_RF codes

Data Quality

All documents are:

  • Sourced from official government websites only
  • Hashed for verification
  • Timestamped with download date
  • Cross-referenced with permanent URLs
  • Updated as amendments are published
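
The hashing and timestamping steps above can be sketched as follows. This is an illustrative sketch only: the function name and record fields are assumptions, not the project's actual schema.

```python
import hashlib
from datetime import datetime, timezone

def make_verification_record(document_bytes: bytes, source_url: str) -> dict:
    """Hash a downloaded document and record its provenance (a sketch;
    field names are illustrative, not the project's actual schema)."""
    return {
        "sha256": hashlib.sha256(document_bytes).hexdigest(),
        "source_url": source_url,
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Re-hashing the stored bytes later and comparing against the recorded digest verifies that a document has not been altered since download.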

Minimum Setup for AI Users

For AI assistants via MCP: You only need these 3 steps to get started.

# 0. Setup MCP settings for AI

# 1. Start Docker services (PostgreSQL, Qdrant, Redis)
cd docker && docker-compose up -d

# 2. Import base legal codes (~6 hours, ~6,700 articles across 23 codes)
poetry run python scripts/import/import_base_code.py --all

# 3. Build and start the MCP server
npm run build
npm start

What you get after these 3 steps:

  • ✅ 23 consolidated Russian legal codes (Civil, Labor, Criminal, etc.)
  • get-code-structure - Browse code structure and articles
  • get-article-version - Get specific article text
  • get-statistics - Database statistics

Optional: For full semantic search across 157k+ amendment documents:

# Step 4: Fetch amendment documents from API (~6 hours for 2022-2026)
poetry run python scripts/sync/initial_sync.py --start-date 2022-01-01

# Step 5: Generate embeddings for semantic search (~2-3 hours)
poetry run python scripts/sync/content_sync.py --recreate-collection

After Steps 4-5, you also get:

  • query-laws - Semantic search across all documents

Overview

The data pipeline consists of three main stages:

Quick Start Order (recommended for first-time users):

Step 1: Import Base Codes (import_base_code.py)  → Import 23 legal codes (~6 hours, ~6,700 articles)
Step 2: Document Sync (initial_sync.py)         → Fetch amendment documents from API (partial: ~6h, full: 100+h)
Step 3: Content + Embeddings (content_sync.py)   → Extract text and generate vectors (~2-3 hours)

Note: Step 1 (Import Base Codes) is independent and can be run alone to get the foundational legal codes (~6,700 articles across 23 codes like Civil, Labor, Criminal). Steps 2-3 add amendment document coverage (partial 2022-2026: ~157k documents in ~6h, full 2011-present: ~1.6M documents in 100+ hours).

Prerequisites

# Start services
cd docker && docker-compose up -d

# Verify services are running
docker-compose ps
# Should show: postgres (5433), qdrant (6333), redis (6380)
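
The same check can be done programmatically. This is a minimal sketch assuming the host ports from the compose file (5433, 6333, 6380) and localhost binding; adjust if your setup differs.

```python
import socket

# Host ports from docker-compose (an assumption if your setup differs)
SERVICE_PORTS = {"postgres": 5433, "qdrant": 6333, "redis": 6380}

def check_services(host: str = "localhost", timeout: float = 1.0) -> dict:
    """Return a name -> bool map of which services accept TCP connections."""
    status = {}
    for name, port in SERVICE_PORTS.items():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            status[name] = sock.connect_ex((host, port)) == 0
    return status
```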

Step 1: Import Base Legal Codes (Quick Start - Do This First!)

Quick Start: Run this first to get the 23 core legal codes that most users need (Civil Code, Labor Code, Criminal Code, etc.). This is completely independent from the document sync steps below.

# List all available codes
poetry run python scripts/import/import_base_code.py --list

# Import all codes
poetry run python scripts/import/import_base_code.py --all

# Import specific code
poetry run python scripts/import/import_base_code.py --code GK_RF

# Show all available options
poetry run python scripts/import/import_base_code.py --help

Available options:

| Option | Purpose |
|---|---|
| --code CODE | Import specific code (e.g., GK_RF, TK_RF, UK_RF) |
| --all | Import all 23 codes |
| --source {auto,kremlin,pravo,government} | Source to use (default: auto) |
| --list | List all available codes |
| --verbose | Enable verbose logging |

Independence Note: This step is completely independent from Steps 2-3. It can be run anytime, even in parallel.


Imports the foundational Russian legal codes from official sources.

Available codes:

Code Name (Russian) Name (English) kremlin.ru pravo.gov.ru government.ru
KONST_RF Конституция Российской Федерации Constitution -
GK_RF Гражданский кодекс Civil Code Part 1 -
GK_RF_2 Гражданский кодекс ч.2 Civil Code Part 2 -
GK_RF_3 Гражданский кодекс ч.3 Civil Code Part 3 -
GK_RF_4 Гражданский кодекс ч.4 Civil Code Part 4 -
UK_RF Уголовный кодекс Criminal Code -
TK_RF Трудовой кодекс Labor Code -
NK_RF Налоговый кодекс Tax Code Part 1
NK_RF_2 Налоговый кодекс ч.2 Tax Code Part 2
KoAP_RF Кодекс об административных правонарушениях Administrative Code
SK_RF Семейный кодекс Family Code -
ZhK_RF Жилищный кодекс Housing Code -
ZK_RF Земельный кодекс Land Code
APK_RF Арбитражный процессуальный кодекс Arbitration Procedure Code -
GPK_RF Гражданский процессуальный кодекс Civil Procedure Code -
UPK_RF Уголовно-процессуальный кодекс Criminal Procedure Code -
BK_RF Бюджетный кодекс Budget Code - -
GRK_RF Градостроительный кодекс Urban Planning Code - -
UIK_RF Уголовно-исполнительный кодекс Criminal Executive Code - -
VZK_RF Воздушный кодекс Air Code - -
VDK_RF Водный кодекс Water Code - -
LK_RF Лесной кодекс Forest Code - -
KAS_RF Кодекс административного судопроизводства Administrative Procedure Code - -

Import System Features:

  • Automatic source fallback (kremlin → pravo → government)
  • Context-based article number validation for fractional articles
  • Quality checking to detect source formatting errors
  • Hybrid validation using surrounding articles and known ranges
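
The context-based validation above can be sketched as a neighbour check. The function name, the float representation of fractional article numbers (e.g. 14.1), and the exact rule are illustrative assumptions, not the importer's actual code:

```python
def is_plausible_article_number(prev_num: float, candidate: float, next_num: float) -> bool:
    """Context-based sanity check (a sketch): an article number should fall
    strictly between its neighbours in document order, which catches source
    formatting errors such as a fractional article 14.1 mangled into 141."""
    return prev_num < candidate < next_num
```

For example, `is_plausible_article_number(14, 141, 15)` returns `False`, flagging the mangled fractional article for correction.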

Current Status:

  • ✅ All 23 code identifiers registered in the database
  • ✅ 22 codes imported with 6,232 articles total
  • ⚠️ KAS_RF (Administrative Procedure Code) not yet imported
  • ✅ Metadata stored in consolidated_codes table
  • ✅ Article validation with context-aware correction

Article counts by code:

| Code | Articles | Code | Articles | Code | Articles |
|---|---|---|---|---|---|
| GK_RF_2 | 667 | NK_RF_2 | 522 | UK_RF | 524 |
| GK_RF | 554 | APK_RF | 402 | TK_RF | 505 |
| GPK_RF | 471 | BK_RF | 110 | ZhK_RF | 227 |
| GK_RF_4 | 322 | ZK_RF | 181 | KoAP_RF | 224 |
| GK_RF_3 | 102 | UIK_RF | 222 | VZK_RF | 163 |
| KONST_RF | 137 | SK_RF | 73 | VDK_RF | 79 |
| NK_RF | 250 | UPK_RF | 184 | GRK_RF | 100 |
| LK_RF | 213 | | | | |

Step 2: Document Metadata Sync

Fetches document metadata from pravo.gov.ru and stores in PostgreSQL.

poetry run python scripts/sync/initial_sync.py --start-date YYYY-MM-DD --end-date YYYY-MM-DD

Options:

  • --start-date - First document date to fetch (default: 2020-01-01)
  • --end-date - Last document date to fetch (default: today)
  • --block - Filter by publication block (e.g., president, government, all)
  • --batch-size - Documents per batch (default: 30, API max: 30)
  • --daily - Run daily sync (yesterday to today)
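
Because the API caps each request at 30 documents, larger result sets must be fetched in batches. A minimal batching helper (illustrative, not the script's actual code):

```python
def batched(items: list, batch_size: int = 30):
    """Yield successive batches of at most batch_size items (API max: 30)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Usage: `for batch in batched(document_ids): fetch(batch)`.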

Examples:

# Sync all documents from 2022 onwards
poetry run python scripts/sync/initial_sync.py --start-date 2022-01-01

# Sync only federal government documents
poetry run python scripts/sync/initial_sync.py --block government

# Daily sync (for cron/scheduler)
poetry run python scripts/sync/initial_sync.py --daily

Current Status:

  • ✅ 1,134,985 documents synced (2019-2026)
  • ✅ 47,871 documents with full text content (4.22% coverage)
  • ⚠️ Date range filtering has issues - see TODO.md
  • 📊 Document distribution by year:
    • 2026: 10,998 (partial year)
    • 2025: 177,080
    • 2024: 176,091
    • 2023: 178,453
    • 2022: 178,685
    • 2021: 163,509
    • 2020: 162,315
    • 2019: 87,854 (partial)

Step 3: Content Parsing + Embeddings

Extracts document text from API metadata and generates embeddings for semantic search.

Quick Start (Recommended)

# Parse content AND generate embeddings in one pass
poetry run python scripts/sync/content_sync.py --recreate-collection

Current Status:

  • ✅ 47,871 documents with full text (4.22% of total)
  • ⚠️ Embeddings: Only 265 vectors in Qdrant (test data)
  • ⚠️ Full content sync needed for semantic search

Estimated time for full sync (1.1M documents):

  • Content parsing: 4-6 hours
  • Embeddings (RTX 3060): 12-20 hours
  • Total: ~16-26 hours

Advanced Options

# Test with 100 documents first
poetry run python scripts/sync/content_sync.py --limit 100 --recreate-collection

# Only parse content (skip embeddings) - for testing
poetry run python scripts/sync/content_sync.py --skip-embeddings

# Only generate embeddings from existing content (resume mode)
poetry run python scripts/sync/content_sync.py --skip-content --recreate-collection

All Options:

| Option | Purpose |
|---|---|
| --limit N | Process only N documents (testing) |
| --skip-content | Skip parsing, use existing content |
| --skip-embeddings | Skip embedding generation |
| --recreate-collection | Clear Qdrant and start fresh |

What the Script Does

  1. Fetches documents from PostgreSQL (by publish_date DESC)
  2. Parses content from API metadata (title, name, complexName)
  3. Stores content in document_content table
  4. Generates embeddings using deepvk/USER2-base (GPU-accelerated)
  5. Stores vectors in Qdrant for semantic search
  6. Automatic cleanup - memory management, GPU cache clearing
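
Step 5 stores per-chunk vectors (the Qdrant collection is law_chunks), so parsed text must first be split into overlapping chunks. A minimal chunker sketch; the chunk size and overlap values here are assumptions, not the script's actual parameters:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split document text into overlapping chunks for embedding (a sketch;
    the size and overlap values are illustrative assumptions)."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

The overlap keeps a sentence that straddles a chunk boundary fully inside at least one chunk, which helps semantic search recall.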

Memory Management

The script includes automatic memory cleanup:

  • Clears embedding cache after each document
  • Forces garbage collection every 50 documents
  • Clears CUDA cache (if using GPU)
  • Logs memory usage throughout
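
The cleanup cadence described above can be sketched like this. The torch usage is guarded so the sketch also runs on CPU-only installs; the function name and cadence constant are illustrative, not the script's actual code:

```python
import gc

def periodic_cleanup(docs_processed: int, every: int = 50) -> None:
    """Force garbage collection every `every` documents and clear the CUDA
    cache when a GPU build of torch is present (a sketch of the script's
    memory-management behaviour)."""
    if docs_processed % every != 0:
        return
    gc.collect()
    try:
        import torch  # optional dependency; skipped on CPU-only installs
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
```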

Expected memory usage:

  • Model loading: ~500MB (CPU) / ~800MB (GPU)
  • Per document: ~10-50MB spikes during processing
  • Peaks around 2-3GB with RTX 3060

Monitoring Progress

# Watch the logs
poetry run python scripts/sync/content_sync.py --recreate-collection

# You'll see output like:
# [DOC 1/156000] Постановление Правительства... (0 chars, type: xxx)
# [DOC 2/156000] Указ Президента... (0 chars, type: xxx)
# ...
# [MEMORY] After 0 documents: 1234.5 MB
# Batch complete. Total chunks: 1234

Checking Results

# Check content in database
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
  COUNT(*) as total,
  COUNT(full_text) as with_content
FROM document_content;
"

# Check Qdrant collection
curl http://localhost:6333/collections/law_chunks

Verification

MCP Tools Testing

# Build and test MCP server
npm run build
npx @modelcontextprotocol/inspector node dist/index.js

Test queries in MCP Inspector:

  1. List all codes:

    {"name": "get-code-structure"}
  2. Get specific code structure:

    {
      "name": "get-code-structure",
      "arguments": {"code_id": "TK_RF", "include_articles": true, "article_limit": 10}
    }
  3. Get specific article:

    {
      "name": "get-article-version",
      "arguments": {"code_id": "TK_RF", "article_number": "80"}
    }
  4. Search laws:

    {
      "name": "query-laws",
      "arguments": {"query": "трудовой договор", "limit": 5}
    }
  5. Get statistics:

    {"name": "get-statistics"}

Database Statistics

# Document counts by year
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
  EXTRACT(YEAR FROM publish_date) as year,
  COUNT(*) as count
FROM documents
GROUP BY EXTRACT(YEAR FROM publish_date)
ORDER BY year DESC;
"

# Content coverage
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
  COUNT(*) as total_documents,
  COUNT(dc.full_text) as with_content,
  COUNT(dc.full_text) * 100.0 / COUNT(*) as coverage_percent
FROM documents d
LEFT JOIN document_content dc ON d.id = dc.document_id;
"

# Code article counts
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
  code_id,
  COUNT(*) as articles,
  COUNT(*) FILTER (WHERE is_current = true) as current,
  COUNT(*) FILTER (WHERE is_repealed = true) as repealed
FROM code_article_versions
GROUP BY code_id
ORDER BY code_id;
"

Troubleshooting

Issue: "value too long for type character varying(1000)"

Fix: already fixed; the documents.name column is now of type TEXT

Issue: "CUDA out of memory"

Solutions:

  • Reduce EMBEDDING_BATCH_SIZE in .env (default: 32)
  • Close other GPU applications
  • Use CPU: set EMBEDDING_DEVICE=cpu in .env

Issue: "Connection refused" errors

Fix: Ensure Docker services are running:

cd docker && docker-compose up -d

Issue: Content parsing returns empty strings

Check: the API metadata structure; some documents only have titles.
Solution: the parser uses a title || name || complexName fallback.
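
That fallback amounts to a one-liner. A sketch (the function name is illustrative, not the parser's actual API):

```python
def parse_content(metadata: dict) -> str:
    """Mirror the parser's title || name || complexName fallback:
    return the first non-empty field, or an empty string."""
    return metadata.get("title") or metadata.get("name") or metadata.get("complexName") or ""
```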

Maintenance

Daily Sync (Recommended)

Set up a cron job to sync new documents daily:

# Add to crontab (crontab -e)
0 2 * * * cd /path/to/law7 && poetry run python scripts/sync/initial_sync.py --daily

Regenerate Embeddings

If you update the embedding model:

poetry run python scripts/sync/content_sync.py --recreate-collection

Backup Database

The project includes comprehensive backup/restore scripts for PostgreSQL and Qdrant:

cd docker
./backup.sh                 # Create full backup (PostgreSQL + Qdrant)
./restore.sh law7_backup_YYYYMMDD_HHMMSS    # Restore from backup
./check-backup.sh backups/law7_backup_*.tar.gz   # Verify backup integrity

See docs/BACKUP_RESTORE.md for complete backup/restore documentation.

Quick manual backup (PostgreSQL only):

# Backup PostgreSQL
docker exec law7-postgres pg_dump -U law7 law7 > backup.sql

# Restore
docker exec -i law7-postgres psql -U law7 law7 < backup.sql

Ministry Letters Import (Phase 7C)

The data pipeline includes scrapers for official ministry letters and interpretations from Russian government agencies.

Supported Agencies

| Agency | Short Name | Source | Status |
|---|---|---|---|
| Ministry of Finance | Минфин | minfin.gov.ru | ✅ Complete |
| Federal Tax Service | ФНС | nalog.gov.ru | ✅ Complete |
| Federal Labor Service | Роструд | rostrud.gov.ru | ✅ Complete |

Import Ministry Letters

# Import from specific agency (last 5 years)
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency minfin

# Import from Rostrud
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --since 2020-01-01

# Import without date filter
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --all

# Import from all agencies
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py

# Test with limit
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --limit 10

Options:

  • --agency {minfin,fns,rostrud} - Specific agency to import
  • --since YYYY-MM-DD - Only import letters after this date
  • --all - Import all letters without date filter
  • --limit N - Limit to N letters (testing)
  • --source {answers,general_documents} - Minfin source (answers or general documents)

Features:

  • Batch inserts (500 records per batch)
  • Upsert with unique constraint to avoid duplicates
  • Rate limiting (10s pause every 100 documents)
  • Resume capability for failed documents
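
A batched upsert against a unique constraint can be sketched as below. This uses sqlite3 purely for a self-contained illustration; the project stores letters in PostgreSQL, where the equivalent statement is INSERT ... ON CONFLICT, and the table and column names here are assumptions.

```python
import sqlite3

def upsert_letters(conn: sqlite3.Connection, letters: list[tuple], batch_size: int = 500) -> None:
    """Insert letters in batches of 500, updating rows that hit the
    unique (agency, number) constraint instead of duplicating them."""
    sql = """
        INSERT INTO letters (agency, number, title) VALUES (?, ?, ?)
        ON CONFLICT(agency, number) DO UPDATE SET title = excluded.title
    """
    for start in range(0, len(letters), batch_size):
        conn.executemany(sql, letters[start:start + batch_size])
        conn.commit()
```

Committing per batch keeps transactions small, so a failed run can resume from the last committed batch.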

Current Status:

  • ✅ Minfin: ~30,000 general documents + 11 Q&A topics
  • ✅ FNS: Search API integration with Actual-only filter
  • ✅ Rostrud: Documents listing with pagination (PAGEN_1)
  • ✅ Database table: official_interpretations

Court Decisions (Phase 7)

Constitutional Court (pravo.gov.ru)

# Sync Constitutional Court decisions
poetry run python scripts/sync/court_sync.py --start-date 2022-01-01 --end-date 2024-12-31

General Jurisdiction Courts (SUDRF)

Status: Selenium WebDriver implementation with Russian IP requirement

The SUDRF scraper (sudrf_scraper.py) fetches court decisions from the Russian State Automated System "Justice" (ГАС РФ "Правосудие").

Implementation:

  • Selenium WebDriver with Chrome (via webdriver-manager)
  • Official search portal: https://sudrf.ru/index.php?id=300&searchtype=sp
  • Anti-detection Chrome options (headless, no-sandbox, disable-blink-features)
  • Form submission pattern for search queries
  • Result parsing with BeautifulSoup

Current Limitations:

  1. Russian IP Required 🇷🇺

    • SUDRF blocks access from outside Russia
    • Error: "недоступна" (unavailable)
    • Solution: Use Russian VPS (Yandex Cloud, Selectel)
  2. Strong Anti-Bot Protection

    • Browser fingerprinting detects automation
    • Even Selenium with anti-detection measures may be blocked
    • May require CAPTCHA solving
  3. Geographic Restrictions

    • Blocks non-Russian IP addresses at infrastructure level
    • Cannot bypass with headers alone

Production Recommendations:

| Approach | Description | Complexity |
|---|---|---|
| Russian Server | Run scraper from Russian VPS/proxy | Medium |
| Commercial API | parser-api.com/sudrf, api-assist.com/api/sudrf | Low |
| Regional Portals | Use individual court websites (less restricted) | Medium |
| Official Access | Institutional API access from SUDRF | High |

Testing (from Russian IP only):

# Test SUDRF scraper (requires Russian IP)
poetry run python scripts/tests/test_sudrf_scraper.py

# Expected output: Found N decisions
# If blocked: "недоступна - ошибка 403" (unavailable - error 403)

Code Reference:

Database Table: court_decisions


Performance Tips

  1. Use GPU for embeddings - 10x faster than CPU
  2. Batch size: 30 is API max, don't increase
  3. Embedding batch size: 32 works well for RTX 3060 12GB
  4. Skip documents >100KB - Automatically skipped to prevent timeouts
  5. Memory monitoring - Script logs usage every 50 docs

File Reference

| Script | Purpose |
|---|---|
| scripts/sync/initial_sync.py | Fetch document metadata |
| scripts/sync/content_sync.py | Parse content + generate embeddings |
| scripts/sync/fetch_amendment_content.py | Fetch detailed amendment text |
| scripts/sync/court_sync.py | Fetch court decisions from pravo.gov.ru |
| scripts/import/import_base_code.py | Import base legal codes |
| scripts/country_modules/russia/import/import_ministry_letters.py | Import ministry letters (Minfin, FNS, Rostrud) |
| scripts/crawler/pravo_api_client.py | API client for pravo.gov.ru |
| scripts/country_modules/russia/scrapers/ministry_scraper.py | Scraper for ministry letters (Phase 7C) |
| scripts/country_modules/russia/scrapers/sudrf_scraper.py | Scraper for SUDRF general jurisdiction courts |
| scripts/parser/html_parser.py | Parse content from API metadata |
| scripts/parser/court_decision_parser.py | Parse court decisions and extract legal citations |
| scripts/indexer/embeddings.py | Generate embeddings with deepvk/USER2-base |
| scripts/indexer/qdrant_indexer.py | Store embeddings in Qdrant |
| docker/backup.sh | Create database backup (PostgreSQL + Qdrant) |
| docker/restore.sh | Restore database from backup |
| docker/check-backup.sh | Verify backup integrity |