
Law7 Data Pipeline Guide

This guide explains how to populate and maintain the Law7 database with legal documents from official Russian government sources.

Official Data Sources

The Law7 data pipeline uses only official Russian government sources for legal documents:

| Source | URL | Purpose | Documents |
|---|---|---|---|
| pravo.gov.ru | http://pravo.gov.ru/ | Official Russian legal publication portal (primary API source) | Federal laws, presidential decrees, government resolutions |
| kremlin.ru | http://kremlin.ru/ | Presidential administration website | Constitution, presidential decrees (KONST_RF) |
| government.ru | http://government.ru/ | Government website | Government resolutions, procedure codes (APK_RF, GPK_RF, UPK_RF) |

Source Details

pravo.gov.ru (Primary)

  • Official portal for legal publications
  • Provides REST API for document access
  • Contains federal laws, decrees, resolutions from 2011-present
  • Document metadata includes: type, number, date, signing authority
  • API endpoint: http://publication.pravo.gov.ru/api

kremlin.ru

  • Source for the Russian Constitution
  • Presidential decrees and orders
  • Used for importing KONST_RF code

government.ru

  • Government resolutions and orders
  • Procedural codes (arbitration, civil procedure, criminal procedure)
  • Used for importing APK_RF, GPK_RF, UPK_RF codes

Data Quality

All documents are:

  • Sourced from official government websites only
  • Hashed for verification
  • Timestamped with download date
  • Cross-referenced with permanent URLs
  • Updated as amendments are published
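
The hashing and timestamping steps above can be sketched as follows. This is an illustrative sketch only: the function name and record fields are assumptions, not the project's actual schema.

```python
import hashlib
from datetime import datetime, timezone

def make_verification_record(document_bytes: bytes, source_url: str) -> dict:
    """Hash a downloaded document and record its provenance (a sketch;
    field names are illustrative, not the project's actual schema)."""
    return {
        "sha256": hashlib.sha256(document_bytes).hexdigest(),
        "source_url": source_url,
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Re-hashing the stored bytes later and comparing against the recorded digest verifies that a document has not been altered since download.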

Minimum Setup for AI Users

For AI assistants via MCP: You only need these 3 steps to get started.

# 0. Setup MCP settings for AI

# 1. Start Docker services (PostgreSQL, Qdrant, Redis)
cd docker && docker-compose up -d

# 2. Import base legal codes (~6 hours, ~6,700 articles across 23 codes)
poetry run python scripts/import/import_base_code.py --all

# 3. Build and start the MCP server
npm run build
npm start

What you get after these 3 steps:

  • ✅ 23 consolidated Russian legal codes (Civil, Labor, Criminal, etc.)
  • get-code-structure - Browse code structure and articles
  • get-article-version - Get specific article text
  • get-statistics - Database statistics

Optional: For full semantic search across 157k+ amendment documents:

# Step 4: Fetch amendment documents from API (~6 hours for 2022-2026)
poetry run python scripts/sync/initial_sync.py --start-date 2022-01-01

# Step 5: Generate embeddings for semantic search (~2-3 hours)
poetry run python scripts/sync/content_sync.py --recreate-collection

After Steps 4-5, you also get:

  • query-laws - Semantic search across all documents

Overview

The data pipeline consists of three main stages:

Quick Start Order (recommended for first-time users):

Step 1: Import Base Codes (import_base_code.py)  → Import 23 legal codes (~6 hours, ~6,700 articles)
Step 2: Document Sync (initial_sync.py)         → Fetch amendment documents from API (partial: ~6h, full: 100+h)
Step 3: Content + Embeddings (content_sync.py)   → Extract text and generate vectors (~2-3 hours)

Note: Step 1 (Import Base Codes) is independent and can be run alone to get the foundational legal codes (~6,700 articles across 23 codes like Civil, Labor, Criminal). Steps 2-3 add amendment document coverage (partial 2022-2026: ~157k documents in ~6h, full 2011-present: ~1.6M documents in 100+ hours).

Prerequisites

# Start services
cd docker && docker-compose up -d

# Verify services are running
docker-compose ps
# Should show: postgres (5433), qdrant (6333), redis (6380)
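
The same check can be done programmatically. This is a minimal sketch assuming the host ports from the compose file (5433, 6333, 6380) and localhost binding; adjust if your setup differs.

```python
import socket

# Host ports from docker-compose (an assumption if your setup differs)
SERVICE_PORTS = {"postgres": 5433, "qdrant": 6333, "redis": 6380}

def check_services(host: str = "localhost", timeout: float = 1.0) -> dict:
    """Return a name -> bool map of which services accept TCP connections."""
    status = {}
    for name, port in SERVICE_PORTS.items():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            status[name] = sock.connect_ex((host, port)) == 0
    return status
```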

Step 1: Import Base Legal Codes (Quick Start - Do This First!)

Quick Start: Run this first to get the 23 core legal codes that most users need (Civil Code, Labor Code, Criminal Code, etc.). This is completely independent from the document sync steps below.

# List all available codes
poetry run python scripts/import/import_base_code.py --list

# Import all codes
poetry run python scripts/import/import_base_code.py --all

# Import specific code
poetry run python scripts/import/import_base_code.py --code GK_RF

# Show all available options
poetry run python scripts/import/import_base_code.py --help

Available options:

| Option | Purpose |
|---|---|
| --code CODE | Import specific code (e.g., GK_RF, TK_RF, UK_RF) |
| --all | Import all 23 codes |
| --source {auto,kremlin,pravo,government} | Source to use (default: auto) |
| --list | List all available codes |
| --verbose | Enable verbose logging |

Independence Note: This step is completely independent from Steps 2-3. It can be run anytime, even in parallel.


Imports the foundational Russian legal codes from official sources.

Available codes:

Code Name (Russian) Name (English) kremlin.ru pravo.gov.ru government.ru
KONST_RF Конституция Российской Федерации Constitution -
GK_RF Гражданский кодекс Civil Code Part 1 -
GK_RF_2 Гражданский кодекс ч.2 Civil Code Part 2 -
GK_RF_3 Гражданский кодекс ч.3 Civil Code Part 3 -
GK_RF_4 Гражданский кодекс ч.4 Civil Code Part 4 -
UK_RF Уголовный кодекс Criminal Code -
TK_RF Трудовой кодекс Labor Code -
NK_RF Налоговый кодекс Tax Code Part 1
NK_RF_2 Налоговый кодекс ч.2 Tax Code Part 2
KoAP_RF Кодекс об административных правонарушениях Administrative Code
SK_RF Семейный кодекс Family Code -
ZhK_RF Жилищный кодекс Housing Code -
ZK_RF Земельный кодекс Land Code
APK_RF Арбитражный процессуальный кодекс Arbitration Procedure Code -
GPK_RF Гражданский процессуальный кодекс Civil Procedure Code -
UPK_RF Уголовно-процессуальный кодекс Criminal Procedure Code -
BK_RF Бюджетный кодекс Budget Code - -
GRK_RF Градостроительный кодекс Urban Planning Code - -
UIK_RF Уголовно-исполнительный кодекс Criminal Executive Code - -
VZK_RF Воздушный кодекс Air Code - -
VDK_RF Водный кодекс Water Code - -
LK_RF Лесной кодекс Forest Code - -
KAS_RF Кодекс административного судопроизводства Administrative Procedure Code - -

Import System Features:

  • Automatic source fallback (kremlin → pravo → government)
  • Context-based article number validation for fractional articles
  • Quality checking to detect source formatting errors
  • Hybrid validation using surrounding articles and known ranges
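
The context-based validation above can be sketched as a neighbour check. The function name, the float representation of fractional article numbers (e.g. 14.1), and the exact rule are illustrative assumptions, not the importer's actual code:

```python
def is_plausible_article_number(prev_num: float, candidate: float, next_num: float) -> bool:
    """Context-based sanity check (a sketch): an article number should fall
    strictly between its neighbours in document order, which catches source
    formatting errors such as a fractional article 14.1 mangled into 141."""
    return prev_num < candidate < next_num
```

For example, `is_plausible_article_number(14, 141, 15)` returns `False`, flagging the mangled fractional article for correction.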

Current Status:

  • ✅ All 23 code identifiers registered in the database
  • ✅ 22 codes imported with 6,232 articles total
  • ⚠️ KAS_RF (Administrative Procedure Code) not yet imported
  • ✅ Metadata stored in consolidated_codes table
  • ✅ Article validation with context-aware correction

Article counts by code:

| Code | Articles | Code | Articles | Code | Articles |
|---|---|---|---|---|---|
| GK_RF_2 | 667 | NK_RF_2 | 522 | UK_RF | 524 |
| GK_RF | 554 | APK_RF | 402 | TK_RF | 505 |
| GPK_RF | 471 | BK_RF | 110 | ZhK_RF | 227 |
| GK_RF_4 | 322 | ZK_RF | 181 | KoAP_RF | 224 |
| GK_RF_3 | 102 | UIK_RF | 222 | VZK_RF | 163 |
| KONST_RF | 137 | SK_RF | 73 | VDK_RF | 79 |
| NK_RF | 250 | UPK_RF | 184 | GRK_RF | 100 |
| LK_RF | 213 | | | | |

Step 2: Document Metadata Sync

Fetches document metadata from pravo.gov.ru and stores in PostgreSQL.

poetry run python scripts/sync/initial_sync.py --start-date YYYY-MM-DD --end-date YYYY-MM-DD

Options:

  • --start-date - First document date to fetch (default: 2020-01-01)
  • --end-date - Last document date to fetch (default: today)
  • --block - Filter by publication block (e.g., president, government, all)
  • --batch-size - Documents per batch (default: 30, API max: 30)
  • --daily - Run daily sync (yesterday to today)
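
Because the API caps each request at 30 documents, larger result sets must be fetched in batches. A minimal batching helper (illustrative, not the script's actual code):

```python
def batched(items: list, batch_size: int = 30):
    """Yield successive batches of at most batch_size items (API max: 30)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Usage: `for batch in batched(document_ids): fetch(batch)`.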

Examples:

# Sync all documents from 2022 onwards
poetry run python scripts/sync/initial_sync.py --start-date 2022-01-01

# Sync only federal government documents
poetry run python scripts/sync/initial_sync.py --block government

# Daily sync (for cron/scheduler)
poetry run python scripts/sync/initial_sync.py --daily

Current Status:

  • ✅ 1,134,985 documents synced (2019-2026)
  • ✅ 47,871 documents with full text content (4.22% coverage)
  • ⚠️ Date range filtering has issues - see TODO.md
  • 📊 Document distribution by year:
    • 2026: 10,998 (partial year)
    • 2025: 177,080
    • 2024: 176,091
    • 2023: 178,453
    • 2022: 178,685
    • 2021: 163,509
    • 2020: 162,315
    • 2019: 87,854 (partial)

Step 3: Content Parsing + Embeddings

Extracts document text from API metadata and generates embeddings for semantic search.

Quick Start (Recommended)

# Parse content AND generate embeddings in one pass
poetry run python scripts/sync/content_sync.py --recreate-collection

Current Status:

  • ✅ 47,871 documents with full text (4.22% of total)
  • ⚠️ Embeddings: Only 265 vectors in Qdrant (test data)
  • ⚠️ Full content sync needed for semantic search

Estimated time for full sync (1.1M documents):

  • Content parsing: 4-6 hours
  • Embeddings (RTX 3060): 12-20 hours
  • Total: ~16-26 hours

Advanced Options

# Test with 100 documents first
poetry run python scripts/sync/content_sync.py --limit 100 --recreate-collection

# Only parse content (skip embeddings) - for testing
poetry run python scripts/sync/content_sync.py --skip-embeddings

# Only generate embeddings from existing content (resume mode)
poetry run python scripts/sync/content_sync.py --skip-content --recreate-collection

All Options:

| Option | Purpose |
|---|---|
| --limit N | Process only N documents (testing) |
| --skip-content | Skip parsing, use existing content |
| --skip-embeddings | Skip embedding generation |
| --recreate-collection | Clear Qdrant and start fresh |

What the Script Does

  1. Fetches documents from PostgreSQL (by publish_date DESC)
  2. Parses content from API metadata (title, name, complexName)
  3. Stores content in document_content table
  4. Generates embeddings using deepvk/USER2-base (GPU-accelerated)
  5. Stores vectors in Qdrant for semantic search
  6. Automatic cleanup - memory management, GPU cache clearing
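
Step 5 stores per-chunk vectors (the Qdrant collection is law_chunks), so parsed text must first be split into overlapping chunks. A minimal chunker sketch; the chunk size and overlap values here are assumptions, not the script's actual parameters:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split document text into overlapping chunks for embedding (a sketch;
    the size and overlap values are illustrative assumptions)."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

The overlap keeps a sentence that straddles a chunk boundary fully inside at least one chunk, which helps semantic search recall.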

Memory Management

The script includes automatic memory cleanup:

  • Clears embedding cache after each document
  • Forces garbage collection every 50 documents
  • Clears CUDA cache (if using GPU)
  • Logs memory usage throughout
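
The cleanup cadence described above can be sketched like this. The torch usage is guarded so the sketch also runs on CPU-only installs; the function name and cadence constant are illustrative, not the script's actual code:

```python
import gc

def periodic_cleanup(docs_processed: int, every: int = 50) -> None:
    """Force garbage collection every `every` documents and clear the CUDA
    cache when a GPU build of torch is present (a sketch of the script's
    memory-management behaviour)."""
    if docs_processed % every != 0:
        return
    gc.collect()
    try:
        import torch  # optional dependency; skipped on CPU-only installs
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
```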

Expected memory usage:

  • Model loading: ~500MB (CPU) / ~800MB (GPU)
  • Per document: ~10-50MB spikes during processing
  • Peaks around 2-3GB with RTX 3060

Monitoring Progress

# Watch the logs
poetry run python scripts/sync/content_sync.py --recreate-collection

# You'll see output like:
# [DOC 1/156000] Постановление Правительства... (0 chars, type: xxx)
# [DOC 2/156000] Указ Президента... (0 chars, type: xxx)
# ...
# [MEMORY] After 0 documents: 1234.5 MB
# Batch complete. Total chunks: 1234

Checking Results

# Check content in database
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
  COUNT(*) as total,
  COUNT(full_text) as with_content
FROM document_content;
"

# Check Qdrant collection
curl http://localhost:6333/collections/law_chunks

Verification

MCP Tools Testing

# Build and test MCP server
npm run build
npx @modelcontextprotocol/inspector node dist/index.js

Test queries in MCP Inspector:

  1. List all codes:

    {"name": "get-code-structure"}
  2. Get specific code structure:

    {
      "name": "get-code-structure",
      "arguments": {"code_id": "TK_RF", "include_articles": true, "article_limit": 10}
    }
  3. Get specific article:

    {
      "name": "get-article-version",
      "arguments": {"code_id": "TK_RF", "article_number": "80"}
    }
  4. Search laws:

    {
      "name": "query-laws",
      "arguments": {"query": "трудовой договор", "limit": 5}
    }
  5. Get statistics:

    {"name": "get-statistics"}

Database Statistics

# Document counts by year
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
  EXTRACT(YEAR FROM publish_date) as year,
  COUNT(*) as count
FROM documents
GROUP BY EXTRACT(YEAR FROM publish_date)
ORDER BY year DESC;
"

# Content coverage
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
  COUNT(*) as total_documents,
  COUNT(dc.full_text) as with_content,
  COUNT(dc.full_text) * 100.0 / COUNT(*) as coverage_percent
FROM documents d
LEFT JOIN document_content dc ON d.id = dc.document_id;
"

# Code article counts
docker exec law7-postgres psql -U law7 -d law7 -c "
SELECT
  code_id,
  COUNT(*) as articles,
  COUNT(*) FILTER (WHERE is_current = true) as current,
  COUNT(*) FILTER (WHERE is_repealed = true) as repealed
FROM code_article_versions
GROUP BY code_id
ORDER BY code_id;
"

Troubleshooting

Issue: "value too long for type character varying(1000)"

Fix: already fixed; the documents.name column is now of type TEXT

Issue: "CUDA out of memory"

Solutions:

  • Reduce EMBEDDING_BATCH_SIZE in .env (default: 32)
  • Close other GPU applications
  • Use CPU: set EMBEDDING_DEVICE=cpu in .env

Issue: "Connection refused" errors

Fix: Ensure Docker services are running:

cd docker && docker-compose up -d

Issue: Content parsing returns empty strings

Check: the API metadata structure; some documents only have titles.
Solution: the parser uses a title || name || complexName fallback.
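
That fallback amounts to a one-liner. A sketch (the function name is illustrative, not the parser's actual API):

```python
def parse_content(metadata: dict) -> str:
    """Mirror the parser's title || name || complexName fallback:
    return the first non-empty field, or an empty string."""
    return metadata.get("title") or metadata.get("name") or metadata.get("complexName") or ""
```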

Maintenance

Daily Sync (Recommended)

Set up a cron job to sync new documents daily:

# Add to crontab (crontab -e)
0 2 * * * cd /path/to/law7 && poetry run python scripts/sync/initial_sync.py --daily

Regenerate Embeddings

If you update the embedding model:

poetry run python scripts/sync/content_sync.py --recreate-collection

Backup Database

The project includes comprehensive backup/restore scripts for PostgreSQL and Qdrant:

cd docker
./backup.sh                 # Create full backup (PostgreSQL + Qdrant)
./restore.sh law7_backup_YYYYMMDD_HHMMSS    # Restore from backup
./check-backup.sh backups/law7_backup_*.tar.gz   # Verify backup integrity

See docs/BACKUP_RESTORE.md for complete backup/restore documentation.

Quick manual backup (PostgreSQL only):

# Backup PostgreSQL
docker exec law7-postgres pg_dump -U law7 law7 > backup.sql

# Restore
docker exec -i law7-postgres psql -U law7 law7 < backup.sql

Ministry Letters Import (Phase 7C)

The data pipeline includes scrapers for official ministry letters and interpretations from Russian government agencies.

Supported Agencies

| Agency | Short Name | Source | Status |
|---|---|---|---|
| Ministry of Finance | Минфин | minfin.gov.ru | ✅ Complete |
| Federal Tax Service | ФНС | nalog.gov.ru | ✅ Complete |
| Federal Labor Service | Роструд | rostrud.gov.ru | ✅ Complete |

Import Ministry Letters

# Import from specific agency (last 5 years)
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency minfin

# Import from Rostrud
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --since 2020-01-01

# Import without date filter
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --all

# Import from all agencies
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py

# Test with limit
poetry run python scripts/country_modules/russia/import/import_ministry_letters.py --agency rostrud --limit 10

Options:

  • --agency {minfin,fns,rostrud} - Specific agency to import
  • --since YYYY-MM-DD - Only import letters after this date
  • --all - Import all letters without date filter
  • --limit N - Limit to N letters (testing)
  • --source {answers,general_documents} - Minfin source (answers or general documents)

Features:

  • Batch inserts (500 records per batch)
  • Upsert with unique constraint to avoid duplicates
  • Rate limiting (10s pause every 100 documents)
  • Resume capability for failed documents
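
A batched upsert against a unique constraint can be sketched as below. This uses sqlite3 purely for a self-contained illustration; the project stores letters in PostgreSQL, where the equivalent statement is INSERT ... ON CONFLICT, and the table and column names here are assumptions.

```python
import sqlite3

def upsert_letters(conn: sqlite3.Connection, letters: list[tuple], batch_size: int = 500) -> None:
    """Insert letters in batches of 500, updating rows that hit the
    unique (agency, number) constraint instead of duplicating them."""
    sql = """
        INSERT INTO letters (agency, number, title) VALUES (?, ?, ?)
        ON CONFLICT(agency, number) DO UPDATE SET title = excluded.title
    """
    for start in range(0, len(letters), batch_size):
        conn.executemany(sql, letters[start:start + batch_size])
        conn.commit()
```

Committing per batch keeps transactions small, so a failed run can resume from the last committed batch.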

Current Status:

  • ✅ Minfin: ~30,000 general documents + 11 Q&A topics
  • ✅ FNS: Search API integration with Actual-only filter
  • ✅ Rostrud: Documents listing with pagination (PAGEN_1)
  • ✅ Database table: official_interpretations

Court Decisions (Phase 7)

Constitutional Court (pravo.gov.ru)

# Sync Constitutional Court decisions
poetry run python scripts/sync/court_sync.py --start-date 2022-01-01 --end-date 2024-12-31

General Jurisdiction Courts (SUDRF)

Status: Selenium WebDriver implementation with Russian IP requirement

The SUDRF scraper (sudrf_scraper.py) fetches court decisions from the Russian State Automated System "Justice" (ГАС РФ "Правосудие").

Implementation:

  • Selenium WebDriver with Chrome (via webdriver-manager)
  • Official search portal: https://sudrf.ru/index.php?id=300&searchtype=sp
  • Anti-detection Chrome options (headless, no-sandbox, disable-blink-features)
  • Form submission pattern for search queries
  • Result parsing with BeautifulSoup

Current Limitations:

  1. Russian IP Required 🇷🇺

    • SUDRF blocks access from outside Russia
    • Error: "недоступна" (unavailable)
    • Solution: Use Russian VPS (Yandex Cloud, Selectel)
  2. Strong Anti-Bot Protection

    • Browser fingerprinting detects automation
    • Even Selenium with anti-detection measures may be blocked
    • May require CAPTCHA solving
  3. Geographic Restrictions

    • Blocks non-Russian IP addresses at infrastructure level
    • Cannot bypass with headers alone

Production Recommendations:

| Approach | Description | Complexity |
|---|---|---|
| Russian Server | Run scraper from Russian VPS/proxy | Medium |
| Commercial API | parser-api.com/sudrf, api-assist.com/api/sudrf | Low |
| Regional Portals | Use individual court websites (less restricted) | Medium |
| Official Access | Institutional API access from SUDRF | High |

Testing (from Russian IP only):

# Test SUDRF scraper (requires Russian IP)
poetry run python scripts/tests/test_sudrf_scraper.py

# Expected output: Found N decisions
# If blocked: "недоступна - ошибка 403" (unavailable - error 403)

Code Reference:

Database Table: court_decisions


Performance Tips

  1. Use GPU for embeddings - 10x faster than CPU
  2. Batch size: 30 is API max, don't increase
  3. Embedding batch size: 32 works well for RTX 3060 12GB
  4. Skip documents >100KB - Automatically skipped to prevent timeouts
  5. Memory monitoring - Script logs usage every 50 docs

File Reference

| Script | Purpose |
|---|---|
| scripts/sync/initial_sync.py | Fetch document metadata |
| scripts/sync/content_sync.py | Parse content + generate embeddings |
| scripts/sync/fetch_amendment_content.py | Fetch detailed amendment text |
| scripts/sync/court_sync.py | Fetch court decisions from pravo.gov.ru |
| scripts/import/import_base_code.py | Import base legal codes |
| scripts/country_modules/russia/import/import_ministry_letters.py | Import ministry letters (Minfin, FNS, Rostrud) |
| scripts/crawler/pravo_api_client.py | API client for pravo.gov.ru |
| scripts/country_modules/russia/scrapers/ministry_scraper.py | Scraper for ministry letters (Phase 7C) |
| scripts/country_modules/russia/scrapers/sudrf_scraper.py | Scraper for SUDRF general jurisdiction courts |
| scripts/parser/html_parser.py | Parse content from API metadata |
| scripts/parser/court_decision_parser.py | Parse court decisions and extract legal citations |
| scripts/indexer/embeddings.py | Generate embeddings with deepvk/USER2-base |
| scripts/indexer/qdrant_indexer.py | Store embeddings in Qdrant |
| docker/backup.sh | Create database backup (PostgreSQL + Qdrant) |
| docker/restore.sh | Restore database from backup |
| docker/check-backup.sh | Verify backup integrity |