Execcomp-AI

AI-powered pipeline to extract executive compensation data from SEC DEF 14A proxy statements.

📊 Current Progress & Statistics

🚧 Work in Progress - Processing 100K+ SEC filings

[Stats images: pipeline stats, compensation stats, top 10, document breakdown]

👉 Dataset: pierjoe/execcomp-ai-sample


Why This Project?

Executive compensation data from SEC filings is valuable for academic research, corporate governance analysis, and market studies. However, extracting this information at scale is surprisingly difficult:

| Challenge | Description |
| --- | --- |
| Unstructured documents | DEF 14A proxy statements are filed as HTML or plain text, with compensation tables embedded among hundreds of pages of legal text, footnotes, and varying formats |
| Format changes over time | SEC disclosure rules changed in 2006 — pre-2006 tables have different column names ("Securities Underlying Options" vs "Option Awards"), different structures, and often lack a Total column |
| Tables break across pages | A single Summary Compensation Table often spans multiple pages, and PDF parsers extract them as separate fragments that need to be intelligently merged |
| Similar tables cause confusion | Each proxy contains multiple compensation-related tables (director compensation, equity grants, pension benefits) that look similar to the Summary Compensation Table but contain different data |
| No clean dataset exists | Services like ExecuComp provide curated data but are expensive and limited in coverage. Raw SEC filings are free but require significant processing |

This project automates the entire extraction pipeline using vision-language models, producing structured JSON from raw filings with minimal manual intervention.


Overview

Schema

Extracts Summary Compensation Tables from 100K+ SEC filings (2005-2022) using:

  • MinerU for PDF table extraction (images + HTML)
  • Qwen3-VL-32B for classification and structured extraction
  • Qwen3-VL-4B (fine-tuned) for post-processing false positive filtering
SEC Filing → PDF → MinerU → VLM Classification → Extraction → Post-Processing → HF Dataset
                            (Qwen3-32B)         (Qwen3-32B)   (Qwen3-4B)

Key Features

| Challenge | Solution |
| --- | --- |
| Tables split across pages | Merge based on is_header_only flag + bbox proximity |
| Pre-2006 vs post-2006 formats | Column mapping with synonyms |
| Funds (no exec comp) | Auto-skip when SIC = NULL |
| Resume after interruption | Central tracker + skip processed docs |
| Parallel processing | 3-level: MinerU, classification, extraction |
| Status tracking | pipeline_tracker.json as single source of truth |

Quick Start

1. Installation

pip install -r requirements.txt
sudo apt-get install wkhtmltopdf

Requirements:

  • Python 3.10+
  • GPU with 40GB+ VRAM (or adjust tensor parallelism)
  • Key dependencies: vllm, openai, aiohttp, datasets, huggingface_hub, pdfkit

2. Start Servers

# GPU 0,1: Qwen3-VL for classification/extraction
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3-VL-32B-Instruct \
    --tensor-parallel-size 2 --port 8000 --max-model-len 32768

# GPU 2,3: MinerU for PDF processing  
CUDA_VISIBLE_DEVICES=2,3 mineru-openai-server \
    --engine vllm --port 30000 --tensor-parallel-size 2

3. Run Pipeline

# Show current status
python scripts/pipeline.py

# Process up to 1000 documents total
python scripts/pipeline.py 1000

To continue processing pending documents, edit CONTINUE_MODE = True in the script.

4. Post-Processing & Upload

# Build dataset with stats and threshold analysis, and save locally
python scripts/post_processing.py

# Build and push to HuggingFace
python scripts/post_processing.py --push

Configuration in script:

  • RUN_ANALYSIS = True - Generate stats images
  • RUN_THRESHOLD_ANALYSIS = True - Analyze optimal threshold
  • SCT_PROBABILITY_THRESHOLD = None - Filter threshold (None = keep all)

Example

Input: CAMPBELL SOUP DEF 14A 2019

Extracted Table:

[Image: extracted Summary Compensation Table]

Output JSON:

[...
{
    "name": "Luca Mignini",
    "title": "Former Executive Vice President - Strategic Initiatives",
    "fiscal_year": 2018,
    "salary": 747433,
    "bonus": 50000,
    "stock_awards": 1731729,
    "option_awards": 363899,
    "non_equity_incentive": 0,
    "change_in_pension": 0,
    "other_compensation": 228655,
    "total": 3121716
}
...
]
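For downstream use, each row above maps naturally onto a small record type. This is a convenience sketch (not a class shipped with the repo), with a sanity check that the components add up to the reported total:

```python
from dataclasses import dataclass

@dataclass
class CompensationRecord:
    """One executive-year row, mirroring the JSON output above."""
    name: str
    title: str
    fiscal_year: int
    salary: int
    bonus: int
    stock_awards: int
    option_awards: int
    non_equity_incentive: int
    change_in_pension: int
    other_compensation: int
    total: int

    def components_sum(self) -> int:
        # Sanity check: the components should equal the reported total.
        return (self.salary + self.bonus + self.stock_awards
                + self.option_awards + self.non_equity_incentive
                + self.change_in_pension + self.other_compensation)
```

For the example row above, `components_sum()` returns 3,121,716, matching the reported total.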

Project Structure

stuff/
├── scripts/
│   ├── pipeline.py          # Main extraction pipeline
│   ├── post_processing.py   # Build HF dataset with analysis
│   └── fix_pending.py       # Find and fix pending documents
├── src/
│   ├── vlm/                 # VLM classification & extraction
│   ├── processing/          # PDF conversion, MinerU, table extraction
│   ├── io/                  # Results saving & visualization
│   ├── tracking/            # Pipeline tracker (central status)
│   └── analysis/            # Stats, charts, threshold analysis
├── notebooks/
│   └── pipeline.ipynb       # Interactive development
├── data/
│   └── DEF14A_all.jsonl     # Filing metadata (local)
├── pipeline_tracker.json    # Central tracking file
├── output/                  # Processed results per document
└── pdfs/                    # Downloaded PDF files

Output Structure

output/{cik}_{year}_{accession}/
├── metadata.json               # Document metadata
├── extraction_results.json     # ✅ Extracted compensation data
├── classification_results.json # Table classifications
├── no_sct_found.json           # (if no SCT found)
└── {doc_id}/
    ├── *_content_list.json     # MinerU parse results
    └── vlm/                    # Table images
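Gathering all extracted rows back out of this layout takes only a few lines. A minimal sketch (the function name is made up; it assumes the per-document layout shown above):

```python
import json
from pathlib import Path

def collect_results(output_dir: str = "output"):
    """Yield extracted rows from output/{cik}_{year}_{accession}/ dirs."""
    for doc_dir in Path(output_dir).iterdir():
        results_file = doc_dir / "extraction_results.json"
        if not results_file.exists():
            continue  # no_sct / fund / pending documents have no extraction
        meta = json.loads((doc_dir / "metadata.json").read_text())
        for row in json.loads(results_file.read_text()):
            # Attach document metadata so rows are self-describing.
            yield {**row, "cik": meta.get("cik"), "year": meta.get("year")}
```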

Scripts Reference

pipeline.py - Main Pipeline

# Show status only
python scripts/pipeline.py

# Process N documents total (adds new if needed)
python scripts/pipeline.py 10000

To process pending documents, edit CONTINUE_MODE = True in the script.

The pipeline tracks all documents in pipeline_tracker.json:

  • Phases: pdf_created → mineru_done → classified → extracted
  • Status: complete, no_sct, fund, pending

post_processing.py - Build Final Dataset

Builds the HuggingFace dataset with sct_probability scores from a fine-tuned binary classifier. This is the single entry point for dataset creation: it runs the analysis, optimizes the threshold, and uploads.

# Build dataset locally (with full analysis output)
python scripts/post_processing.py

# Build and push to HuggingFace (includes README + images)
python scripts/post_processing.py --push

Configuration (edit in script):

  • CLASSIFIER_MODEL_PATH - Path to fine-tuned classifier
  • CLASSIFIER_DEVICE - GPU device (e.g., "cuda:0")
  • SCT_PROBABILITY_THRESHOLD - Filter threshold (None = keep all)
  • RUN_ANALYSIS - Generate pipeline stats and charts
  • RUN_THRESHOLD_ANALYSIS - Find optimal threshold for single-SCT
  • HF_REPO - HuggingFace repository name

Outputs:

  • Stats images in docs/ (pipeline, compensation, charts)
  • Threshold analysis plot (docs/analysis_threshold.png)
  • Recommended threshold printed to console
  • Dataset saved locally and pushed to HF (with --push)

fix_pending.py - Fix Failed Documents

# Show pending documents
python scripts/fix_pending.py

# Delete and reprocess pending
python scripts/fix_pending.py --fix

Categories:

  • No PDF: HTML download failed
  • No MinerU: MinerU processing failed
  • Not classified: Has tables but VLM classification failed

Analysis Module

Stats and threshold analysis are in src/analysis/ and called by post_processing.py.

Generated images in docs/:

  • stats_pipeline.png - Pipeline statistics
  • stats_compensation.png - Compensation statistics
  • stats_top10.png - Top 10 highest paid executives
  • stats_breakdown.png - Compensation breakdown by component
  • chart_pipeline.png - Document breakdown pie charts
  • chart_by_year.png - Tables by year
  • chart_distribution.png - Compensation distribution
  • chart_trends.png - Trends over time
  • analysis_threshold.png - Threshold optimization plot

Output includes:

  • sct_probability: Float 0-1, probability that table is a real SCT
  • Statistics on how many duplicates the classifier can disambiguate

Central Tracker

All pipeline status is stored in pipeline_tracker.json:

{
  "last_updated": "2026-01-03T12:00:00",
  "documents": {
    "1002037_2016_0001437749-16-024320": {
      "cik": "1002037",
      "company_name": "ACME Corp",
      "year": 2016,
      "sic": "1234",
      "phases": {
        "pdf_created": "2026-01-01T10:00:00",
        "mineru_done": "2026-01-01T10:05:00",
        "classified": "2026-01-02T15:30:00",
        "extracted": "2026-01-02T15:31:00"
      },
      "status": "complete",
      "sct_tables": ["images/table_15.jpg"]
    }
  }
}
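The per-status counts printed by pipeline.py can be reproduced from this file in a few lines. A sketch, assuming the structure shown above:

```python
import json
from collections import Counter

def tracker_status_counts(path: str = "pipeline_tracker.json") -> Counter:
    """Count documents by status (complete / no_sct / fund / pending)."""
    with open(path) as f:
        docs = json.load(f)["documents"]
    return Counter(d["status"] for d in docs.values())
```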

To rebuild tracker from files:

from src.tracking import Tracker
tracker = Tracker()
tracker.rebuild_from_files()

Configuration

Edit variables at the top of scripts/pipeline.py:

SEED = 42424242                   # Random seed for reproducibility
VLM_BASE_URL = "http://localhost:8000/v1"
VLM_MODEL = "Qwen/Qwen3-VL-32B-Instruct"

MINERU_MAX_CONCURRENT = 8         # Concurrent MinerU processes
DOC_MAX_CONCURRENT = 16           # Concurrent document processing

Check Status

python scripts/pipeline.py

Output:

==================================================
PIPELINE TRACKER STATUS
==================================================
Total documents: 8,015

By status:
  Complete (with SCT): 6,108
  No SCT found:        430
  Funds (skipped):     1,477
  Pending:             0

By phase completed:
  [1] PDF created:     8,015
  [2] MinerU done:     8,015
  [3] VLM processed:   6,538
      → Found SCT:     6,108
      → No SCT:        430
==================================================

OpenAI Compatible

The pipeline uses OpenAI-compatible APIs for classification and extraction. You can swap local models with cloud APIs:

# Local (vLLM)
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# OpenAI / Azure / Anthropic / any OpenAI-compatible endpoint
client = AsyncOpenAI(api_key="sk-...")
MODEL = "gpt-4o"  # or any vision model

Only MinerU requires local GPU for PDF table extraction. Everything else works with cloud APIs.
