A comprehensive Python toolkit for extracting structured data from unstructured text using language models. DELM provides a configurable, scalable pipeline with built-in cost tracking, caching, and evaluation capabilities.
- Multi-format Support: TXT, HTML, MD, DOCX, PDF, CSV, Excel, Parquet, Feather
- Progressive Schema System: Simple → Nested → Multiple schemas for any complexity
- Multi-Provider Support: OpenAI, Anthropic, Google, Groq, Together AI, Fireworks AI
- Smart Processing: Configurable text splitting, relevance scoring, and filtering
- Cost Optimization: Built-in cost tracking, caching, and budget management
- Batch Processing: Parallel execution with checkpointing and resume capabilities
- Comprehensive Evaluation: Performance metrics and cost analysis tools
```bash
# Clone the repository
git clone https://github.com/your-org/delm.git
cd delm

# Install from source
pip install -e .
```
```python
from pathlib import Path
from delm import DELM, DELMConfig

# Load configuration from YAML
config = DELMConfig.from_yaml("example.config.yaml")

# Initialize DELM
delm = DELM(
    config=config,
    experiment_name="my_experiment",
    experiment_directory=Path("experiments"),
)

# Process data
df = delm.prep_data("data/input.txt")
results = delm.process_via_llm()

# Get results
final_df = delm.get_extraction_results_df()
cost_summary = delm.get_cost_summary()
```
DELM uses two configuration files:
1. Pipeline Configuration (`config.yaml`)
```yaml
llm_extraction:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.0
  batch_size: 10
  track_cost: true
  max_budget: 50.0

data_preprocessing:
  target_column: "text"
  splitting:
    type: "ParagraphSplit"
  scoring:
    type: "KeywordScorer"
    keywords: ["price", "forecast", "guidance"]

schema:
  spec_path: "schema_spec.yaml"
```
2. Schema Specification (`schema_spec.yaml`)
schema_type: "nested"
container_name: "commodities"
variables:
- name: "commodity_type"
description: "Type of commodity mentioned"
data_type: "string"
required: true
allowed_values: ["oil", "gas", "copper", "gold"]
- name: "price_value"
description: "Price mentioned in text"
data_type: "number"
required: false
DELM supports three levels of schema complexity:
Extract key-value pairs from each text chunk:
schema_type: "simple"
variables:
- name: "price"
description: "Price mentioned"
data_type: "number"
- name: "company"
description: "Company name"
data_type: "string"
Extract structured objects with multiple fields:
schema_type: "nested"
container_name: "commodities"
variables:
- name: "type"
description: "Commodity type"
data_type: "string"
- name: "price"
description: "Price value"
data_type: "number"
Extract multiple independent schemas simultaneously:
schema_type: "multiple"
commodities:
schema_type: "nested"
container_name: "commodities"
variables: [...]
companies:
schema_type: "nested"
container_name: "companies"
variables: [...]
- Earnings call transcript analysis
- Commodity price forecasting
- Financial report parsing
- Market sentiment analysis
- Academic paper analysis
- Survey response processing
- Interview transcript coding
- Literature review automation
- Customer feedback analysis
- Product review extraction
- Competitor analysis
- Market research automation
| Type | Description | Example |
|---|---|---|
| `string` | Text values | `"Apple Inc."` |
| `number` | Floating-point numbers | `150.5` |
| `integer` | Whole numbers | `2024` |
| `boolean` | True/False values | `true` |
| `[string]` | List of strings | `["oil", "gas"]` |
| `[number]` | List of numbers | `[100, 200, 300]` |
| `[integer]` | List of integers | `[1, 2, 3, 4]` |
| `[boolean]` | List of booleans | `[true, false, true]` |
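List types are declared the same way as scalar types in a schema spec; only the `data_type` changes. A minimal sketch (the `tickers` field is hypothetical, for illustration):

```yaml
variables:
  - name: "tickers"                  # hypothetical field
    description: "Stock tickers mentioned in the chunk"
    data_type: "[string]"            # list of strings
    required: false
```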
```python
# Get cost summary after extraction
cost_summary = delm.get_cost_summary()
print(f"Total cost: ${cost_summary['total_cost_usd']}")
```
Reuses API responses from identical calls, so re-running an experiment does not spend API credits on requests that have already been made.
```yaml
semantic_cache:
  backend: "sqlite"  # sqlite, lmdb, filesystem
  path: ".delm_cache"
  max_size_mb: 512
```
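For example, running a pipeline once fills the cache, and a later re-run of the same pipeline is served from it (a minimal sketch using the quick-start API above; it assumes the semantic cache is enabled in `example.config.yaml` and that identical calls are matched across runs):

```python
from pathlib import Path
from delm import DELM, DELMConfig

# Assumes semantic_cache settings (backend, path) are set in this YAML
config = DELMConfig.from_yaml("example.config.yaml")

delm = DELM(
    config=config,
    experiment_name="cached_run",
    experiment_directory=Path("experiments"),
)
delm.prep_data("data/input.txt")
delm.process_via_llm()  # first run: hits the provider API and fills the cache

# Re-running the same pipeline later issues identical calls,
# which are answered from the cache instead of spending API credits again.
```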
```yaml
data_preprocessing:
  scoring:
    type: "KeywordScorer"
    keywords: ["price", "forecast", "guidance"]
  pandas_score_filter: "delm_score >= 0.7"
```
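Here the keyword scorer writes its relevance score to a `delm_score` column, and the `pandas_score_filter` expression keeps only chunks scoring at or above 0.7 for LLM extraction.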
```yaml
data_preprocessing:
  splitting:
    type: "ParagraphSplit"      # Split by paragraphs
    # type: "FixedWindowSplit"  # Split by sentence count
    # window: 5
    # stride: 2
    # type: "RegexSplit"        # Custom regex pattern
    # pattern: "\n\n"
```
Estimate the total cost of your current configuration before running the full extraction.
```python
from delm.utils.cost_estimation import estimate_input_token_cost, estimate_total_cost

# Estimate input token costs without API calls
input_cost = estimate_input_token_cost(
    config="config.yaml",
    data_source="data.csv"
)
print(f"Input token cost: ${input_cost:.2f}")

# Estimate total costs using API calls on a sample
total_cost = estimate_total_cost(
    config="config.yaml",
    data_source="data.csv",
    sample_size=100
)
print(f"Estimated total cost: ${total_cost:.2f}")
```
Estimate the performance of your current configuration before running the full extraction.
```python
from delm.utils.performance_estimation import estimate_performance

# Evaluate against human-labeled data
metrics, expected_and_extracted_df = estimate_performance(
    config="config.yaml",
    data_source="test_data.csv",
    expected_extraction_output_df=human_labeled_df,
    true_json_column="expected_json",
    matching_id_column="id",
    record_sample_size=50  # Optional: limit sample size
)

# Display performance metrics
for key, value in metrics.items():
    precision = value.get("precision", 0)
    recall = value.get("recall", 0)
    f1 = value.get("f1", 0)
    print(f"{key:<30} Precision: {precision:.3f} Recall: {recall:.3f} F1: {f1:.3f}")
```
```python
# Extract commodity prices from earnings calls
config = DELMConfig.from_yaml("examples/commodity_schema.yaml")
delm = DELM(config=config, experiment_name="commodity_extraction")

# Process earnings call transcripts
df = delm.prep_data("earnings_calls.csv")
extraction_results = delm.process_via_llm()
```
- `llm_extraction.provider`: LLM provider (openai, anthropic, google, etc.)
- `llm_extraction.name`: Model name (gpt-4o-mini, claude-3-sonnet, etc.)
- `schema.spec_path`: Path to schema specification file
- `llm_extraction.temperature`: 0.0 (deterministic)
- `llm_extraction.batch_size`: 10 (records per batch)
- `llm_extraction.max_workers`: 1 (concurrent workers)
- `llm_extraction.track_cost`: true (cost tracking)
- `semantic_cache.backend`: "sqlite" (cache backend)
- DataProcessor: Handles loading, splitting, and scoring
- SchemaManager: Manages schema loading and validation
- ExtractionManager: Orchestrates LLM extraction
- ExperimentManager: Handles experiment state and checkpointing
- CostTracker: Monitors API costs and budgets
- SplitStrategy: Text chunking (Paragraph, FixedWindow, Regex)
- RelevanceScorer: Content scoring (Keyword, Fuzzy)
- SchemaRegistry: Schema type management
- estimate_input_token_cost: Estimate input token costs without API calls
- estimate_total_cost: Estimate total costs using API calls on a sample of the data
- estimate_performance: Evaluate extraction performance against human-labeled data
| Format | Extension | Requirements |
|---|---|---|
| Text | `.txt` | Built-in |
| HTML/Markdown | `.html`, `.md` | `beautifulsoup4` |
| Word Documents | `.docx` | `python-docx` |
| PDF | `.pdf` | `marker` (OCR) |
| CSV | `.csv` | `pandas` |
| Excel | `.xlsx` | `openpyxl` |
| Parquet | `.parquet` | `pyarrow` |
| Feather | `.feather` | `pyarrow` |
- Schema Reference - Detailed schema configuration guide
- Configuration Examples - Complete configuration templates
- Schema Examples - Schema specification templates
- Built on Instructor for structured outputs
- Uses Marker for PDF processing
- Developed at the Center for Applied AI at Chicago Booth
DELM v0.2.0 - Making data extraction with LLMs accessible, reliable, and cost-effective.