Tokenizer-free, span-aware encoder architecture inspired by X-bar theory
X-Spanformer replaces static tokenization with adaptive, corpus-driven vocabulary induction via expectation-maximization and Viterbi approximation. The system learns to segment and encode overlapping spans directly from raw UTF-8 input—code, natural language, or hybrid text—using factorized pointer networks and controller fusion mechanisms for downstream transformer integration.
- Tokenizer-Free Architecture – adaptive vocabulary induction via hybrid Unigram-LM with EM and Viterbi approximation
- Factorized Pointer Networks – independent span boundary prediction for start and end positions
- Multi-Domain Compositional Spans – supports code, prose, mixed content with X-bar hierarchical structure
- Controller Fusion Integration – multiple injection pathways (bias, prefix, gated-FFN) for downstream transformers
- ONNX-Native Implementation – fully differentiable with linear-time convolutional contextual encoding
- Chunk-Based Storage System – efficient compressed storage with automatic validation and resume capabilities
- Comprehensive Integrity Checking – ensures no missing sequences with gap detection and repair
- Fast Sequence Introspection – millisecond loading of individual sequences for analysis and debugging
Training examples follow this schema:
```json
{
  "input": ["The", " ", "quick", " ", "brown", " ", "fox", "."],
  "type": "natural_language",
  "span_labels": [
    { "span": [0, 0], "label": "determiner", "role": "specifier", "text": "The" },
    { "span": [2, 4], "label": "adjective_phrase", "role": "modifier", "text": "quick brown" },
    { "span": [6, 6], "label": "noun", "role": "subject", "text": "fox" }
  ]
}
```

For full details, see `/examples` and the companion compiler agent `oxbar`.
Our preprocessing pipeline consists of two main stages:
To generate semantically coherent pretraining data without tokenizers, we use the pdf2seg package:
```bash
pip install pdf2seg
```

Process scanned or structured PDFs into entropy-minimized text spans:

```bash
# Generate JSONL segments from PDFs
uv run -m x_spanformer.pipelines.pdf2jsonl \
  -i input_pdfs/ \
  -o data/pretraining/out \
  --name pretraining
```

Generate a hybrid Unigram-LM vocabulary from the JSONL segments using the Adaptive Unigram-LM Vocabulary Induction algorithm (Section 3.1):
```bash
# Induce vocabulary from JSONL segments
uv run -m x_spanformer.pipelines.jsonl2vocab \
  -i data/pretraining/out \
  -o data/vocab/out
```

This implements the mathematical formulation from Section 3.1 of our paper, featuring:
- EM with Viterbi approximation for optimal segmentation with adaptive pruning based on perplexity and OOV thresholds
- Whitespace-aware tokenization ensuring strict separation between whitespace and content tokens
- Frequency-based candidate filtering retaining top-M substrings up to maximum length L_max
- Statistical validation with baseline/final perplexity comparison and coverage metrics
- Schema-validated vocabulary pieces using `VocabPiece` and `VocabStats` models
- Multi-stage artifact generation for transparency and debugging
- Consolidated corpus output (`corpus.jsonl`) ready for downstream vocab2embedding processing
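The Viterbi approximation behind the first bullet can be illustrated with a minimal sketch (toy vocabulary, no pruning; the pipeline's actual implementation adds adaptive perplexity/OOV thresholds and whitespace handling):

```python
import math

def viterbi_segment(text, vocab):
    """Single best segmentation of `text` into vocab pieces by summed
    log-probability (the Viterbi approximation to the unigram-LM EM step).
    Toy sketch only, not the pipeline code."""
    T = len(text)
    best = [-math.inf] * (T + 1)   # best[t]: best log-prob of segmenting text[:t]
    back = [None] * (T + 1)        # back[t]: (prev_pos, piece) achieving best[t]
    best[0] = 0.0
    for t in range(T):
        if best[t] == -math.inf:
            continue
        for u, p in vocab.items():
            end = t + len(u)
            if end <= T and text.startswith(u, t) and best[t] + math.log(p) > best[end]:
                best[end] = best[t] + math.log(p)
                back[end] = (t, u)
    pieces, t = [], T              # walk back-pointers to recover the best path
    while t > 0:
        t, u = back[t]
        pieces.append(u)
    return pieces[::-1]

vocab = {"un": 0.2, "igram": 0.1, "unigram": 0.05, "u": 0.1, "n": 0.1}
# "unigram" as one piece scores log 0.05 ≈ -3.00, beating "un"+"igram" ≈ -3.91
```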
Transform vocabulary into contextualized embeddings and span candidates using Section 3.2 algorithms:
```bash
# Generate embeddings from vocabulary and text sequences
uv run -m x_spanformer.pipelines.vocab2embedding \
  --vocab data/vocab/out/vocab.jsonl \
  --input data/vocab/out/corpus.jsonl \
  --output data/embedding \
  --config config/pipelines/vocab2embedding.yaml

# Parallel processing with multiple workers for high throughput
uv run -m x_spanformer.pipelines.vocab2embedding \
  --vocab data/vocab/out/vocab.jsonl \
  --input data/vocab/out/corpus.jsonl \
  --output data/embedding \
  --workers 4 \
  --config config/pipelines/vocab2embedding.yaml
```

This implements the unified algorithm from Section 3.2, featuring:
- Forward-backward soft probability computation P[t,i] = (α_t × p(u_i) × β_{t+|u_i|}) / α_{T+1} for probabilistic piece assignment
- Vocabulary-aware Xavier initialization with probability-adjusted embedding variance for rare pieces
- Multi-scale dilated convolutional encoding for contextual representation (kernels [3,5,7], dilations [1,2,4])
- Factorized span candidate generation with vocabulary-informed filtering based on alignment and compositional potential
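The forward-backward computation in the first bullet can be sketched directly from P[t,i] = (α_t × p(u_i) × β_{t+|u_i|}) / α_{T+1}. A toy version (no log-space arithmetic, pruning, or whitespace handling):

```python
def soft_piece_probs(text, vocab):
    """Posterior P[(t, i)] that vocab piece u_i starts at position t, via
    P[t,i] = alpha_t * p(u_i) * beta_{t+|u_i|} / alpha_{T+1}.
    `vocab` maps piece -> probability; illustrative, not the pipeline code."""
    T = len(text)
    pieces = list(vocab.items())
    alpha = [0.0] * (T + 1)        # alpha[t]: probability mass of segmenting text[:t]
    alpha[0] = 1.0
    for t in range(T):
        if alpha[t]:
            for u, p in pieces:
                if text.startswith(u, t):
                    alpha[t + len(u)] += alpha[t] * p
    beta = [0.0] * (T + 1)         # beta[t]: probability mass of segmenting text[t:]
    beta[T] = 1.0
    for t in range(T - 1, -1, -1):
        for u, p in pieces:
            if text.startswith(u, t):
                beta[t] += p * beta[t + len(u)]
    Z = alpha[T]                   # alpha_{T+1} in the paper's 1-indexed notation
    post = {(t, i): alpha[t] * p * beta[t + len(u)] / Z
            for t in range(T)
            for i, (u, p) in enumerate(pieces)
            if text.startswith(u, t)}
    return post, Z

post, Z = soft_piece_probs("ab", {"ab": 0.4, "a": 0.3, "b": 0.3})
# Z = p("ab") + p("a")p("b") = 0.49; posterior for "ab" at position 0 is 0.4/0.49
```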
Output Structure (Chunk-Based):
data/embedding/
├── chunks/ # Compressed chunk storage
│ ├── embeddings_000001.npz # Sequences 1-100
│ ├── embeddings_000002.npz # Sequences 101-200
│ └── embeddings_000052.npz # Final chunk (partial)
├── metadata.json # Global metadata and chunk information
└── embedding.log # Processing log with stage-by-stage validation
Key Features:
- Automatic Resume: Validates existing chunks and continues from where processing left off
- Final Integrity Verification: Ensures all sequences are processed correctly with comprehensive gap detection
- Efficient Analysis Tools: Sequence introspector with fast single-sequence loading from chunks
- Performance Optimization: Optional components can be disabled for storage efficiency
Analyze processed embeddings with the integrated sequence introspector:
```bash
# Basic sequence analysis
uv run -m x_spanformer.embedding.sequence_introspector \
  --id 1 --output data/embedding

# Detailed statistical analysis with span coverage
uv run -m x_spanformer.embedding.sequence_introspector \
  --id 5 --output data/embedding --analyze

# Check total processed sequences
uv run -m x_spanformer.embedding.sequence_introspector \
  --id 1 --output data/embedding --list-total

# Verbose output (complete arrays)
uv run -m x_spanformer.embedding.sequence_introspector \
  --id 10 --output data/embedding --analyze --verbose
```

The introspector loads individual sequences from chunk files without decompressing entire chunks, providing:
- Fast Single-Sequence Loading: Loads specific sequences from compressed chunks in milliseconds
- Comprehensive Analysis: Embedding quality metrics, span coverage statistics, and array shape validation
- Chunk Storage Information: Storage efficiency, compression ratios, and chunk contribution estimates
- Statistical Insights: Mean/std analysis, sparsity detection, and span length distribution
The pipeline outputs both vocab.jsonl (final vocabulary with probabilities) and vocab_stats.json (comprehensive training statistics), enabling detailed analysis of the vocabulary induction process.
All pipelines utilize shared utilities from x_spanformer.pipelines.shared for consistent text processing and schema validation, eliminating code duplication and ensuring data format consistency across the preprocessing workflow.
Use the output as either raw training strings (for unsupervised Phase I) or compile with oxbar to produce labeled span records.
This enables X-Spanformer to bootstrap span boundaries from real-world documents with high structural signal, without relying on brittle tokenization.
Generate X-bar span annotations for supervised training of the factorized pointer network boundary predictor (Section 3.3):
```bash
# Annotate sequences for span boundary training with improved logging
uv run -m x_spanformer.pipelines.span_annotator \
  --corpus data/vocab/corpus.jsonl \
  --output data/annotations \
  --range 1-100 \
  --model llama3.2:3b \
  --temperature 0.2 \
  --timeout 180.0 \
  --verbose

# Parallel processing for large-scale annotation
uv run -m x_spanformer.pipelines.span_annotator \
  --corpus data/vocab/corpus.jsonl \
  --output data/annotations \
  --range 1-1000 &
uv run -m x_spanformer.pipelines.span_annotator \
  --corpus data/vocab/corpus.jsonl \
  --output data/annotations \
  --range 1001-2000 &
```

This implements production-grade agentic X-bar span annotation with enhanced robustness, featuring:
- Enhanced JSON Parsing Robustness: Handles truncated LLM responses, malformed JSON, and case-insensitive matching
- Advanced Label Cleaning System: Comprehensive word span validation with pattern-based filtering
- Independent Boundary Prediction: Generates training targets for start/end position classification using factorized linear heads
- X-bar Hierarchical Structure: Domain-specific classifier extraction based on linguistic phrase structure theory
- Intelligent Logging: Aggregated counts replace repetitive debug messages for cleaner output
- Word Span Validation: Supports percentages, abbreviations, expressions, and complex patterns
- Position-wise Binary Classification: Creates sigmoid-normalized boundary probabilities for BCE loss training
- Multi-label Span Support: Handles overlapping spans at different hierarchical levels (word → phrase → clause)
- Production Validation: Zero position errors across 1,703 spans in 56 sequences (August 2025)
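The independent boundary prediction and BCE training targets described above can be sketched in a few lines (plain weight vectors stand in for the learned linear heads over the contextual embeddings H[t]):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def boundary_probs(H, w_start, w_end):
    """Factorized boundary prediction sketch: two independent linear heads
    score every position of the contextual embedding matrix H (T x d) as a
    span start or end. Illustrative; the model learns these as linear layers."""
    return sigmoid(H @ w_start), sigmoid(H @ w_end)

def bce_loss(p, y, eps=1e-9):
    """Position-wise binary cross-entropy against 0/1 boundary targets,
    matching the multi-label setup where overlapping spans are allowed."""
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
```

Because the two heads are independent, start and end probabilities are computed in O(T) rather than scoring all O(T²) span pairs.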
Production Results (August 2025):
- 60,558 clean annotations from 61,053 original spans (99.2% retention rate)
- 495 invalid word spans automatically filtered using pattern-based validation
- 352 labels mapped from invalid to valid categories with aggregated logging
- Zero validation errors in position encoding and text extraction
- Perfect alignment with factorized pointer network requirements
- Enhanced reliability with automatic recovery from LLM response issues
Output Structure (Individual Working Files):
data/annotations/
├── working/ # Individual sequence annotations
│ ├── sequence-00000001.json # Working file for sequence 1
│ └── sequence-00000056.json # Working file for sequence 56 (latest)
├── consolidated/ # Final training data
│ └── annotations.jsonl # All successful annotations
├── metadata.json # Global progress and statistics
└── annotations.log # Processing log with enhanced error handling
Key Features:
- Enhanced JSON Parsing: Robust handling of truncated LLM responses and malformed JSON
- Advanced Label Cleaning: Comprehensive word span validation with pattern-based filtering
- Bidirectional Context: Built on X-Spanformer's position-wise embedding architecture where each H[t] contains bidirectional contextual information
- Boundary Detection Training: Generates binary targets for start/end position prediction (not span-level embeddings)
- Multi-label Support: BCE loss handles overlapping spans at different hierarchical levels
- Production Validation: Zero position or text extraction errors across all generated spans
- Intelligent Logging: Aggregated counts replace repetitive debug messages for cleaner output
Recent Enhancements (August 2025):
- Advanced Label Cleaning: Pattern-based word span validation supporting percentages, abbreviations, and expressions
- Intelligent Logging System: Aggregated counts replace thousands of repetitive debug messages
- Enhanced Word Span Patterns: Support for decimals ("3.14"), percentages ("2.7%"), abbreviations ("Dr."), expressions ("[83]", "(t)", "|s|")
- Production Cleaning Results: 99.2% retention rate with 495 spans filtered and 352 labels mapped
- Logging Optimization: `Selected 1000 sequences (1 to 1000) out of 1000 requested` instead of massive sequence lists
- Performance: Reduced I/O overhead and log file size while maintaining essential debugging information
- Scalability: Handles large sequence ranges without log bloat or memory issues
- Error Recovery: Enhanced JSON parsing robustness with automatic truncation detection and malformed JSON recovery
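The word-span patterns listed above (decimals, percentages, abbreviations, bracketed expressions) can be checked with a small pattern set. The regexes below are illustrative stand-ins, not the production filter:

```python
import re

# Hypothetical validator sketch covering the pattern families named above.
# The production pattern set is richer; these regexes are illustrative only.
WORD_SPAN_PATTERNS = [
    re.compile(r"^\w+$"),                      # plain words / identifiers
    re.compile(r"^\d+(\.\d+)?%?$"),            # decimals and percentages: 3.14, 2.7%
    re.compile(r"^[A-Z][a-z]{1,3}\.$"),        # abbreviations: Dr., Fig.
    re.compile(r"^[\[\(\|].{1,16}[\]\)\|]$"),  # expressions: [83], (t), |s|
]

def is_valid_word_span(text):
    """Return True if `text` matches any accepted word-span pattern."""
    return any(p.match(text) for p in WORD_SPAN_PATTERNS)
```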
X-Spanformer includes comprehensive test coverage organized into focused categories for maintainability and clear separation of concerns.
- `tests/pipelines/` – Data processing pipeline tests
  - `test_pipelines_pdf2jsonl.py` – PDF→JSONL conversion with AI judging
  - `test_pipelines_jsonl2vocab.py` – Vocabulary induction (Section 3.1)
  - `test_pipeline_span_annotator.py` – Span annotation pipeline tests with label cleaning validation
  - `test_pipelines_vocab2embedding.py` – Seed embeddings & span generation (Section 3.2)
  - `test_integration_vocab2embedding.py` – End-to-end integration validation
- `tests/embedding/` – Embedding analysis utilities (Section 3.2)
  - `test_pipeline.py` – Complete vocab2embedding pipeline validation
  - `test_sequence_introspector.py` – Chunk-based sequence loading tests
  - `test_embedding_chunk.py` – Chunk management and validation tests
- `tests/schema/` – Pydantic schema validation
  - `test_schema.py` – Basic schema validation
  - `test_schema_comprehensive.py` – Comprehensive schema tests
  - `test_schema_vocab.py` – Vocabulary schema validation
- `tests/agents/` – AI agent and content processing
  - `test_agents.py` – Base agent functionality
  - `test_span_annotator.py` – Span annotation pipeline tests
  - `test_e2e_ollama_client.py` – Ollama client integration
- `tests/xbar/` – X-bar theory and label cleaning tests
  - `test_xbar_map.py` – Label cleaning and word span validation tests
  - `test_xbar_annotator.py` – X-bar annotation logic tests
- `tests/config/` – Configuration system tests
  - `test_span_annotator_config.py` – Configuration loading with logging support
- `tests/core/` – Core utilities and configuration
  - `test_config_loader.py` – Configuration loading
  - `test_error_handling.py` – Error handling validation
  - `test_rich_utils.py` – Console output utilities
  - `test_vocab_*.py` – Vocabulary processing utilities
```bash
# Run all tests
python -m pytest tests/

# Run specific test categories
python -m pytest tests/embedding/   # Embedding tests (Section 3.2)
python -m pytest tests/pipelines/   # Pipeline tests (Sections 3.1, 3.2)
python -m pytest tests/schema/      # Schema validation tests

# Run with verbose output and coverage
python -m pytest tests/ -v --cov=x_spanformer

# Test specific pipeline components
python -m pytest tests/embedding/test_pipeline.py -v
python -m pytest tests/pipelines/test_pipelines_vocab2embedding.py -v
```

- Mathematical Correctness - Validates Section 3.1/3.2 algorithms (EM convergence, forward-backward consistency, Xavier initialization)
- Integration Testing - End-to-end pipeline validation with synthetic and real data
- Schema Validation - Pydantic model testing with edge cases and comprehensive coverage
- Synthetic Data Generation - Automated test data creation for consistent, reproducible testing
- Modular Architecture - Organized by functionality for easy navigation and maintenance
X-Spanformer integrates with large language models through a robust async client interface:
```python
from x_spanformer.agents.ollama_client import chat

# Core chat function for LLM communication
response = await chat(
    model="llama3.2:3b",
    conversation=[
        {"role": "user", "content": "Analyze this text for spans..."}
    ],
    system="You are an expert linguistic annotator.",
    temperature=0.2,
    timeout=180.0
)
```

Features:
- Async Communication: Non-blocking LLM interactions for high throughput
- Conversation History: Multi-turn context preservation for complex annotation tasks
- Temperature Control: Creativity vs consistency tuning for different annotation strategies
- Timeout Management: Prevents hanging on slow responses with configurable limits
- Error Handling: Comprehensive connection and response error recovery mechanisms
X-Spanformer implements the complete architecture described in "X-Spanformer: A Tokenizer-Free, Span-Aware Encoder Inspired by X-Bar Theory" (Rawson & Chrzanowski, 2025):
- Hybrid Unigram-LM: EM with Viterbi approximation for optimal segmentation
- Entropy-based pruning: Adaptive vocabulary size control with perplexity thresholds
- Whitespace-aware tokenization: Strict separation between content and whitespace tokens
- Forward-backward probability computation: P[t,i] = (α_t × p(u_i) × β_{t+|u_i|}) / α_{T+1}
- Vocabulary-aware Xavier initialization: Probability-adjusted embedding variance
- Multi-scale dilated convolutions: Contextual encoding with kernels [3,5,7], dilations [1,2,4]
- Independent boundary prediction: Separate linear heads for start/end positions
- Multi-label span support: BCE loss for overlapping hierarchical spans
- X-bar hierarchical structure: Word → phrase → clause level annotations
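As an illustration of the multi-scale dilated convolution bullet, here is a depthwise sketch with the stated kernels [3, 5, 7] and dilations [1, 2, 4]. It uses fixed averaging filters for clarity; the actual encoder learns its filters per channel:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Same-padded depthwise 1D dilated convolution over a (T, d) sequence
    with a shared scalar kernel w of length k; receptive field (k-1)*d + 1."""
    T, _ = x.shape
    k = len(w)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for j in range(k):
        out += w[j] * xp[j * dilation : j * dilation + T]
    return out

def multi_scale_encode(x, kernels=(3, 5, 7), dilations=(1, 2, 4)):
    """Sum of same-shaped branch outputs, one per (kernel, dilation) pair,
    giving receptive fields of 3, 9, and 25 positions in linear time."""
    return sum(dilated_conv1d(x, np.full(k, 1.0 / k), d)
               for k, d in zip(kernels, dilations))
```

Each branch is O(T·k), so the whole stack stays linear in sequence length, which is what makes the encoding ONNX-friendly.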
x-spanformer/
├── x_spanformer/
│ ├── pipelines/ # Data processing pipelines (CLI interfaces)
│ │ ├── shared/ # Shared utilities for consistent processing
│ │ │ ├── text_processor.py # Text splitting and processing utilities
│ │ │ └── jsonl_processor.py # JSONL file handling and corpus management
│ │ ├── pdf2jsonl.py # PDF → JSONL conversion with AI judging
│ │ ├── jsonl2vocab.py # Hybrid Unigram-LM vocabulary induction (Section 3.1)
│ │ ├── vocab2embedding.py # Section 3.2: Seed embeddings & span generation
│ │ ├── span_annotator.py # Section 3.3: X-bar span annotation (PRODUCTION READY)
│ │ └── repo2jsonl.py # GitHub repository → JSONL conversion
│ ├── agents/ # AI agents and LLM integration
│ │ ├── ollama_client.py # Async LLM client with error handling
│ │ └── session/ # Session management for multi-turn conversations
│ ├── xbar/ # X-bar theory implementation
│ │ ├── xbar_annotator.py # Core X-bar annotation logic
│ │ ├── xbar_json.py # Enhanced JSON parsing with robustness
│ │ └── analyze_annotations.py # Annotation analysis tools
│ ├── benchmarks/ # Performance benchmarking tools
│ │ ├── benchmark_vocab2embedding.py # Vocab2embedding pipeline benchmark
│ │ ├── benchmark_vocab2embedding.md # Comprehensive usage documentation
│ │ └── README.md # Benchmarks package overview
│ ├── embedding/ # Embedding analysis & utilities (Section 3.2)
│ │ ├── embedding_utils.py # Loading, analysis, quality metrics
│ │ ├── span_analysis.py # Span patterns, hierarchy, coverage
│ │ ├── embedding_viz.py # Visualization tools (optional deps)
│ │ ├── analyze_results.py # CLI analysis workflows
│ │ ├── sequence_introspector.py # Efficient single-sequence chunk loading
│ │ ├── embedding_chunk.py # Chunk management and validation
│ │ └── test_pipeline.py # Pipeline validation
│ ├── schema/ # Pydantic data models and validation
│ │ ├── pretrain_record.py # Training data schema
│ │ ├── vocab.py # Vocabulary piece and statistics schemas
│ │ └── ... # Other schema definitions
│ ├── controllers/ # Span controller logic
│ └── views/ # Data visualization and inspection
├── config/ # Pipeline configurations
│ ├── agents/ # Agent configurations (judge_agent.yaml)
│ └── pipelines/ # YAML configs for data processing
├── data/ # Training and vocabulary data
│ ├── pretraining/ # Raw segments from PDF processing
│ ├── vocab/ # Vocabulary induction outputs
│ ├── embedding/ # Chunk-based embedding storage
│ ├── annotations/ # X-bar span annotations (Section 3.3)
│ └── benchmarks/ # Performance benchmark results (timestamped)
├── docs/ # Documentation and paper materials
│ ├── vocab_induction.md # Section 3.1 documentation
│ ├── seed_embeddings.md # Section 3.2 documentation
│ ├── pretraining_schema.md # Data format specifications
│ └── paper/ # LaTeX source and compiled paper
├── tests/ # Unit tests and integration tests
│ ├── pipelines/ # Pipeline-specific tests (PDF→JSONL, vocab induction, embeddings)
│ ├── embedding/ # Embedding module tests (Section 3.2 validation)
│ ├── agents/ # AI agent tests (Ollama client, session management)
│ ├── xbar/ # X-bar theory tests (annotation, JSON parsing)
│   ├── schema/            # Pydantic schema validation tests
│   └── core/              # Core utilities and configuration tests
└── examples/              # Sample data and usage examples
---
## 🧪 Pipeline Tools
### Core Pipelines
- **`pdf2jsonl.py`** — Convert PDFs to validated JSONL segments with AI content judging
- **`jsonl2vocab.py`** — Induce hybrid Unigram-LM vocabulary using EM + Viterbi with adaptive pruning
- **`vocab2embedding.py`** — Generate seed embeddings and span candidates (Section 3.2: forward-backward algorithm, vocabulary-aware Xavier initialization, multi-scale contextualization)
- **`repo2jsonl.py`** — Export GitHub repositories to JSONL with shallow cloning and AI judging
### Shared Utilities
- **`shared/text_processor.py`** — Unified corpus loading and text processing across all pipelines for consistency and maintainability
### Validation & Analysis
- **Schema validation** — Pydantic models ensure data consistency across pipelines
- **Rich console output** — Detailed progress tracking and statistics reporting
- **Incremental processing** — Resume interrupted runs and process new data efficiently
- **Dependency management** — All dependencies from `pyproject.toml` are assumed available (matplotlib, seaborn, pandas, gitpython, pdf2seg, etc.)
### Configuration
- **YAML-based configs** — Hyperparameter tuning for vocabulary induction and content judging
- **Modular architecture** — Easy to extend with new processing stages and validation rules
---
## 🔬 Performance Benchmarking
X-Spanformer includes a comprehensive benchmarking infrastructure for scientific performance analysis and optimization tracking of pipeline components.
### Benchmarks Package
The `x_spanformer.benchmarks` package provides scientific measurement capabilities with:
- **Statistical Analysis**: Multiple runs with mean, standard deviation, and confidence intervals
- **Stage Breakdown**: Detailed timing for pipeline components (forward-backward, seed embedding, convolution, candidate generation)
- **Parallel Processing Analysis**: Compare sequential vs multi-worker performance scaling
- **Historical Tracking**: Timestamped results for optimization progress monitoring
- **Profiling Support**: Optional cProfile integration for bottleneck identification
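The statistical summary in the first bullet can be sketched as follows (normal-approximation confidence interval; the benchmarks package's exact reporting may differ):

```python
import statistics

def summarize_runs(times, z=1.96):
    """Mean, sample standard deviation, and a normal-approximation 95%
    confidence interval for repeated benchmark timings. Illustrative sketch."""
    mean = statistics.mean(times)
    std = statistics.stdev(times)              # sample std (n-1 denominator)
    half = z * std / (len(times) ** 0.5)       # CI half-width
    return {"mean": mean, "std": std, "ci95": (mean - half, mean + half)}

stats = summarize_runs([46.2, 44.9, 48.1, 47.3, 46.0])
# stats["mean"] == 46.5
```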
### Vocab2Embedding Benchmark
Performance analysis for the vocab2embedding pipeline (Section 3.2):
```bash
# Quick performance check (5 runs, 10 sequences)
python -m x_spanformer.benchmarks.benchmark_vocab2embedding \
--vocab data/vocab/out/vocab.jsonl \
--input data/pretraining/out/jsonl/dataset.jsonl \
--config config/pipelines/vocab2embedding.yaml
# Scientific analysis with profiling (10 runs, 50 sequences)
python -m x_spanformer.benchmarks.benchmark_vocab2embedding \
--vocab data/vocab/out/vocab.jsonl \
--input data/pretraining/out/jsonl/dataset.jsonl \
--config config/pipelines/vocab2embedding.yaml \
--output data/benchmarks \
--runs 10 --sequences 50 --profile
# Parallel processing benchmark (compare 1 vs 4 workers)
python -m x_spanformer.benchmarks.benchmark_vocab2embedding \
--vocab data/vocab/out/vocab.jsonl \
--input data/pretraining/out/jsonl/dataset.jsonl \
--config config/pipelines/vocab2embedding.yaml \
  --runs 5 --sequences 20 --workers 4
```
Results are automatically saved with timestamps for historical tracking:
data/benchmarks/
├── vocab2embedding_benchmark_20250723_171732.json
├── vocab2embedding_benchmark_20250723_180145.json
└── vocab2embedding_benchmark_20250723_184521.json
Example Performance Metrics:
- Sequential Processing (1 worker): 46.7s ± 2.8s for 12 sequences
- Parallel Processing (4 workers): 29.8s ± 1.1s for 12 sequences (36% speedup)
- Candidates per Sequence: ~4,500-5,000 (comprehensive coverage)
- Stage Breakdown: 40% candidate generation, 40% forward-backward algorithm
- GPU Memory Scaling: 4 workers ≈ 4× GPU memory usage per worker
- Chunk Storage Efficiency: ~30-60MB per 100-sequence chunk with compression
- Resume Performance: Near-instant startup with existing chunk validation
- Introspection Speed: <100ms single-sequence loading from chunks
- Optimization Targets: Automatically identifies bottlenecks for targeted improvements
```bash
# 1. Baseline measurement before optimization
python -m x_spanformer.benchmarks.benchmark_vocab2embedding \
  --vocab data/vocab/out/vocab.jsonl \
  --input data/pretraining/out/jsonl/dataset.jsonl \
  --runs 3 --sequences 5

# 2. Make code optimizations...

# 3. Validate improvements with detailed analysis
python -m x_spanformer.benchmarks.benchmark_vocab2embedding \
  --vocab data/vocab/out/vocab.jsonl \
  --input data/pretraining/out/jsonl/dataset.jsonl \
  --runs 10 --sequences 20 --profile

# 4. Test parallel processing scaling
python -m x_spanformer.benchmarks.benchmark_vocab2embedding \
  --vocab data/vocab/out/vocab.jsonl \
  --input data/pretraining/out/jsonl/dataset.jsonl \
  --runs 5 --sequences 20 --workers 1
python -m x_spanformer.benchmarks.benchmark_vocab2embedding \
  --vocab data/vocab/out/vocab.jsonl \
  --input data/pretraining/out/jsonl/dataset.jsonl \
  --runs 5 --sequences 20 --workers 4
```

Documentation: See `x_spanformer/benchmarks/benchmark_vocab2embedding.md` for a comprehensive usage guide.
The embedding module provides comprehensive utilities for working with vocab2embedding pipeline (Section 3.2) outputs, enabling analysis, visualization, and debugging of vocabulary-to-embedding transformations.
- `embedding_utils.py` – Core utilities for loading and analyzing embeddings
- `span_analysis.py` – Advanced span pattern analysis with hierarchy detection
- `embedding_viz.py` – Rich visualization tools (matplotlib and seaborn assumed available)
- `analyze_results.py` – Command-line analysis workflows
- `test_pipeline.py` – Comprehensive pipeline validation
```python
from x_spanformer.embedding import (
    load_embedding_results,
    analyze_embedding_quality,
    SpanAnalyzer
)

# Load vocab2embedding results
result = load_embedding_results("data/embeddings", sequence_id=1)

# Analyze embedding quality
quality = analyze_embedding_quality(result['contextual_embeddings'])
print(f"Mean norm: {quality['mean_embedding_norm']:.3f}")

# Analyze span coverage patterns
sequence = result['metadata']['sequence']
candidates = result['metadata']['span_candidates']
analyzer = SpanAnalyzer(sequence, candidates)
coverage = analyzer.compute_coverage_statistics()
print(f"Coverage: {coverage['coverage_density']:.1%}")
```

```bash
# Analyze specific sequence
python -m x_spanformer.embedding.analyze_results data/embeddings/ --sequence-id 1

# Batch analysis across sequences
python -m x_spanformer.embedding.analyze_results data/embeddings/ --batch --max-sequences 10

# Export embeddings to numpy
python -m x_spanformer.embedding.analyze_results data/embeddings/ --export contextual

# Test complete pipeline with synthetic data
python x_spanformer/embedding/test_pipeline.py
```

Expected Output:
```
🧪 Testing vocab2embedding pipeline
✅ Pipeline initialized successfully
✅ Processed sequence: 'the quick brown fox'
   Number of candidates: 112
✅ Embedding quality analysis: Mean norm: 16.816
✅ Span coverage analysis: Coverage density: 100.0%
🎉 All tests passed successfully!
```
- Quality Assessment — Embedding norms, variance ratios, similarity analysis
- Span Pattern Analysis — Hierarchy detection, coverage gaps, overlap patterns
- Visualization Suite — Heatmaps, PCA plots, span coverage maps (matplotlib/seaborn integration)
- Chunk-Based Loading — Efficient single-sequence access from compressed chunk storage
- Batch Processing — Aggregate statistics across multiple sequences
- Export Capabilities — Numpy format, JSON metadata, comprehensive reporting
- Fast Introspection — Millisecond loading times with sequence introspector tool
This module bridges Section 3.2 outputs with downstream X-Spanformer components, providing essential debugging and analysis capabilities for span-aware embedding research.
Segment PDF documents into structured clauses using OCR + spaCy:
```bash
pdf2seg -i paper.pdf -o spans/
```

Ideal for extracting domain-specific clause boundaries from scientific papers, REPL transcripts, or code-heavy PDFs. The output is then processed by our pdf2jsonl pipeline for validation and schema conformance.
Generate structured span-labeled records using local LLMs:
```bash
oxbar compile input.txt --type mixed --output spans.json
```

Supports retry logic, confidence scoring, and mode switching. Complements our vocabulary induction by providing supervised span labels for training data.
- Linguistics: X-bar phrase structure theory for hierarchical span organization and compositional boundaries
- Statistical Methods: Expectation-maximization with Viterbi approximation for optimal vocabulary segmentation
- Neural Architecture: Factorized pointer networks with linear-time convolutional contextual encoding
- Integration Mechanisms: Multiple controller fusion pathways (bias injection, prefix tuning, gated-FFN modulation)
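As a hedged illustration of the gated-FFN pathway named above (the weight shapes, gating form, and names here are assumptions for exposition, not the paper's exact controller design):

```python
import numpy as np

def gated_ffn_fusion(h, s, W_g, W_f):
    """Hypothetical gated-FFN fusion sketch: a span-controller summary s
    gates a feed-forward residual update of a hidden state h. Bias injection
    and prefix tuning are alternative pathways not shown here."""
    gate = 1.0 / (1.0 + np.exp(-(s @ W_g)))   # scalar sigmoid gate from s
    return h + gate * np.tanh(h @ W_f)        # gated residual FFN update
```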
We welcome span explorers, linguistically curious devs, and tokenizer skeptics.
Ways to help:
- Label new examples using `oxbar` or manual annotations
- Extend the span role taxonomy for underrepresented domains (e.g., REPLs, math, RST)
- Build new controller fusion heads or injection pathways
- Analyze span induction across language families, treebanks, or doc formats
- Visualize structural routing dynamics in longer sequences
Start with CONTRIBUTING.md to onboard.
This research and code are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0).
Copyright (c) 2025
TAU SYSTEMS by NAXZYU CORP.
https://zenodo.org/records/15750962
```bibtex
@misc{rawson2025xspanformer,
  title     = {X-Spanformer: Tokenizer-Free Span Induction with Structural Fusion},
  author    = {Rawson, Kara and Chrzanowski, Aimee},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15750962},
  url       = {https://zenodo.org/records/15750962}
}
```