A comprehensive Domain-Specific Language (DSL) compiler implemented in Rust for modeling media outlets, companies, and their relationships.
This project implements a complete compiler pipeline for the MediaLanguage DSL, which is designed to model complex media industry relationships, company hierarchies, and temporal data. The compiler follows the standard pipeline: Lexical Analysis → Parsing → Semantic Analysis → Code Generation.
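Conceptually, the stages chain together as in the minimal sketch below; the function and type names (`tokenize`, `parse`, `analyze`, `generate_sql`) are illustrative stand-ins, not the crate's actual API.

```rust
// Sketch of the pipeline shape with stand-in names and stub stages.
struct Token;
struct Program;
struct Ir;
#[derive(Debug)]
struct CompileError;

fn tokenize(_source: &str) -> Result<Vec<Token>, CompileError> { Ok(Vec::new()) }
fn parse(_tokens: &[Token]) -> Result<Program, CompileError> { Ok(Program) }
fn analyze(_ast: &Program) -> Result<Ir, CompileError> { Ok(Ir) }
fn generate_sql(_ir: &Ir) -> Result<String, CompileError> { Ok(String::new()) }

fn compile_to_sql(source: &str) -> Result<String, CompileError> {
    let tokens = tokenize(source)?; // Lexical Analysis
    let ast = parse(&tokens)?;      // Parsing
    let ir = analyze(&ast)?;        // Semantic Analysis, lowering to IR
    generate_sql(&ir)               // Code Generation
}
```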
```text
mdsl-rs/
├── src/
│   ├── lexer/      # Tokenization and lexical analysis
│   ├── parser/     # Syntax analysis and AST construction
│   ├── semantic/   # Symbol tables and type checking
│   ├── ir/         # Intermediate representation
│   ├── codegen/    # SQL and Cypher code generation
│   ├── error.rs    # Comprehensive error handling
│   └── main.rs     # CLI application
├── examples/       # Sample MediaLanguage files
└── tests/          # Test files and validation
```
```sh
cd mdsl-rs
cargo build

# Tokenize a MediaLanguage file
cargo run -- lex examples/simple_example.mdsl

# Parse to AST
cargo run -- parse examples/simple_example.mdsl

# Generate SQL (feature-gated)
cargo run --features sql-codegen -- sql examples/simple_example.mdsl

# Generate Cypher (feature-gated)
cargo run --features cypher-codegen -- cypher examples/simple_example.mdsl
```
Start here to understand how source code becomes tokens:
```rust
// src/lexer/scanner.rs - Main lexer implementation
pub struct Lexer<'a> {
    source: &'a str,
    chars: Peekable<Chars<'a>>,
    position: SourcePosition,
    // ...
}
```
Key Features:
- Character-by-character scanning
- Keyword recognition (case-insensitive)
- String literal parsing with escape sequences
- Comment handling (`//`, `/* */`, `#`)
- Annotation parsing (`@identifier`)
- Variable reference parsing (`$identifier`)
Example:
```mdsl
LET austria_region = "Österreich gesamt";
```

becomes the token stream `[Keyword(Let), Identifier("austria_region"), Assign, String("Österreich gesamt"), Semicolon]`.
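The case-insensitive keyword recognition listed above can be as simple as normalizing a scanned word before lookup. A minimal sketch, not the crate's exact code, with only a few `TokenKind` variants shown:

```rust
// Sketch: classify a scanned word as a keyword or identifier,
// matching keywords case-insensitively.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Let,
    Unit,
    Family,
    Identifier(String),
}

fn classify_word(word: &str) -> TokenKind {
    match word.to_ascii_lowercase().as_str() {
        "let" => TokenKind::Let,
        "unit" => TokenKind::Unit,
        "family" => TokenKind::Family,
        // Anything that is not a keyword becomes an identifier token.
        _ => TokenKind::Identifier(word.to_string()),
    }
}
```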
The AST defines the structure of parsed MediaLanguage code:
```rust
pub enum Statement {
    Import(ImportStatement),
    Variable(VariableDeclaration),
    Unit(UnitDeclaration),
    Vocabulary(VocabularyDeclaration),
    Family(FamilyDeclaration),
    Template(TemplateDeclaration),
    Data(DataDeclaration),
    Relationship(RelationshipDeclaration),
    Comment(CommentStatement),
}
```
Key Concepts:
- Units: Table/entity definitions with field types (see the sketch after this list)
- Families: Hierarchical structures containing outlets
- Templates: Reusable outlet definitions with inheritance
- Relationships: Diachronic and synchronous links between entities
- Data: Market data and metrics with temporal information
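As an illustration of what a unit node might carry, here is a hedged sketch; the struct and field names are assumptions, not the crate's actual definitions:

```rust
// Illustrative shape of a UNIT declaration node.
struct UnitDeclaration {
    name: String,
    fields: Vec<FieldDeclaration>,
    position: SourcePosition, // where the declaration appears in the source
}

struct FieldDeclaration {
    name: String,
    field_type: FieldType,
}

enum FieldType {
    Id,                    // ID, e.g. a primary key
    Text(Option<u32>),     // TEXT with an optional maximum length, e.g. TEXT(120)
    Number,                // NUMBER
    Category(Vec<String>), // CATEGORY with its enumerated values
}

struct SourcePosition {
    line: usize,
    column: usize,
}
```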
The parser converts tokens into AST nodes:
```rust
pub struct Parser {
    tokens: Vec<Token>,
    current: usize,
}

impl Parser {
    pub fn parse(&mut self) -> Result<Program> {
        let mut statements = Vec::new();
        // Parse top-level statements until end of input
        while !self.is_at_end() {
            statements.push(self.parse_statement()?);
        }
        // `position` is the program's starting source position
        Ok(Program::new(statements, position))
    }
}
```
Parsing Strategy:
- Top-down: Start with program, parse statements, then expressions
- Error Recovery: Skip to the next statement on error (see the sketch after this list)
- Source Position Tracking: Every AST node knows its location
- Robust Handling: Comments, semicolons, and whitespace
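Statement-level error recovery is often implemented as a "synchronize" step that skips tokens until a likely statement boundary. A self-contained sketch under that assumption; the token kinds and fields are simplified stand-ins:

```rust
// Sketch of skip-to-next-statement error recovery.
#[derive(PartialEq)]
enum TokenKind { Semicolon, Let, Unit, Family, Other }

struct Parser { tokens: Vec<TokenKind>, current: usize }

impl Parser {
    fn is_at_end(&self) -> bool { self.current >= self.tokens.len() }

    fn synchronize(&mut self) {
        while !self.is_at_end() {
            // A semicolon usually ends a statement; resume right after it.
            if self.tokens[self.current] == TokenKind::Semicolon {
                self.current += 1;
                return;
            }
            // Keywords that open a new top-level statement are safe resume points.
            match self.tokens[self.current] {
                TokenKind::Let | TokenKind::Unit | TokenKind::Family => return,
                _ => self.current += 1,
            }
        }
    }
}
```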
Comprehensive error types for all compilation phases:
```rust
pub enum Error {
    Lexer(LexerError),       // Unexpected characters, invalid tokens
    Parser(ParserError),     // Syntax errors, missing delimiters
    Semantic(SemanticError), // Type errors, undefined variables
    CodeGen(CodeGenError),   // Generation failures
    Io(String),              // File I/O errors
}
```
Symbol Tables (`src/semantic/symbol_table.rs`):
- Track variable definitions and scopes
- Resolve imports and cross-references
- Validate identifier usage
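A minimal scoped symbol table could look like the sketch below; the `Symbol` fields are assumptions, and the real module likely tracks richer metadata:

```rust
use std::collections::HashMap;

// Sketch: a stack of scopes, innermost scope last.
struct SymbolTable {
    scopes: Vec<HashMap<String, Symbol>>,
}

struct Symbol {
    name: String,
    kind: String, // e.g. "variable", "unit", "template"
}

impl SymbolTable {
    fn new() -> Self {
        Self { scopes: vec![HashMap::new()] }
    }

    fn enter_scope(&mut self) {
        self.scopes.push(HashMap::new());
    }

    fn exit_scope(&mut self) {
        self.scopes.pop();
    }

    fn define(&mut self, symbol: Symbol) {
        if let Some(scope) = self.scopes.last_mut() {
            scope.insert(symbol.name.clone(), symbol);
        }
    }

    // Resolve a name from the innermost scope outward.
    fn resolve(&self, name: &str) -> Option<&Symbol> {
        self.scopes.iter().rev().find_map(|scope| scope.get(name))
    }
}
```

Resolving from the innermost scope outward gives the usual lexical shadowing behavior: an inner definition hides an outer one with the same name.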
Type Checking (`src/semantic/type_checker.rs`):
- Validate field types in units
- Check expression types
- Ensure template inheritance consistency
Validation (`src/semantic/validator.rs`):
- Verify relationship integrity
- Check temporal consistency
- Validate data constraints
The IR provides a language-agnostic representation:
```rust
pub enum IRNode {
    Import(ImportNode),
    Table(TableNode),
    Relationship(RelationshipNode),
    Data(DataNode),
    // ...
}
```
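Lowering from AST to IR might, for instance, turn a `Unit` statement into a `Table` node and drop comments. A sketch with deliberately simplified types:

```rust
// Sketch of AST -> IR lowering over simplified node types.
enum Statement {
    Unit { name: String },
    Comment(String),
}

enum IRNode {
    Table { name: String },
}

fn lower(statements: &[Statement]) -> Vec<IRNode> {
    statements
        .iter()
        .filter_map(|stmt| match stmt {
            // A UNIT declaration becomes a language-agnostic table node.
            Statement::Unit { name } => Some(IRNode::Table { name: name.clone() }),
            // Comments carry no semantics and are dropped from the IR.
            Statement::Comment(_) => None,
        })
        .collect()
}
```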
SQL Generator (`src/codegen/sql.rs`):
- Generate CREATE TABLE statements from UNIT declarations
- Handle vocabulary tables with INSERT statements
- Support outlet data tables
- Map DSL types to SQL types (ID → INTEGER, TEXT → VARCHAR, etc.); a sketch of this mapping follows
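The type mapping can be a single match over the field types. This sketch reuses the assumed `FieldType` shape from the AST section; the `NUMERIC` and `CHECK`-constraint choices are illustrative, not the generator's confirmed output:

```rust
// Sketch of the DSL-to-SQL type mapping.
enum FieldType {
    Id,
    Text(Option<u32>),
    Number,
    Category(Vec<String>),
}

fn sql_type(field: &FieldType) -> String {
    match field {
        FieldType::Id => "INTEGER".to_string(),
        // TEXT(120) becomes VARCHAR(120); unbounded TEXT stays TEXT.
        FieldType::Text(Some(len)) => format!("VARCHAR({len})"),
        FieldType::Text(None) => "TEXT".to_string(),
        FieldType::Number => "NUMERIC".to_string(),
        // One way to encode CATEGORY values: a CHECK constraint.
        FieldType::Category(values) => format!(
            "VARCHAR(255) CHECK (value IN ({}))",
            values
                .iter()
                .map(|v| format!("'{v}'"))
                .collect::<Vec<_>>()
                .join(", ")
        ),
    }
}
```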
Cypher Generator (`src/codegen/cypher.rs`):
- Create nodes and relationships
- Handle temporal properties
- Support complex graph queries (see the sketch below)
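For node creation, the generator presumably emits `CREATE` statements; the sketch below shows the general idea with an illustrative label and property set, not the generator's confirmed output:

```rust
// Sketch: emit a Cypher CREATE statement for an outlet node.
struct Outlet {
    id: u32,
    title: String,
}

fn to_cypher(outlet: &Outlet) -> String {
    format!(
        "CREATE (o:Outlet {{id: {}, title: '{}'}})",
        outlet.id,
        // Escape single quotes so a title cannot break out of the string.
        outlet.title.replace('\'', "\\'")
    )
}
```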
```mdsl
UNIT MediaOutlet {
  id: ID PRIMARY KEY,
  name: TEXT(120),
  sector: NUMBER,
  mandate: CATEGORY(
    "Öffentlich-rechtlich",
    "Privat-kommerziell"
  )
}
```
FAMILY "Kronen Zeitung Family" {
OUTLET "Kronen Zeitung" EXTENDS TEMPLATE "AustrianNewspaper" {
id = 200001;
identity {
title = "Kronen Zeitung";
};
lifecycle {
status "active" FROM "1959-01-01" TO CURRENT {
precision_start = "known";
};
};
characteristics {
sector = "Tageszeitung";
distribution = {
primary_area = $austria_region;
};
};
};
}
```mdsl
TEMPLATE OUTLET "AustrianNewspaper" {
  characteristics {
    language = "de";
    mandate = "Privat-kommerziell";
  };
  metadata {
    steward = "js";
  };
};
```
```mdsl
DIACHRONIC_LINK acquisition {
  predecessor = 300001;
  successor = 200001;
  event_date = "1971-01-01" TO "1971-12-31";
  relationship_type = "Akquisition";
};
```
1. Extend Tokens (`src/lexer/token.rs`):

   ```rust
   pub enum TokenKind {
       // Add new token types
       NewFeature(String),
   }
   ```

2. Update Lexer (`src/lexer/scanner.rs`):

   ```rust
   // Add scanning logic for new tokens
   fn scan_new_feature(&mut self) -> Result<Token> {
       // Implementation
   }
   ```

3. Extend AST (`src/parser/ast.rs`):

   ```rust
   pub enum Statement {
       // Add new statement types
       NewFeature(NewFeatureStatement),
   }
   ```

4. Implement Parser (`src/parser/recursive_descent.rs`):

   ```rust
   fn parse_new_feature(&mut self) -> Result<NewFeatureStatement> {
       // Implementation
   }
   ```
```sh
# Run all tests
cargo test

# Test specific component
cargo test lexer
cargo test parser
cargo test semantic

# Run with verbose output
cargo test -- --nocapture
```
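A component test might look like the sketch below; it exercises the hypothetical `classify_word` helper from the lexer section, so the names are assumptions rather than the crate's actual test suite:

```rust
// Sketch of a unit test in the style `cargo test` picks up.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn keywords_are_case_insensitive() {
        assert_eq!(classify_word("LET"), TokenKind::Let);
        assert_eq!(classify_word("let"), TokenKind::Let);
    }

    #[test]
    fn unknown_words_become_identifiers() {
        assert_eq!(
            classify_word("austria_region"),
            TokenKind::Identifier("austria_region".to_string())
        );
    }
}
```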
- Start with `src/lexer/scanner.rs` to understand tokenization
- Read `src/parser/ast.rs` to see the data structures
- Examine `src/parser/recursive_descent.rs` to understand parsing
- Look at `src/error.rs` to see error handling patterns
- Study the semantic analysis modules
- Understand the IR design
- Examine code generation strategies
- Look at the CLI implementation
- Read the error handling patterns
- Understand the module organization
- Study the testing approach
- Examine the feature flag system
- Complete Pipeline: From lexer to code generation
- Real-world Complexity: Handles sophisticated grammar with annotations, relationships, and temporal data
- Production Quality: Comprehensive error handling with source position tracking
- Educational: Clear separation of concerns with modular design
- Extensible: Feature-flagged code generation (SQL, Cypher)
- Working Implementation: Successfully parses complex MediaLanguage DSL files and generates SQL
This implementation serves as an excellent example of how to build a DSL compiler in Rust, demonstrating best practices in error handling, modular design, and comprehensive testing.