This repository contains work-in-progress on Language-Oriented Programming approaches for CDISC Analysis/Derivation Concepts, with a focus on the Thunderstruck DSL for authoring Statistical Analysis Plans (SAPs) in clinical trials.
This project explores using domain-specific languages (DSLs) and formal language design to capture Statistical Analysis Plans as typed, executable specifications. Three alternative approaches have been considered:
Concept-centric approach (document: LOP-PROPOSAL-CC.md)
A comprehensive language design treating concepts as first-class types:
- BiomedicalConcept: Clinical/biological meaning (e.g., ADAS-Cog Total Score)
- AnalysisConcept: Analysis-space quantities (e.g., Change from Baseline)
- DerivationConcept: Computable transformations (e.g., LOCF Imputation)
Built on Langium with strong typing, functional paradigm, and multi-target code generation (R, SAS, Python). Emphasizes type safety, immutability, pure functions, and composability.
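To make the three concept kinds more concrete, here is a minimal Python sketch of how they might look as immutable, first-class types. It is not taken from LOP-PROPOSAL-CC.md: the field names (`source`, `target`, `compute`) and the helper functions are assumptions made for illustration, and the visit values are made up.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BiomedicalConcept:
    """Clinical/biological meaning, e.g. ADAS-Cog Total Score."""
    name: str
    unit: str


@dataclass(frozen=True)
class AnalysisConcept:
    """Analysis-space quantity defined over a biomedical concept."""
    name: str
    source: BiomedicalConcept


@dataclass(frozen=True)
class DerivationConcept:
    """Computable transformation that yields an analysis concept."""
    name: str
    target: AnalysisConcept
    compute: Callable[..., object]  # hypothetical field: a pure function


def change_from_baseline(aval: float, base: float) -> float:
    """Pure function: no state, same inputs always give the same output."""
    return aval - base


def locf(values: Sequence[Optional[float]]) -> Tuple[Optional[float], ...]:
    """Last observation carried forward; returns a new tuple, input untouched."""
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return tuple(filled)


ADAS_COG = BiomedicalConcept("ADAS-Cog Total Score", unit="points")
CHG = AnalysisConcept("Change from Baseline", source=ADAS_COG)
CHG_RULE = DerivationConcept("AVAL - BASE", target=CHG, compute=change_from_baseline)

# Composition: impute first, then derive (illustrative visit values).
visits = locf([26.0, 28.0, None])
print(CHG_RULE.compute(aval=visits[-1], base=visits[0]))  # 2.0
```

Freezing the dataclasses means an attempt to reassign a field raises an error, which is one simple way to get the immutability the proposal emphasizes.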
Cube-centric approach (document: LOP_PROPOSAL_CC_CUBE.md)
Makes the W3C Data Cube standard the primary organizing principle:
- All data structures are cubes (SDTM, ADaM, Results)
- Operations are typed cube transformations
- Native RDF representation for interoperability
- Automatic validation via W3C integrity constraints (IC-1 through IC-21)
Provides semantic precision, provenance tracking, SPARQL queryability, and seamless integration with CDISC standards.
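As a rough illustration of what "native RDF representation" could mean for a cube's structure, the sketch below renders a W3C Data Cube data structure definition as Turtle using plain string formatting. The `qb:` terms come from the W3C Data Cube vocabulary; the function name `dsd_to_turtle`, the `_DSD` suffix, and the `ex:` namespace are assumptions for this example, not the project's actual generator.

```python
QB_PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix ex: <http://example.org/study/xyz#> .\n"
)


def dsd_to_turtle(name, dimensions, measures, attributes):
    """Render a qb:DataStructureDefinition with one component per property."""
    components = (
        [f"[ qb:dimension ex:{d} ]" for d in dimensions]
        + [f"[ qb:measure ex:{m} ]" for m in measures]
        + [f"[ qb:attribute ex:{a} ]" for a in attributes]
    )
    body = " ,\n        ".join(components)
    return (
        f"{QB_PREFIXES}\n"
        f"ex:{name}_DSD a qb:DataStructureDefinition ;\n"
        f"    qb:component\n        {body} .\n"
    )


print(
    dsd_to_turtle(
        "ADADAS",
        dimensions=["USUBJID", "AVISITN", "TRT01A"],
        measures=["AVAL", "CHG", "BASE"],
        attributes=["EFFFL", "PARAMCD"],
    )
)
```

Once the structure is expressed this way, standard RDF tooling can load it and query it with SPARQL, which is the interoperability argument made above.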
Streamlined DSL approach (document: LOP-PROPOSAL-GPT5.md)
A more compact DSL specification with:
- Clean separation of concerns (concepts, cubes, derivations, analyses, displays)
- Functional pipelines using the `|>` operator
- Module system for reusability
- Emphasis on pragmatic syntax for statistician authoring
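The pipeline idea in the list above can be approximated in Python, which has no `|>` operator, with a small `pipe` helper that threads a value through a sequence of functions. This is only a sketch of the semantics: `pipe`, `at_visit`, and `with_chg` are hypothetical names, and the rows are made-up sample data.

```python
from functools import reduce


def pipe(value, *steps):
    """Apply each step to the result of the previous one, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


def at_visit(visit_number):
    """Build a step that keeps only rows for one analysis visit."""
    return lambda rows: [r for r in rows if r["AVISITN"] == visit_number]


def with_chg(rows):
    """Step that adds CHG = AVAL - BASE without mutating the input rows."""
    return [{**r, "CHG": r["AVAL"] - r["BASE"]} for r in rows]


# Illustrative rows only.
rows = [
    {"USUBJID": "001", "AVISITN": 24, "AVAL": 30.0, "BASE": 26.0},
    {"USUBJID": "002", "AVISITN": 12, "AVAL": 22.0, "BASE": 25.0},
]

# Roughly what `rows |> at_visit(24) |> with_chg` would express in a DSL.
week24 = pipe(rows, at_visit(24), with_chg)
print(week24)
```

Because every step is a plain function from rows to rows, steps can be reordered, reused across modules, and tested in isolation.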
| Aspect | Concept-Centric | Cube-Centric | Streamlined |
|---|---|---|---|
| Primary Focus | Type hierarchy | Data structure | Pragmatic syntax |
| Standard Basis | Custom concepts | W3C Data Cube | Hybrid |
| Best For | Clinicians/statisticians | Data engineers | Quick authoring |
| Strength | Rich type system | Interoperability | Simplicity |
Recommended Strategy: Combine the approaches, using concept-centric authoring that compiles to a cube-centric intermediate representation for tooling and export.
Thunderstruck implements the cube-centric intermediate representation. Concept-centric authoring will be considered later in this project and is currently OUT OF SCOPE.
Thunderstruck is a domain-specific language for authoring Statistical Analysis Plans using the W3C Data Cube standard. It enables statisticians to:
- Define analyses as typed cube operations with automatic validation
- Generate multi-format outputs (R, SAS, Python code; RDF/Turtle metadata)
- Ensure traceability from Protocol → Estimand → Endpoint → Data → Results
- Leverage W3C standards for semantic interoperability
Current Phase: Increment 5 - Advanced LSP Features (Complete ✅)
Latest Achievement: Full IDE experience with Langium's comprehensive LSP features (PR #TBD)
Completed Increments:
Increment 5 - Advanced LSP Features:
- ✅ Code completion (keywords, types, references)
- ✅ Hover information (type details, documentation)
- ✅ Go-to-definition (jump to referenced entities)
- ✅ Find-references (locate all usages)
- ✅ Document symbols (outline view)
- ✅ Real-time diagnostics and error reporting
- ✅ Comprehensive LSP feature tests
- ✅ Sub-100ms response time for all LSP operations
- See docs/INCREMENT_5_SUMMARY.md for complete details
Increment 4:
- ✅ W3C Data Cube Integrity Constraints (5 ICs: IC-1, IC-2, IC-11, IC-12, IC-19)
- ✅ CDISC SDTM Validation (DM, AE, LB domains)
- ✅ CDISC ADaM Validation (ADSL, BDS structures)
- ✅ CDISC CORE Rules Engine (31 rules)
- ✅ Version Management (SDTM 3.2/3.3/3.4, ADaM 1.0/1.1/1.2/1.3)
- ✅ Validation Reporting (JSON, Text, Markdown formats)
- ✅ 402 passing tests with comprehensive integration and performance testing
- ✅ <100ms validation performance for typical programs
- See docs/INCREMENT_4_PLAN.md for complete details
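For readers unfamiliar with the W3C integrity constraints, the following Python sketch shows the idea behind two of the ones listed above: IC-11 (every observation has a value for every dimension) and IC-12 (no two observations share the same values for all dimensions). It operates on plain dictionaries rather than RDF, and the issue and report field names (`rule`, `row`, `message`, `valid`) are assumptions for illustration, not the project's actual report schema.

```python
import json


def check_ic11(observations, dimensions):
    """IC-11: report each observation that lacks a value for a dimension."""
    return [
        {"rule": "IC-11", "row": i, "message": f"missing dimension {d}"}
        for i, obs in enumerate(observations)
        for d in dimensions
        if obs.get(d) is None
    ]


def check_ic12(observations, dimensions):
    """IC-12: report observations whose dimension values repeat an earlier row."""
    seen = {}
    issues = []
    for i, obs in enumerate(observations):
        key = tuple(obs.get(d) for d in dimensions)
        if key in seen:
            issues.append(
                {"rule": "IC-12", "row": i, "message": f"duplicates row {seen[key]}"}
            )
        else:
            seen[key] = i
    return issues


def validate(observations, dimensions):
    """Run both checks and collect the findings in one report dictionary."""
    issues = check_ic11(observations, dimensions) + check_ic12(observations, dimensions)
    return {"valid": not issues, "issues": issues}


# Illustrative observations only.
DIMENSIONS = ["USUBJID", "AVISITN", "TRT01A"]
observations = [
    {"USUBJID": "001", "AVISITN": 24, "TRT01A": "A", "CHG": 4.0},
    {"USUBJID": "001", "AVISITN": 24, "TRT01A": "A", "CHG": 3.5},
    {"USUBJID": "002", "AVISITN": 24, "CHG": -3.0},
]

report = validate(observations, DIMENSIONS)
print(json.dumps(report, indent=2))
```

The same report dictionary could be rendered as text or Markdown instead of JSON; only the final formatting step would change.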
Increment 3:
- ✅ Type system foundation with inference and checking
- ✅ Symbol table with scoping and reference resolution
- ✅ Semantic validators (slice, model, dependency, expression, formula)
- ✅ Type compatibility checking and conversions
- ✅ Complete integration testing
- See docs/INCREMENT_3_PLAN.md for complete details
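Scoped reference resolution, as mentioned in the list above, is commonly implemented as a chain of scopes that each delegate to a parent. The Python sketch below shows that general technique; it is not the Langium-based implementation in this repository, and the `Scope` class and its method names are hypothetical.

```python
class Scope:
    """One level of a symbol table; lookups fall back to the parent scope."""

    def __init__(self, parent=None):
        self.parent = parent
        self.symbols = {}

    def define(self, name, type_name):
        """Register a name in this scope; redefinition in the same scope is an error."""
        if name in self.symbols:
            raise ValueError(f"duplicate definition: {name}")
        self.symbols[name] = type_name

    def resolve(self, name):
        """Return the type of the nearest enclosing definition, or None."""
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None


# Program-level scope holds cubes; each cube gets a child scope for components.
program = Scope()
program.define("ADADAS", "Cube")

adadas = Scope(parent=program)
adadas.define("CHG", "Numeric")
adadas.define("AVISITN", "Integer")

print(adadas.resolve("CHG"))      # Numeric
print(adadas.resolve("ADADAS"))   # Cube (found in the parent scope)
print(program.resolve("CHG"))     # None (not visible outside the cube)
```

A validator built on this can report an unresolved reference whenever `resolve` returns `None`, and a duplicate-definition error whenever `define` raises.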
Increment 2:
- ✅ Langium-based grammar for all core constructs
- ✅ VS Code extension with syntax highlighting
- ✅ LSP integration with real-time diagnostics
- ✅ Expression language and Wilkinson formula notation
- ✅ 10 comprehensive example files
- See docs/INCREMENT_2_REVIEW.md for assessment
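Wilkinson formula notation, mentioned in the list above, writes a model as `response ~ term + term`, where `A:B` denotes an interaction and `A*B` is shorthand for `A + B + A:B`. The Python sketch below parses a small subset of that notation. It is not the project's Langium grammar: `Formula` and `parse_formula` are hypothetical names, and only `+`, `:`, and two-way `*` are handled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Formula:
    response: str
    terms: Tuple[str, ...]

    def variables(self):
        """All variable names used, response first, without duplicates."""
        names = [self.response]
        for term in self.terms:
            names.extend(part.strip() for part in term.split(":"))
        return tuple(dict.fromkeys(names))


def parse_formula(text: str) -> Formula:
    """Parse 'Y ~ A + B', expanding 'A*B' into 'A', 'B', 'A:B'."""
    if text.count("~") != 1:
        raise ValueError("expected exactly one '~'")
    lhs, rhs = (side.strip() for side in text.split("~"))
    if not lhs:
        raise ValueError("missing response variable")
    terms = []
    for raw in rhs.split("+"):
        factors = [f.strip() for f in raw.split("*")]
        if not all(factors):
            raise ValueError(f"empty term in {text!r}")
        if len(factors) == 1:
            expanded = factors
        elif len(factors) == 2:
            a, b = factors
            expanded = [a, b, f"{a}:{b}"]
        else:
            raise ValueError("this sketch only expands two-way '*'")
        for term in expanded:
            if term not in terms:
                terms.append(term)
    return Formula(lhs, tuple(terms))


formula = parse_formula("CHG ~ TRT01A + BASE")
print(formula.response, formula.terms)

# A validator could flag variables that the input cube does not define.
columns = {"USUBJID", "AVISITN", "TRT01A", "AVAL", "CHG"}
print([v for v in formula.variables() if v not in columns])
```

In the usage above `BASE` is deliberately left out of `columns`, so the second line prints it as an undefined variable.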
Increment 1:
- ✅ Project setup and architecture
- ✅ Basic grammar and parsing
- ✅ Development environment
Test Coverage: 403 tests passing (3 skipped)
Next Phase: Increment 6 - Standard Library + Examples
Key Features Now Available:
- IDE Experience: Code completion, go-to-definition, find-references, hover info
- Standards Validation: W3C Data Cube integrity constraints, CDISC compliance
- Type System: Full type checking with inference and conversions
- Semantic Validation: Models, slices, derivations, dependencies
- Version Management: SDTM 3.2/3.3/3.4, ADaM 1.0-1.3
- Validation Reporting: JSON, Text, Markdown formats
- Real-time Diagnostics: Errors and warnings in VS Code
- Performance: <100ms validation and LSP response times
- Full Documentation: README.thunderstruck.md
- Product Requirements: THUNDERSTRUCK_PRD.md
- Implementation Plan: THUNDERSTRUCK_PLAN.md
- W3C Data Cube Primer: W3C_CUBE_PRIMER.md
// Standards version declaration
standards {
    SDTM: "3.4",
    ADaM: "1.2",
    W3C_Cube: "2014-01-16"
}

// CDISC-compliant ADaM cube with automatic validation
cube ADADAS {
    namespace: "http://example.org/study/xyz#"
    structure: {
        dimensions: [
            USUBJID: Identifier,
            AVISITN: Integer,
            TRT01A: CodedValue<TRTCD>
        ],
        measures: [
            AVAL: Numeric unit: "points",
            CHG: Numeric unit: "points",
            BASE: Numeric unit: "points"
        ],
        attributes: [
            EFFFL: Flag,
            PARAMCD: CodedValue<PARAM>
        ]
    }
}

// Type-safe slice with automatic IC-11 validation
slice Week24 from ADADAS {
    fix: { AVISITN = 24 },
    vary: [USUBJID, TRT01A],
    where: EFFFL == "Y"
}

// Statistical model with Wilkinson notation
model ANCOVA {
    input: Week24,
    formula: CHG ~ TRT01A + BASE,
    family: Gaussian,
    link: Identity
}
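To show what the `Week24` slice above selects, here is a small Python sketch that applies `fix`, `vary`, and `where` to rows held as dictionaries. It is an interpretation rather than the Thunderstruck implementation: the function names are hypothetical, the rows are made up, and the rule that every cube dimension must be either fixed or varying (but not both) is an assumption about what a slice validator would enforce.

```python
DIMENSIONS = ("USUBJID", "AVISITN", "TRT01A")


def check_slice(dimensions, fix, vary):
    """Assumed rule: each dimension is fixed or varying, never both or neither."""
    fixed, free, known = set(fix), set(vary), set(dimensions)
    problems = []
    for name in sorted(fixed & free):
        problems.append(f"{name} is both fixed and varying")
    for name in sorted(known - fixed - free):
        problems.append(f"{name} is neither fixed nor varying")
    for name in sorted((fixed | free) - known):
        problems.append(f"{name} is not a dimension of the cube")
    return problems


def apply_slice(rows, fix, where=lambda row: True):
    """Keep rows whose fixed dimensions match and that satisfy the predicate."""
    return [
        row for row in rows
        if all(row[dim] == value for dim, value in fix.items()) and where(row)
    ]


# Illustrative rows only.
adadas = [
    {"USUBJID": "001", "AVISITN": 24, "TRT01A": "A", "EFFFL": "Y", "CHG": 4.0},
    {"USUBJID": "002", "AVISITN": 24, "TRT01A": "B", "EFFFL": "N", "CHG": 1.0},
    {"USUBJID": "003", "AVISITN": 12, "TRT01A": "A", "EFFFL": "Y", "CHG": 2.0},
]

fix = {"AVISITN": 24}
vary = ["USUBJID", "TRT01A"]

print(check_slice(DIMENSIONS, fix, vary))  # [] when the slice is well formed
week24 = apply_slice(adadas, fix, where=lambda row: row["EFFFL"] == "Y")
print([row["USUBJID"] for row in week24])  # ['001']
```

Checking the slice definition separately from applying it mirrors the split between static validation in the editor and execution in generated code.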
Validation Features:
- W3C integrity constraints (IC-1, IC-2, IC-11, IC-12, IC-19)
- CDISC SDTM/ADaM conformance checking
- CORE rules validation
- Type checking and inference
- Real-time diagnostics in VS Code
See the examples/ directory for complete analysis specifications and packages/thunderstruck-language/src/tests/fixtures/ for validation test examples.
acdc-wip/
├── packages/
│   ├── thunderstruck-language/   # Langium language definition
│   └── thunderstruck-vscode/     # VS Code extension
├── examples/                     # Example .tsk files
├── docs/                         # Supporting documentation
├── LOP-PROPOSAL-CC.md            # Concept-centric approach
├── LOP_PROPOSAL_CC_CUBE.md       # Cube-centric approach
├── LOP-PROPOSAL-GPT5.md          # Streamlined DSL approach
├── THUNDERSTRUCK_PRD.md          # Product requirements
├── THUNDERSTRUCK_PLAN.md         # Implementation plan
└── README.thunderstruck.md       # Full Thunderstruck documentation
This project is in early development. See THUNDERSTRUCK_PLAN.md for the implementation roadmap and current status.
MIT License - see LICENSE file for details