A streaming SMILES (Simplified Molecular Input Line Entry System) parser written in Rust. It provides lexical analysis, parsing, and AST generation for SMILES notation, with per-line error recovery and source span tracking.
- Streaming Token-Based Parsing: Efficiently processes large multi-line inputs without upfront splitting
- Complete SMILES Grammar Support: Handles all standard SMILES constructs including atoms, bonds, rings, branches, and stereochemistry
- Advanced Error Recovery: Parse errors in one SMILES don't affect subsequent parsing
- Accurate Span Tracking: Every AST node maintains precise source location information
- Visitor Pattern: Easy AST traversal and analysis with built-in visitor support
- Recursive Descent Parser: Clean, maintainable parser design following grammar structure
- Semantic Validation: Ring bond matching, charge validation, isotope ranges, etc.
- Memory Efficient: Single lexer/parser instance for multi-line processing
- Zero-Copy String Handling: Source text references via spans, no unnecessary allocations
- Comprehensive Error Types: Separate lexical and parse errors with detailed context
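The ring bond matching listed under semantic validation can be illustrated with a small standalone sketch. It is not this crate's implementation: the function `check_ring_closures` and the `UnclosedRing` type are hypothetical names, and the sketch assumes single-digit ring labels only (digits inside bracket atoms are skipped, `%nn` labels are not handled).

```rust
use std::collections::HashMap;

/// Hypothetical error for this sketch: a ring label opened at `offset` was never closed.
#[derive(Debug, PartialEq)]
struct UnclosedRing {
    label: u8,
    offset: usize,
}

/// Checks that every single-digit ring label appears in open/close pairs.
/// Assumption: digits inside bracket atoms (e.g. `[13C]`) are isotopes or
/// charges, so they are skipped.
fn check_ring_closures(smiles: &str) -> Result<(), Vec<UnclosedRing>> {
    let mut open: HashMap<u8, usize> = HashMap::new();
    let mut in_bracket = false;

    for (offset, ch) in smiles.char_indices() {
        match ch {
            '[' => in_bracket = true,
            ']' => in_bracket = false,
            '0'..='9' if !in_bracket => {
                let label = ch as u8 - b'0';
                // The first occurrence opens the ring, the second closes it.
                if open.remove(&label).is_none() {
                    open.insert(label, offset);
                }
            }
            _ => {}
        }
    }

    if open.is_empty() {
        return Ok(());
    }
    let mut errors: Vec<UnclosedRing> = open
        .into_iter()
        .map(|(label, offset)| UnclosedRing { label, offset })
        .collect();
    // Report in source order so the output is deterministic.
    errors.sort_by_key(|e| e.offset);
    Err(errors)
}

fn main() {
    println!("{:?}", check_ring_closures("c1ccccc1"));
    println!("{:?}", check_ring_closures("C1CCC"));
}
```

Keeping the byte offset of the opening label is what lets an error point back at the source location, in the spirit of the span tracking described above.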
Add this to your Cargo.toml:
[dependencies]
smiles = "0.1.0"

use smiles::parser::Parser;

fn main() {
    let mut parser = Parser::new("CCO");
    match parser.parse_smiles() {
        Ok(smiles) => {
            println!("Successfully parsed: {}", smiles.chain.links.len() + 1);
            // `count_atoms` is not defined in this snippet; supply your own
            // helper, for example one built on a visitor such as AtomCounter.
            println!("Number of atoms: {}", count_atoms(&smiles));
        }
        Err(error) => {
            println!("Parse error: {}", error);
        }
    }
}

use smiles::parser::Parser;

fn main() {
    let multi_smiles = r#"
C
CCO
c1ccccc1
Invalid@SMILES
CC(C)C
"#;

    let mut parser = Parser::new(multi_smiles);
    let results = parser.parse_multiple_lines();

    for (i, result) in results.iter().enumerate() {
        match result {
            Ok(_) => println!("SMILES {}: ✓ Parsed successfully", i + 1),
            Err(errors) => {
                println!("SMILES {}: ✗ Parse errors:", i + 1);
                for error in errors {
                    println!(" - {}", error);
                }
            }
        }
    }
}

For processing large files or streaming data, use the iterator-based API for memory-efficient line-by-line parsing:
use smiles::parser::Parser;

fn main() {
    let multi_smiles = r#"
C
CCO
c1ccccc1
Invalid@SMILES
CC(C)C
"#;

    let mut parser = Parser::new(multi_smiles);

    // Iterator approach - memory efficient
    for (line_num, result) in parser.parse_lines().enumerate() {
        match result {
            Ok(_) => println!("Line {}: ✓ Valid SMILES", line_num + 1),
            Err(error) => println!("Line {}: ✗ {}", error.line_number, error.error_type),
        }
    }

    // Collect only valid SMILES
    let mut parser2 = Parser::new(multi_smiles);
    let valid_smiles: Vec<_> = parser2.parse_lines()
        .filter_map(|result| result.ok())
        .collect();
    println!("Found {} valid SMILES", valid_smiles.len());

    // Convenience method for bulk processing
    let mut parser3 = Parser::new(multi_smiles);
    let all_results = parser3.parse_all_lines();
    let (valid, errors): (Vec<_>, Vec<_>) = all_results.into_iter()
        .partition(|result| result.is_ok());
    println!("Processed: {} valid, {} errors", valid.len(), errors.len());
}

use smiles::{parser::Parser, visitor::*, ast::*};

struct AtomCounter {
    count: usize,
}

impl AtomCounter {
    fn new() -> Self {
        Self { count: 0 }
    }
}

impl Visitor for AtomCounter {
    fn visit_atom(&mut self, _atom: &Atom) {
        self.count += 1;
    }
}

fn main() {
    let mut parser = Parser::new("c1ccccc1");
    if let Ok(smiles) = parser.parse_smiles() {
        let mut counter = AtomCounter::new();
        counter.visit_smiles(&smiles);
        println!("Benzene has {} atoms", counter.count); // Output: 6
    }
}

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Lexer    │───▶│   Parser    │───▶│     AST     │
│             │    │             │    │             │
│ • Tokenize  │    │ • Recursive │    │ • Visitor   │
│ • Errors    │    │   Descent   │    │ • Spans     │
│ • Spans     │    │ • Streaming │    │ • Types     │
└─────────────┘    └─────────────┘    └─────────────┘
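
To make the Lexer → Parser → AST flow in the diagram concrete, here is a minimal, self-contained sketch of the same three stages for a tiny SMILES subset (single-letter C/N/O atoms, `-`, `=`, `#` bonds, and parenthesized branches). Every name in it (`lex`, `MiniParser`, `Node`, `parse`, `count_atoms`) is hypothetical and chosen for illustration; the crate's real lexer, parser, and AST are richer and may be organized differently.

```rust
/// Byte range into the source text (a stand-in for a real span type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span { start: usize, end: usize }

#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind { Atom, Bond, OpenParen, CloseParen }

#[derive(Debug, Clone, Copy, PartialEq)]
struct Token { kind: TokenKind, span: Span }

/// AST node: one atom, its branches, and the rest of the chain.
/// Bond tokens are checked but not stored, to keep the sketch short.
#[derive(Debug, PartialEq)]
struct Node { atom: Span, branches: Vec<Node>, next: Option<Box<Node>> }

/// Lexer stage: one token per character, or the byte offset of a bad character.
fn lex(src: &str) -> Result<Vec<Token>, usize> {
    src.char_indices()
        .map(|(i, ch)| {
            let kind = match ch {
                'C' | 'N' | 'O' | 'c' | 'n' | 'o' => TokenKind::Atom,
                '-' | '=' | '#' => TokenKind::Bond,
                '(' => TokenKind::OpenParen,
                ')' => TokenKind::CloseParen,
                _ => return Err(i),
            };
            Ok(Token { kind, span: Span { start: i, end: i + ch.len_utf8() } })
        })
        .collect()
}

/// Parser stage: recursive descent over
///   chain  := atom branch* (bond? chain)?
///   branch := '(' bond? chain ')'
struct MiniParser<'t> { tokens: &'t [Token], pos: usize }

impl<'t> MiniParser<'t> {
    fn peek(&self) -> Option<TokenKind> {
        self.tokens.get(self.pos).map(|t| t.kind)
    }

    /// Consumes a bond token if one is next and reports whether it did.
    fn eat_bond(&mut self) -> bool {
        let found = self.peek() == Some(TokenKind::Bond);
        if found {
            self.pos += 1;
        }
        found
    }

    /// Parses one chain; on failure returns the index of the offending token.
    fn chain(&mut self) -> Result<Node, usize> {
        let atom = match self.tokens.get(self.pos) {
            Some(t) if t.kind == TokenKind::Atom => t.span,
            _ => return Err(self.pos),
        };
        self.pos += 1;
        let mut branches = Vec::new();
        while self.peek() == Some(TokenKind::OpenParen) {
            self.pos += 1;
            self.eat_bond();
            branches.push(self.chain()?);
            if self.peek() != Some(TokenKind::CloseParen) {
                return Err(self.pos);
            }
            self.pos += 1;
        }
        let had_bond = self.eat_bond();
        let next = match self.peek() {
            Some(TokenKind::Atom) => Some(Box::new(self.chain()?)),
            _ if had_bond => return Err(self.pos), // a bond must be followed by an atom
            _ => None,
        };
        Ok(Node { atom, branches, next })
    }
}

/// Runs both stages and rejects trailing tokens.
fn parse(src: &str) -> Result<Node, String> {
    let tokens = lex(src).map_err(|i| format!("lexical error at byte {}", i))?;
    let mut parser = MiniParser { tokens: &tokens, pos: 0 };
    let root = parser.chain().map_err(|i| format!("parse error at token {}", i))?;
    if parser.pos != tokens.len() {
        return Err(format!("parse error at token {}", parser.pos));
    }
    Ok(root)
}

/// Walks the AST and counts atoms in the main chain and all branches.
fn count_atoms(node: &Node) -> usize {
    let in_branches: usize = node.branches.iter().map(count_atoms).sum();
    1 + in_branches + node.next.as_deref().map_or(0, count_atoms)
}

fn main() {
    let ast = parse("CC(C)C").expect("valid input");
    println!("atoms: {}", count_atoms(&ast));
    println!("{:?}", parse("CC(C"));
}
```

Each grammar rule maps to one method, which is the property that makes recursive descent easy to follow; the sketch also keeps lexical failures (byte offsets) distinct from parse failures (token indices).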

- lexer.rs: Tokenization with error recovery and span tracking
- parser.rs: Recursive descent parser with streaming multi-line support
- ast.rs: Abstract syntax tree definitions with comprehensive SMILES support
- visitor.rs: Visitor pattern implementation for AST traversal
- span.rs: Source location tracking with line/column information
- error.rs: Comprehensive error types (lexical vs parse errors)
- token.rs: Token definitions and utilities

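
For readers unfamiliar with the visitor pattern used for AST traversal, the following standalone sketch shows the usual Rust shape: a trait whose default methods delegate to free `walk_*` functions, so an implementor overrides only the nodes it cares about. The `Atom` tree, `walk_atom`, and `ElementCounter` here are hypothetical and much simpler than the crate's AST; they only illustrate the mechanism.

```rust
use std::collections::BTreeMap;

/// Toy AST for this sketch: an atom and the atoms bonded after it.
struct Atom {
    symbol: &'static str,
    children: Vec<Atom>,
}

trait Visitor {
    /// Default behaviour: just recurse. Override to act on each atom.
    fn visit_atom(&mut self, atom: &Atom) {
        walk_atom(self, atom);
    }
}

/// Traversal logic lives in a free function so overriding methods can reuse it.
fn walk_atom<V: Visitor + ?Sized>(visitor: &mut V, atom: &Atom) {
    for child in &atom.children {
        visitor.visit_atom(child);
    }
}

/// Counts how often each element symbol occurs.
#[derive(Default)]
struct ElementCounter {
    counts: BTreeMap<&'static str, usize>,
}

impl Visitor for ElementCounter {
    fn visit_atom(&mut self, atom: &Atom) {
        *self.counts.entry(atom.symbol).or_insert(0) += 1;
        // Continue the traversal; in this sketch, omitting the call would skip the children.
        walk_atom(self, atom);
    }
}

fn leaf(symbol: &'static str) -> Atom {
    Atom { symbol, children: Vec::new() }
}

fn main() {
    // Ethanol (CCO) as a chain C -> C -> O; implicit hydrogens are not modelled.
    let ethanol = Atom {
        symbol: "C",
        children: vec![Atom { symbol: "C", children: vec![leaf("O")] }],
    };
    let mut counter = ElementCounter::default();
    counter.visit_atom(&ethanol);
    println!("{:?}", counter.counts);
}
```

Splitting `visit_*` (what to do) from `walk_*` (where to go next) is a design choice that lets each visitor decide whether and when to descend.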
The parser provides detailed error information with source locations:
use smiles::parser::Parser;

fn main() {
    let mut parser = Parser::new("C1CCC"); // Unclosed ring
    match parser.parse_smiles() {
        Err(error) => {
            println!("Error: {}", error);
            // Resolve the span start once, then read line and column from it.
            let start = error.span.start_position(&parser.lexer.position_tracker());
            println!("Location: line {}, column {}", start.line, start.column);
        }
        Ok(_) => {}
    }
}

Errors reported by the lexer and parser include:
- Invalid characters
- Unclosed brackets
- Invalid number formats
- Unexpected tokens
- Unmatched ring bonds
- Invalid atom specifications
- Structural inconsistencies
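
As a rough illustration of how lexical and parse errors can share one reporting path while staying distinct, here is a standalone sketch. The `SmilesError` enum, `line_col`, and `report` are hypothetical names for this example and are not the crate's actual error types; the sketch assumes errors carry a byte offset into the source.

```rust
use std::fmt;

/// Hypothetical error type for this sketch; the crate's own types may differ.
#[derive(Debug, PartialEq)]
enum SmilesError {
    /// Raised while tokenizing, e.g. an invalid character.
    Lexical { offset: usize, found: char },
    /// Raised while parsing, e.g. a ring bond that is never closed.
    Parse { offset: usize, message: String },
}

impl SmilesError {
    fn offset(&self) -> usize {
        match self {
            SmilesError::Lexical { offset, .. } | SmilesError::Parse { offset, .. } => *offset,
        }
    }
}

impl fmt::Display for SmilesError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SmilesError::Lexical { found, .. } => write!(f, "invalid character '{}'", found),
            SmilesError::Parse { message, .. } => write!(f, "{}", message),
        }
    }
}

/// Converts a byte offset into a 1-based (line, column) pair, counting columns
/// in characters. Panics if `offset` is past the end or not on a char boundary.
fn line_col(source: &str, offset: usize) -> (usize, usize) {
    let before = &source[..offset];
    let line = before.matches('\n').count() + 1;
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    (line, column)
}

/// Formats an error together with its source location.
fn report(source: &str, error: &SmilesError) -> String {
    let (line, column) = line_col(source, error.offset());
    format!("{} at line {}, column {}", error, line, column)
}

fn main() {
    let source = "CCO\nC1CCC\nC?C";
    let errors = [
        SmilesError::Parse { offset: 5, message: "unclosed ring bond 1".to_string() },
        SmilesError::Lexical { offset: 11, found: '?' },
    ];
    for error in &errors {
        println!("{}", report(source, error));
    }
}
```

Storing only an offset in the error and resolving line and column on demand keeps the error values small, at the cost of rescanning the source when a message is printed.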
Run the included examples:
# Pretty-print parsed SMILES structures
cargo run --example pretty_printer
# Demonstrate streaming multi-line parsing
cargo run --example streaming_demo
# Calculate molecular formulas
cargo run --example molecular_formula

// Signatures only; method bodies are omitted.
impl<'src> Parser<'src> {
    /// Create a new parser for the given source
    pub fn new(source: &'src str) -> Self;

    /// Parse a single SMILES string
    pub fn parse_smiles(&mut self) -> ParseResult<Smiles>;

    /// Parse multiple SMILES with streaming approach
    pub fn parse_multiple_lines(&mut self) -> Vec<Result<Smiles, Vec<ParseError>>>;
}

The AST provides a complete representation of SMILES structure:
pub struct Smiles {
    pub chain: Chain,
    pub span: Span,
}

pub struct Chain {
    pub first_atom: BranchedAtom,
    pub links: Vec<Link>,
    pub span: Span,
}

pub struct BranchedAtom {
    pub atom: Atom,
    pub ring_bonds: Vec<RingBond>,
    pub branches: Vec<Branch>,
    pub span: Span,
}

pub trait Visitor {
    fn visit_smiles(&mut self, smiles: &Smiles) { walk_smiles(self, smiles); }
    fn visit_chain(&mut self, chain: &Chain) { walk_chain(self, chain); }
    fn visit_atom(&mut self, atom: &Atom) { walk_atom(self, atom); }
    // ... more visit methods
}

The streaming parser is designed for efficiency:
- Memory: no upfront line splitting; parser working state stays O(1) on top of the source text and the returned results
- Time: O(n) in input size
- Error Recovery: after a failed line, the parser skips ahead to the next line and continues
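
The memory and recovery properties above can be sketched with a plain cursor over the input. This is an independent illustration, not the crate's parser: `LineStream`, `LineError`, and `is_smiles_char` are hypothetical, and a simple character whitelist stands in for real SMILES parsing.

```rust
/// Hypothetical streaming reader: a single cursor over the whole input,
/// with no intermediate Vec of lines.
struct LineStream<'src> {
    source: &'src str,
    pos: usize,  // byte offset of the next unread line
    line: usize, // 1-based number of the next line
}

/// Line number plus byte offset (in the whole input) of the first bad character.
#[derive(Debug, PartialEq)]
struct LineError {
    line: usize,
    offset: usize,
}

/// Stand-in for real parsing: accepts characters that can occur in SMILES.
fn is_smiles_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || "()[]=#$:/\\@+-.%*".contains(c)
}

impl<'src> LineStream<'src> {
    fn new(source: &'src str) -> Self {
        Self { source, pos: 0, line: 1 }
    }
}

impl<'src> Iterator for LineStream<'src> {
    type Item = Result<&'src str, LineError>;

    fn next(&mut self) -> Option<Self::Item> {
        let source = self.source;
        loop {
            if self.pos >= source.len() {
                return None;
            }
            let rest = &source[self.pos..];
            let len = rest.find('\n').unwrap_or(rest.len());
            let (start, line) = (self.pos, self.line);
            // Advance past this line before validating it, so a failure
            // here cannot affect the lines that follow.
            self.pos += len + 1;
            self.line += 1;

            let text = rest[..len].trim_end();
            if text.is_empty() {
                continue; // skip blank lines
            }
            return Some(match text.char_indices().find(|&(_, c)| !is_smiles_char(c)) {
                Some((i, _)) => Err(LineError { line, offset: start + i }),
                None => Ok(text), // borrowed slice of the input, no copy
            });
        }
    }
}

fn main() {
    let input = "C\nCCO\nC?C\n\nc1ccccc1\n";
    for result in LineStream::new(input) {
        match result {
            Ok(smiles) => println!("ok: {}", smiles),
            Err(e) => println!("line {}: invalid character at byte {}", e.line, e.offset),
        }
    }
}
```

The iterator holds only two integers besides the borrowed source, so its own state does not grow with the number of lines; collecting the results into a `Vec` is what makes total memory proportional to the input.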
// Efficient: Single parser instance, streaming tokens
let mut parser = Parser::new(large_multiline_input);
let results = parser.parse_multiple_lines();
// vs. inefficient: Creating parser per line
// ❌ for line in input.lines() { Parser::new(line) }

Run the comprehensive test suite:
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
# Run specific test categories
cargo test parser
cargo test lexer
cargo test visitor

- 120+ unit tests covering all parser functionality
- Comprehensive error scenarios for robust error handling
- Performance tests for streaming multi-line parsing
- Grammar compliance tests for SMILES specification adherence

- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Add tests for your changes
- Ensure all tests pass (cargo test)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request

- Follow Rust naming conventions and idioms
- Add comprehensive tests for new functionality
- Include documentation for public APIs
- Ensure error messages are clear and actionable
- Maintain backward compatibility where possible
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenSMILES Specification for the grammar reference
- Daylight Chemical Information Systems for SMILES theory
- Rust community for excellent parsing libraries and patterns
- SMARTS Support: Extend to SMARTS (SMILES Arbitrary Target Specification)
- Canonicalization: Generate canonical SMILES representations
- Validation: Chemical validity checking (valence, aromaticity)
- Performance: SIMD optimizations for large-scale processing
- WebAssembly: Browser-compatible parsing
- Python Bindings: PyO3-based Python integration