msomierick/smiles

SMILES Parser

A streaming SMILES (Simplified Molecular Input Line Entry System) parser implemented in Rust. It performs lexical analysis and parsing of SMILES notation into an AST, reports lexical and parse errors separately, and tracks a source span for every AST node.

Features

🚀 Core Capabilities

  • Streaming Token-Based Parsing: Efficiently processes large multi-line inputs without upfront splitting
  • Complete SMILES Grammar Support: Handles all standard SMILES constructs including atoms, bonds, rings, branches, and stereochemistry
  • Advanced Error Recovery: A parse error in one SMILES string does not affect the parsing of subsequent ones
  • Accurate Span Tracking: Every AST node maintains precise source location information
  • Visitor Pattern: Easy AST traversal and analysis with built-in visitor support

🔧 Technical Features

  • Recursive Descent Parser: Clean, maintainable parser design following grammar structure
  • Semantic Validation: Ring bond matching, charge validation, isotope ranges, etc.
  • Memory Efficient: Single lexer/parser instance for multi-line processing
  • Zero-Copy String Handling: Source text references via spans, no unnecessary allocations
  • Comprehensive Error Types: Separate lexical and parse errors with detailed context
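
To make the span-based, zero-copy idea concrete, here is a minimal sketch that does not use this crate. The `Span` struct and the `bracket_atom_spans` helper are hypothetical stand-ins, and the crate's real `Span` may be laid out differently. The sketch only shows how a byte range can refer back to the source without allocating a new string.

```rust
/// Hypothetical span type: a half-open byte range into the source text.
/// (Assumption for illustration; not the crate's actual definition.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

impl Span {
    /// Borrow the text this span covers; no new String is allocated.
    pub fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.start..self.end]
    }
}

/// Return the span of every bracket atom such as `[NH4+]` in `source`.
/// Unclosed brackets are simply ignored in this sketch.
pub fn bracket_atom_spans(source: &str) -> Vec<Span> {
    let mut spans = Vec::new();
    let mut open = None;
    for (i, b) in source.bytes().enumerate() {
        match b {
            b'[' => open = Some(i),
            b']' => {
                if let Some(start) = open.take() {
                    spans.push(Span { start, end: i + 1 });
                }
            }
            _ => {}
        }
    }
    spans
}
```

For the input `C[NH4+]C[O-]`, the first span is `Span { start: 1, end: 7 }` and its `text` is the borrowed slice `[NH4+]`.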

Installation

Add this to your Cargo.toml:

[dependencies]
smiles = "0.1.0"

Quick Start

Parse a Single SMILES

use smiles::parser::Parser;

fn main() {
    let mut parser = Parser::new("CCO");
    match parser.parse_smiles() {
        Ok(smiles) => {
            // The main chain holds the first atom plus one entry per link.
            // To count every atom, including those in branches, see
            // "Using the Visitor Pattern".
            println!("Main-chain atoms: {}", smiles.chain.links.len() + 1);
        }
        Err(error) => {
            eprintln!("Parse error: {}", error);
        }
    }
}

Parse Multiple SMILES (Streaming)

use smiles::parser::Parser;

fn main() {
    let multi_smiles = r#"
        C
        CCO
        c1ccccc1
        Invalid@SMILES
        CC(C)C
    "#;

    let mut parser = Parser::new(multi_smiles);
    let results = parser.parse_multiple_lines();

    for (i, result) in results.iter().enumerate() {
        match result {
            Ok(_) => println!("SMILES {}: ✓ Parsed successfully", i + 1),
            Err(errors) => {
                println!("SMILES {}: ✗ Parse errors:", i + 1);
                for error in errors {
                    println!("  - {}", error);
                }
            }
        }
    }
}

Iterator-Based Multi-Line Parsing

For processing large files or streaming data, use the iterator-based API for memory-efficient line-by-line parsing:

use smiles::parser::Parser;

fn main() {
    let multi_smiles = r#"
        C
        CCO
        c1ccccc1
        Invalid@SMILES
        CC(C)C
    "#;

    let mut parser = Parser::new(multi_smiles);
    
    // Iterator approach - memory efficient
    for (line_num, result) in parser.parse_lines().enumerate() {
        match result {
            Ok(_) => println!("Line {}: ✓ Valid SMILES", line_num + 1),
            Err(error) => println!("Line {}: ✗ {}", error.line_number, error.error_type),
        }
    }
    
    // Collect only valid SMILES
    let mut parser2 = Parser::new(multi_smiles);
    let valid_smiles: Vec<_> = parser2.parse_lines()
        .filter_map(|result| result.ok())
        .collect();
    
    println!("Found {} valid SMILES", valid_smiles.len());
    
    // Convenience method for bulk processing
    let mut parser3 = Parser::new(multi_smiles);
    let all_results = parser3.parse_all_lines();
    let (valid, errors): (Vec<_>, Vec<_>) = all_results.into_iter()
        .partition(|result| result.is_ok());
    
    println!("Processed: {} valid, {} errors", valid.len(), errors.len());
}
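
The lazy, per-line behaviour described above can be sketched with the standard library alone. Everything in this block is a hypothetical stand-in: `LineError`, `check_line`, and `check_lines` are not part of this crate, and the "parser" is only a toy check that parentheses balance. The point is the shape of the iterator: nothing is processed until it is advanced, and an error on one line does not stop later lines.

```rust
/// Hypothetical per-line error; the crate's real error type is not shown here.
#[derive(Debug, PartialEq)]
pub struct LineError {
    pub line_number: usize,
    pub message: String,
}

/// Toy validity check standing in for a real parser: parentheses must balance.
fn check_line(line: &str) -> Result<(), String> {
    let mut depth = 0usize;
    for c in line.chars() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth = depth
                    .checked_sub(1)
                    .ok_or_else(|| "unmatched ')'".to_string())?;
            }
            _ => {}
        }
    }
    if depth == 0 {
        Ok(())
    } else {
        Err("unclosed '('".to_string())
    }
}

/// Lazily yield one result per non-blank line, with one-based line numbers.
pub fn check_lines<'a>(
    source: &'a str,
) -> impl Iterator<Item = Result<&'a str, LineError>> + 'a {
    source
        .lines()
        .enumerate()
        .map(|(i, line)| (i + 1, line.trim()))
        .filter(|(_, line)| !line.is_empty())
        .map(|(line_number, line)| {
            check_line(line)
                .map(|()| line)
                .map_err(|message| LineError { line_number, message })
        })
}
```

Because the result is an ordinary iterator, adaptors such as `filter_map(Result::ok)` compose with it in the same way as in the example above.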

Using the Visitor Pattern

use smiles::{parser::Parser, visitor::*, ast::*};

struct AtomCounter {
    count: usize,
}

impl AtomCounter {
    fn new() -> Self {
        Self { count: 0 }
    }
}

impl Visitor for AtomCounter {
    fn visit_atom(&mut self, _atom: &Atom) {
        self.count += 1;
    }
}

fn main() {
    let mut parser = Parser::new("c1ccccc1");
    if let Ok(smiles) = parser.parse_smiles() {
        let mut counter = AtomCounter::new();
        counter.visit_smiles(&smiles);
        println!("Benzene has {} atoms", counter.count); // Prints: Benzene has 6 atoms
    }
}

Architecture

Core Components

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Lexer     │───▶│   Parser    │───▶│    AST      │
│             │    │             │    │             │
│ • Tokenize  │    │ • Recursive │    │ • Visitor   │
│ • Errors    │    │   Descent   │    │ • Spans     │
│ • Spans     │    │ • Streaming │    │ • Types     │
└─────────────┘    └─────────────┘    └─────────────┘

Module Structure

  • lexer.rs: Tokenization with error recovery and span tracking
  • parser.rs: Recursive descent parser with streaming multi-line support
  • ast.rs: Abstract syntax tree definitions with comprehensive SMILES support
  • visitor.rs: Visitor pattern implementation for AST traversal
  • span.rs: Source location tracking with line/column information
  • error.rs: Comprehensive error types (lexical vs parse errors)
  • token.rs: Token definitions and utilities
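
As a rough illustration of what "recursive descent following grammar structure" means, the sketch below parses a deliberately tiny grammar with one method per rule. It is not this crate's parser: `ToyParser` and its `ParseError` are hypothetical names, the grammar covers only single-letter atoms and branches, and it returns an atom count instead of an AST.

```rust
/// Hypothetical parse error for this sketch.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    ExpectedAtom { offset: usize },
    UnclosedBranch { offset: usize },
    UnexpectedChar { offset: usize },
}

/// Toy grammar (an assumption, much smaller than SMILES):
///   chain  := unit+
///   unit   := atom branch*
///   branch := '(' chain ')'
///   atom   := one of B C N O P S F I
pub struct ToyParser<'src> {
    src: &'src [u8],
    pos: usize,
}

impl<'src> ToyParser<'src> {
    pub fn new(source: &'src str) -> Self {
        Self { src: source.as_bytes(), pos: 0 }
    }

    fn peek(&self) -> Option<u8> {
        self.src.get(self.pos).copied()
    }

    /// Parse the whole input and return the number of atoms found.
    pub fn parse(&mut self) -> Result<usize, ParseError> {
        let count = self.chain()?;
        match self.peek() {
            None => Ok(count),
            Some(_) => Err(ParseError::UnexpectedChar { offset: self.pos }),
        }
    }

    fn chain(&mut self) -> Result<usize, ParseError> {
        let mut count = self.unit()?;
        while matches!(self.peek(), Some(b) if b.is_ascii_uppercase()) {
            count += self.unit()?;
        }
        Ok(count)
    }

    fn unit(&mut self) -> Result<usize, ParseError> {
        let mut count = self.atom()?;
        while self.peek() == Some(b'(') {
            count += self.branch()?;
        }
        Ok(count)
    }

    fn branch(&mut self) -> Result<usize, ParseError> {
        let open = self.pos;
        self.pos += 1; // consume '('
        let count = self.chain()?;
        if self.peek() != Some(b')') {
            return Err(ParseError::UnclosedBranch { offset: open });
        }
        self.pos += 1; // consume ')'
        Ok(count)
    }

    fn atom(&mut self) -> Result<usize, ParseError> {
        match self.peek() {
            Some(b) if b"BCNOPSFI".contains(&b) => {
                self.pos += 1;
                Ok(1)
            }
            _ => Err(ParseError::ExpectedAtom { offset: self.pos }),
        }
    }
}
```

With this sketch, `ToyParser::new("CC(C)C").parse()` returns `Ok(4)`, and `ToyParser::new("CC(C").parse()` returns `Err(ParseError::UnclosedBranch { offset: 2 })`, the offset of the opening parenthesis.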

Error Handling

The parser provides detailed error information with source locations:

use smiles::parser::Parser;

fn main() {
    let mut parser = Parser::new("C1CCC"); // Ring bond 1 is opened but never closed
    if let Err(error) = parser.parse_smiles() {
        println!("Error: {}", error);
        // Resolve the span's start position once and reuse it for both fields.
        let tracker = parser.lexer.position_tracker();
        let start = error.span.start_position(&tracker);
        println!("Location: line {}, column {}", start.line, start.column);
    }
}
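
One common way to turn a span's byte offset into a line and column is a table of line-start offsets searched with binary search. The sketch below shows that technique using only the standard library. `LineIndex` and `Position` are hypothetical names, and nothing here claims to match how this crate's position tracker is implemented.

```rust
/// Hypothetical position tracker: records the byte offset where each line starts.
pub struct LineIndex {
    line_starts: Vec<usize>,
}

/// One-based line and column; the column is counted in bytes.
#[derive(Debug, PartialEq, Eq)]
pub struct Position {
    pub line: usize,
    pub column: usize,
}

impl LineIndex {
    /// Scan the source once and remember where every line begins.
    pub fn new(source: &str) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in source.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        Self { line_starts }
    }

    /// Binary-search the line table for the line containing `offset`.
    pub fn position(&self, offset: usize) -> Position {
        // `partition_point` returns how many line starts are <= offset,
        // which is exactly the one-based line number.
        let line = self.line_starts.partition_point(|&start| start <= offset);
        let line_start = self.line_starts[line - 1];
        Position { line, column: offset - line_start + 1 }
    }
}
```

For the source `"CCO\nC1CCC\n"`, offset 5 (the ring digit on the second line) maps to line 2, column 2.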

Error Types

Lexical Errors

  • Invalid characters
  • Unclosed brackets
  • Invalid number formats

Parse Errors

  • Unexpected tokens
  • Unmatched ring bonds
  • Invalid atom specifications
  • Structural inconsistencies
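
Of the parse errors above, unmatched ring bonds are the easiest to show in isolation. The following sketch pairs ring-closure digits with a map of currently open labels. `RingError` and `match_ring_bonds` are hypothetical names, only single-digit labels are handled (a `%nn` label would be misread as two separate digits), and digits inside bracket atoms are skipped because there they denote isotopes, hydrogen counts, or charges.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum RingError {
    /// A ring label was opened but never closed; `offset` is where it opened.
    Unclosed { label: u8, offset: usize },
}

/// Pair up single-digit ring-closure labels that appear outside bracket atoms.
/// Returns (open_offset, close_offset) pairs in the order the rings close.
pub fn match_ring_bonds(src: &str) -> Result<Vec<(usize, usize)>, RingError> {
    let mut open: HashMap<u8, usize> = HashMap::new();
    let mut pairs = Vec::new();
    let mut in_bracket = false;
    for (i, b) in src.bytes().enumerate() {
        match b {
            b'[' => in_bracket = true,
            b']' => in_bracket = false,
            b'0'..=b'9' if !in_bracket => {
                let label = b - b'0';
                // A second occurrence of a label closes the ring it opened,
                // after which the label may be reused.
                match open.remove(&label) {
                    Some(start) => pairs.push((start, i)),
                    None => {
                        open.insert(label, i);
                    }
                }
            }
            _ => {}
        }
    }
    // Any label still open is an error; report the earliest one.
    match open.into_iter().min_by_key(|&(_, offset)| offset) {
        Some((label, offset)) => Err(RingError::Unclosed { label, offset }),
        None => Ok(pairs),
    }
}
```

For benzene, `match_ring_bonds("c1ccccc1")` returns `Ok(vec![(1, 7)])`, while `match_ring_bonds("C1CCC")` returns `Err(RingError::Unclosed { label: 1, offset: 1 })`.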

Examples

Run the included examples:

# Pretty-print parsed SMILES structures
cargo run --example pretty_printer

# Demonstrate streaming multi-line parsing
cargo run --example streaming_demo

# Calculate molecular formulas
cargo run --example molecular_formula

API Documentation

Parser

impl<'src> Parser<'src> {
    /// Create a new parser for the given source
    pub fn new(source: &'src str) -> Self;
    
    /// Parse a single SMILES string
    pub fn parse_smiles(&mut self) -> ParseResult<Smiles>;
    
    /// Parse multiple SMILES with streaming approach
    pub fn parse_multiple_lines(&mut self) -> Vec<Result<Smiles, Vec<ParseError>>>;
}

AST Types

The AST provides a complete representation of SMILES structure:

pub struct Smiles {
    pub chain: Chain,
    pub span: Span,
}

pub struct Chain {
    pub first_atom: BranchedAtom,
    pub links: Vec<Link>,
    pub span: Span,
}

pub struct BranchedAtom {
    pub atom: Atom,
    pub ring_bonds: Vec<RingBond>,
    pub branches: Vec<Branch>,
    pub span: Span,
}

Visitor Pattern

pub trait Visitor {
    fn visit_smiles(&mut self, smiles: &Smiles) { walk_smiles(self, smiles); }
    fn visit_chain(&mut self, chain: &Chain) { walk_chain(self, chain); }
    fn visit_atom(&mut self, atom: &Atom) { walk_atom(self, atom); }
    // ... more visit methods
}
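
The relationship between the `visit_*` defaults and the `walk_*` functions can be seen in a self-contained sketch. The AST below is a simplified stand-in: the field layouts are assumptions (branches and links are flattened, ring bonds and spans are omitted), and `SymbolCollector` and `sample` are hypothetical. Only the delegation pattern mirrors the trait shown above.

```rust
// Simplified stand-in AST; these layouts are assumed for illustration only.
pub struct Atom {
    pub symbol: String,
}
pub struct BranchedAtom {
    pub atom: Atom,
    pub branches: Vec<Chain>,
}
pub struct Chain {
    pub first_atom: BranchedAtom,
    pub links: Vec<BranchedAtom>,
}

pub trait Visitor {
    // Each default delegates to a `walk_*` function that recurses into children.
    fn visit_chain(&mut self, chain: &Chain) { walk_chain(self, chain); }
    fn visit_branched_atom(&mut self, node: &BranchedAtom) { walk_branched_atom(self, node); }
    fn visit_atom(&mut self, _atom: &Atom) {}
}

pub fn walk_chain<V: Visitor + ?Sized>(v: &mut V, chain: &Chain) {
    v.visit_branched_atom(&chain.first_atom);
    for link in &chain.links {
        v.visit_branched_atom(link);
    }
}

pub fn walk_branched_atom<V: Visitor + ?Sized>(v: &mut V, node: &BranchedAtom) {
    v.visit_atom(&node.atom);
    for branch in &node.branches {
        v.visit_chain(branch);
    }
}

/// Overrides only `visit_atom`; the traversal order comes from the defaults.
pub struct SymbolCollector {
    pub symbols: Vec<String>,
}

impl Visitor for SymbolCollector {
    fn visit_atom(&mut self, atom: &Atom) {
        self.symbols.push(atom.symbol.clone());
    }
}

fn atom(symbol: &str, branches: Vec<Chain>) -> BranchedAtom {
    BranchedAtom { atom: Atom { symbol: symbol.to_string() }, branches }
}

/// Hand-built tree for `CC(O)N`, since this sketch has no parser.
pub fn sample() -> Chain {
    let branch = Chain { first_atom: atom("O", vec![]), links: vec![] };
    Chain {
        first_atom: atom("C", vec![]),
        links: vec![atom("C", vec![branch]), atom("N", vec![])],
    }
}
```

Calling `visit_chain(&sample())` on a fresh `SymbolCollector` leaves `symbols` as `["C", "C", "O", "N"]`: the branch atom is visited before the traversal returns to the main chain. A visitor that overrides `visit_chain` or `visit_branched_atom` has to call the matching `walk_*` function itself if it still wants the children visited.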

Performance

Benchmarks

The streaming parser is designed for efficiency:

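  • Memory: No upfront line splitting. With parse_lines(), results are produced one line at a time; parse_multiple_lines() collects every result into a Vec, so its output grows with the number of lines
  • Time: O(n) in the input size
  • Error Recovery: After an error, the parser skips ahead to the next line and continues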

Multi-line Processing

// Efficient: Single parser instance, streaming tokens
let mut parser = Parser::new(large_multiline_input);
let results = parser.parse_multiple_lines();

// vs. inefficient: Creating parser per line
// ❌ for line in input.lines() { Parser::new(line) }

Testing

Run the comprehensive test suite:

# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test categories
cargo test parser
cargo test lexer
cargo test visitor

Test Coverage

  • 120+ unit tests covering all parser functionality
  • Comprehensive error scenarios for robust error handling
  • Performance tests for streaming multi-line parsing
  • Grammar compliance tests for SMILES specification adherence

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Add tests for your changes
  4. Ensure all tests pass (cargo test)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Guidelines

  • Follow Rust naming conventions and idioms
  • Add comprehensive tests for new functionality
  • Include documentation for public APIs
  • Ensure error messages are clear and actionable
  • Maintain backward compatibility where possible

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

  • SMARTS Support: Extend to SMARTS (SMILES Arbitrary Target Specification)
  • Canonicalization: Generate canonical SMILES representations
  • Validation: Chemical validity checking (valence, aromaticity)
  • Performance: SIMD optimizations for large-scale processing
  • WebAssembly: Browser-compatible parsing
  • Python Bindings: PyO3-based Python integration
