SMILES Parser

A robust, streaming SMILES (Simplified Molecular Input Line Entry System) parser implemented in Rust. This parser provides comprehensive lexical analysis, parsing, and AST generation for SMILES notation with advanced error handling and span tracking.

Features

🚀 Core Capabilities

Streaming Token-Based Parsing: Efficiently processes large multi-line inputs without upfront splitting
Complete SMILES Grammar Support: Handles all standard SMILES constructs including atoms, bonds, rings, branches, and stereochemistry
Advanced Error Recovery: A parse error in one SMILES string does not affect parsing of the strings that follow it
Accurate Span Tracking: Every AST node maintains precise source location information
Visitor Pattern: Easy AST traversal and analysis with built-in visitor support

🔧 Technical Features

Recursive Descent Parser: Clean, maintainable parser design following grammar structure
Semantic Validation: Ring bond matching, charge validation, isotope ranges, etc.
Memory Efficient: Single lexer/parser instance for multi-line processing
Zero-Copy String Handling: Source text references via spans, no unnecessary allocations
Comprehensive Error Types: Separate lexical and parse errors with detailed context
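To illustrate the ring bond matching mentioned under semantic validation, here is a minimal, standalone sketch. It does not use this crate's API; the function names `unmatched_ring_bonds` and `toggle` are hypothetical, and the logic is deliberately simplified (digits inside bracket atoms are skipped, and `%nn` is treated as a two-digit ring number).

```rust
use std::collections::BTreeSet;

/// Returns the ring-bond numbers that were opened but never closed.
/// Simplified sketch, not this crate's implementation.
fn unmatched_ring_bonds(smiles: &str) -> Vec<u32> {
    let mut open: BTreeSet<u32> = BTreeSet::new();
    let mut in_bracket = false;
    let mut chars = smiles.chars();
    while let Some(c) = chars.next() {
        match c {
            '[' => in_bracket = true,
            ']' => in_bracket = false,
            // Digits inside [...] are isotopes, H counts, or charges.
            _ if in_bracket => {}
            '%' => {
                // `%nn` denotes a two-digit ring number.
                let tens = chars.next().and_then(|d| d.to_digit(10));
                let ones = chars.next().and_then(|d| d.to_digit(10));
                if let (Some(t), Some(o)) = (tens, ones) {
                    toggle(&mut open, t * 10 + o);
                }
            }
            d if d.is_ascii_digit() => toggle(&mut open, d.to_digit(10).unwrap()),
            _ => {}
        }
    }
    open.into_iter().collect()
}

/// First occurrence opens a ring bond, second occurrence closes it.
fn toggle(open: &mut BTreeSet<u32>, n: u32) {
    if !open.remove(&n) {
        open.insert(n);
    }
}
```

An empty result means every ring bond was closed; for the unclosed-ring input `C1CCC` used later in this README, the sketch reports ring number 1.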

Installation

Add this to your Cargo.toml:

[dependencies]
smiles = "0.1.0"

Quick Start

Parse a Single SMILES

use smiles::parser::Parser;

fn main() {
    let mut parser = Parser::new("CCO");
    match parser.parse_smiles() {
        Ok(smiles) => {
            // `links` holds the chain elements that follow the first atom.
            println!("Successfully parsed; main chain has {} link(s)", smiles.chain.links.len());
            // To count atoms, see the AtomCounter in "Using the Visitor Pattern".
        }
        Err(error) => {
            println!("Parse error: {}", error);
        }
    }
}

Parse Multiple SMILES (Streaming)

use smiles::parser::Parser;

fn main() {
    let multi_smiles = r#"
        C
        CCO
        c1ccccc1
        Invalid@SMILES
        CC(C)C
    "#;

    let mut parser = Parser::new(multi_smiles);
    let results = parser.parse_multiple_lines();

    for (i, result) in results.iter().enumerate() {
        match result {
            Ok(_smiles) => println!("SMILES {}: ✓ Parsed successfully", i + 1),
            Err(errors) => {
                println!("SMILES {}: ✗ Parse errors:", i + 1);
                for error in errors {
                    println!("  - {}", error);
                }
            }
        }
    }
}

Iterator-Based Multi-Line Parsing

For processing large files or streaming data, use the iterator-based API for memory-efficient line-by-line parsing:

use smiles::parser::Parser;

fn main() {
    let multi_smiles = r#"
        C
        CCO
        c1ccccc1
        Invalid@SMILES
        CC(C)C
    "#;

    let mut parser = Parser::new(multi_smiles);
    
    // Iterator approach - memory efficient
    for (line_num, result) in parser.parse_lines().enumerate() {
        match result {
            Ok(_smiles) => println!("Line {}: ✓ Valid SMILES", line_num + 1),
            Err(error) => println!("Line {}: ✗ {}", error.line_number, error.error_type),
        }
    }
    
    // Collect only valid SMILES
    let mut parser2 = Parser::new(multi_smiles);
    let valid_smiles: Vec<_> = parser2.parse_lines()
        .filter_map(|result| result.ok())
        .collect();
    
    println!("Found {} valid SMILES", valid_smiles.len());
    
    // Convenience method for bulk processing
    let mut parser3 = Parser::new(multi_smiles);
    let all_results = parser3.parse_all_lines();
    let (valid, errors): (Vec<_>, Vec<_>) = all_results.into_iter()
        .partition(|result| result.is_ok());
    
    println!("Processed: {} valid, {} errors", valid.len(), errors.len());
}
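Note that `partition` keeps every element wrapped in `Result`, so both vectors above still hold `Result` values. If you want the unwrapped values and errors instead, a small generic helper works. This is a sketch using only the standard library; `split_results` is a hypothetical name and not part of this crate.

```rust
/// Splits a list of results into unwrapped successes and unwrapped errors,
/// preserving the original order within each group.
fn split_results<T, E>(results: Vec<Result<T, E>>) -> (Vec<T>, Vec<E>) {
    let mut values = Vec::new();
    let mut errors = Vec::new();
    for result in results {
        match result {
            Ok(value) => values.push(value),
            Err(error) => errors.push(error),
        }
    }
    (values, errors)
}
```

Because the helper is generic over `T` and `E`, it should work with whatever item type the line-parsing methods return.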

Using the Visitor Pattern

use smiles::{parser::Parser, visitor::*, ast::*};

struct AtomCounter {
    count: usize,
}

impl AtomCounter {
    fn new() -> Self {
        Self { count: 0 }
    }
}

impl Visitor for AtomCounter {
    fn visit_atom(&mut self, _atom: &Atom) {
        self.count += 1;
    }
}

fn main() {
    let mut parser = Parser::new("c1ccccc1");
    if let Ok(smiles) = parser.parse_smiles() {
        let mut counter = AtomCounter::new();
        counter.visit_smiles(&smiles);
        println!("Benzene has {} atoms", counter.count); // Output: 6
    }
}

Architecture

Core Components

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Lexer     │───▶│   Parser    │───▶│    AST      │
│             │    │             │    │             │
│ • Tokenize  │    │ • Recursive │    │ • Visitor   │
│ • Errors    │    │   Descent   │    │ • Spans     │
│ • Spans     │    │ • Streaming │    │ • Types     │
└─────────────┘    └─────────────┘    └─────────────┘

Module Structure

lexer.rs: Tokenization with error recovery and span tracking
parser.rs: Recursive descent parser with streaming multi-line support
ast.rs: Abstract syntax tree definitions with comprehensive SMILES support
visitor.rs: Visitor pattern implementation for AST traversal
span.rs: Source location tracking with line/column information
error.rs: Comprehensive error types (lexical vs parse errors)
token.rs: Token definitions and utilities
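As a rough illustration of what tokenization with span tracking and error recovery can look like, here is a standalone sketch. The `TokenKind`, `Token`, and `tokenize` names are hypothetical and much simpler than the real lexer.rs and token.rs: only a subset of unbracketed atoms, a few bond symbols, ring digits, and branch parentheses are recognized.

```rust
#[derive(Debug, PartialEq)]
enum TokenKind {
    Atom,
    Bond,
    RingDigit,
    BranchOpen,
    BranchClose,
    /// Unrecognized input; the lexer records it and keeps going.
    Unknown,
}

/// A token plus the byte range it covers in the source.
#[derive(Debug, PartialEq)]
struct Token {
    kind: TokenKind,
    start: usize,
    end: usize,
}

fn tokenize(source: &str) -> Vec<Token> {
    let bytes = source.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let b = bytes[i];
        let kind = match b {
            b'(' => TokenKind::BranchOpen,
            b')' => TokenKind::BranchClose,
            b'-' | b'=' | b'#' | b':' => TokenKind::Bond,
            b'0'..=b'9' => TokenKind::RingDigit,
            // Two-letter symbols consume one extra byte.
            b'C' if bytes.get(i + 1) == Some(&b'l') => {
                i += 1;
                TokenKind::Atom
            }
            b'B' if bytes.get(i + 1) == Some(&b'r') => {
                i += 1;
                TokenKind::Atom
            }
            b'B' | b'C' | b'N' | b'O' | b'P' | b'S' | b'F' | b'I' => TokenKind::Atom,
            b'b' | b'c' | b'n' | b'o' | b'p' | b's' => TokenKind::Atom,
            _ => TokenKind::Unknown,
        };
        i += 1;
        tokens.push(Token { kind, start, end: i });
    }
    tokens
}
```

Emitting an `Unknown` token instead of aborting is one simple way to let later stages report the problem with an exact span while the rest of the input is still tokenized. The sketch works byte by byte, so it assumes ASCII input.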

Error Handling

The parser provides detailed error information with source locations:

use smiles::parser::Parser;

let mut parser = Parser::new("C1CCC"); // Unclosed ring
if let Err(error) = parser.parse_smiles() {
    println!("Error: {}", error);
    // Resolve the start position once instead of once per field.
    let tracker = parser.lexer.position_tracker();
    let start = error.span.start_position(&tracker);
    println!("Location: line {}, column {}", start.line, start.column);
}
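For readers curious how a span can be turned into a line and column, the following standalone sketch shows one way to do it. The `Span`, `Position`, and `position_of` definitions here are hypothetical stand-ins, not the crate's own types, and the sketch assumes ASCII input so that byte offsets always fall on character boundaries.

```rust
/// Hypothetical byte-offset span; the crate's own `Span` may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

#[derive(Debug, PartialEq)]
struct Position {
    line: usize,
    column: usize,
}

impl Span {
    /// Borrows the covered text from the source without allocating.
    fn text<'src>(&self, source: &'src str) -> &'src str {
        &source[self.start..self.end]
    }
}

/// Computes the 1-based line and column of a byte offset.
fn position_of(source: &str, offset: usize) -> Position {
    let before = &source[..offset];
    // Each newline before the offset starts a new line.
    let line = before.matches('\n').count() + 1;
    // The current line begins right after the last newline, if any.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    Position { line, column }
}
```

This version rescans the source on every lookup. A tracker that precomputes line start offsets would avoid that, which may be what a position tracker does, though that is an assumption.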

Error Types

Lexical Errors

Invalid characters
Unclosed brackets
Invalid number formats

Parse Errors

Unexpected tokens
Unmatched ring bonds
Invalid atom specifications
Structural inconsistencies
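The split between lexical and parse errors can be sketched with a single enum. Everything below is a hypothetical illustration (`SmilesError`, `check`, and the chosen character set are assumptions), not the types defined in error.rs: a lexical error is raised for a character that can never appear in SMILES, a parse error for valid characters in an invalid arrangement.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
enum SmilesError {
    /// The input contains something that is not a valid token.
    Lexical { offset: usize, message: String },
    /// The tokens are valid but do not form a valid structure.
    Parse { offset: usize, message: String },
}

impl fmt::Display for SmilesError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SmilesError::Lexical { offset, message } => {
                write!(f, "lexical error at byte {}: {}", offset, message)
            }
            SmilesError::Parse { offset, message } => {
                write!(f, "parse error at byte {}: {}", offset, message)
            }
        }
    }
}

impl std::error::Error for SmilesError {}

/// Toy checker: rejects unknown characters, then unbalanced branches.
fn check(smiles: &str) -> Result<(), SmilesError> {
    for (offset, c) in smiles.char_indices() {
        let allowed = c.is_ascii_alphanumeric() || "[]()=#$:/\\-+@.%*".contains(c);
        if !allowed {
            let message = format!("invalid character '{}'", c);
            return Err(SmilesError::Lexical { offset, message });
        }
    }
    let mut open_branches = Vec::new();
    for (offset, c) in smiles.char_indices() {
        match c {
            '(' => open_branches.push(offset),
            ')' => {
                if open_branches.pop().is_none() {
                    let message = "unexpected ')'".to_string();
                    return Err(SmilesError::Parse { offset, message });
                }
            }
            _ => {}
        }
    }
    if let Some(offset) = open_branches.pop() {
        let message = "unclosed branch '('".to_string();
        return Err(SmilesError::Parse { offset, message });
    }
    Ok(())
}
```

Keeping the two categories as separate variants lets callers match on the kind of failure, for example to treat an invalid character differently from a structurally incomplete string.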

Examples

Run the included examples:

# Pretty-print parsed SMILES structures
cargo run --example pretty_printer

# Demonstrate streaming multi-line parsing
cargo run --example streaming_demo

# Calculate molecular formulas
cargo run --example molecular_formula

API Documentation

Parser

impl<'src> Parser<'src> {
    /// Create a new parser for the given source
    pub fn new(source: &'src str) -> Self;
    
    /// Parse a single SMILES string
    pub fn parse_smiles(&mut self) -> ParseResult<Smiles>;
    
    /// Parse multiple SMILES with streaming approach
    pub fn parse_multiple_lines(&mut self) -> Vec<Result<Smiles, Vec<ParseError>>>;
}

AST Types

The AST provides a complete representation of SMILES structure:

pub struct Smiles {
    pub chain: Chain,
    pub span: Span,
}

pub struct Chain {
    pub first_atom: BranchedAtom,
    pub links: Vec<Link>,
    pub span: Span,
}

pub struct BranchedAtom {
    pub atom: Atom,
    pub ring_bonds: Vec<RingBond>,
    pub branches: Vec<Branch>,
    pub span: Span,
}

Visitor Pattern

pub trait Visitor {
    fn visit_smiles(&mut self, smiles: &Smiles) { walk_smiles(self, smiles); }
    fn visit_chain(&mut self, chain: &Chain) { walk_chain(self, chain); }
    fn visit_atom(&mut self, atom: &Atom) { walk_atom(self, atom); }
    // ... more visit methods
}
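The default methods above delegate to free `walk_*` functions, which is what lets an implementor override one method and still get full traversal. The standalone sketch below shows that mechanism on a deliberately tiny, hypothetical AST (`Atom` with a `symbol`, `Chain` with `atoms` and `branches`); the crate's real AST types and walk functions are richer.

```rust
struct Atom {
    symbol: String,
}

struct Chain {
    atoms: Vec<Atom>,
    branches: Vec<Chain>,
}

trait Visitor: Sized {
    /// Default: keep traversing into the chain's children.
    fn visit_chain(&mut self, chain: &Chain) {
        walk_chain(self, chain);
    }
    /// Default: atoms are leaves in this toy AST, so do nothing.
    fn visit_atom(&mut self, _atom: &Atom) {}
}

/// Visits the chain's atoms first, then each branch recursively.
fn walk_chain<V: Visitor>(visitor: &mut V, chain: &Chain) {
    for atom in &chain.atoms {
        visitor.visit_atom(atom);
    }
    for branch in &chain.branches {
        visitor.visit_chain(branch);
    }
}

/// Overrides only `visit_atom`; traversal comes from the defaults.
struct SymbolCollector {
    symbols: Vec<String>,
}

impl Visitor for SymbolCollector {
    fn visit_atom(&mut self, atom: &Atom) {
        self.symbols.push(atom.symbol.clone());
    }
}

/// Small constructor helper for building test trees.
fn atom(symbol: &str) -> Atom {
    Atom { symbol: symbol.to_string() }
}
```

If an override of `visit_chain` still needs the children visited, it can call `walk_chain` itself before or after its own logic.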

Performance

Benchmarks

The streaming parser is designed for efficiency:

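Memory: No upfront line splitting; the parser's own working state does not grow with the number of lines (collected results, such as the Vec returned by parse_multiple_lines, still scale with the input)
Time: Linear O(n) in input size
Error Recovery: After an error, the parser skips ahead to the next line and continues from there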

Multi-line Processing

// Efficient: Single parser instance, streaming tokens
let mut parser = Parser::new(large_multiline_input);
let results = parser.parse_multiple_lines();

// vs. inefficient: Creating parser per line
// ❌ for line in input.lines() { Parser::new(line) }
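The idea of walking one buffer with a cursor and recovering at line boundaries can be sketched without this crate. In the standalone example below, `LineCursor` and `validate` are hypothetical names, and `validate` is only a stand-in for a real parser (it checks that parentheses balance). The point is the control flow: the cursor advances past the current line before the result is returned, so a failure on one line cannot affect the next.

```rust
/// Lazily yields one result per non-empty line of `source`.
struct LineCursor<'src> {
    source: &'src str,
    pos: usize,
}

impl<'src> LineCursor<'src> {
    fn new(source: &'src str) -> Self {
        Self { source, pos: 0 }
    }
}

impl<'src> Iterator for LineCursor<'src> {
    type Item = Result<&'src str, String>;

    fn next(&mut self) -> Option<Self::Item> {
        let source: &'src str = self.source;
        loop {
            if self.pos >= source.len() {
                return None;
            }
            let rest = &source[self.pos..];
            let end = rest.find('\n').unwrap_or(rest.len());
            let line = rest[..end].trim();
            // Move past the newline before reporting anything, so an
            // error on this line leaves the cursor at the next line.
            self.pos += end + 1;
            if line.is_empty() {
                continue;
            }
            return Some(validate(line));
        }
    }
}

/// Stand-in for real parsing: accepts lines with balanced parentheses.
fn validate(line: &str) -> Result<&str, String> {
    let mut depth = 0i32;
    for c in line.chars() {
        match c {
            '(' => depth += 1,
            ')' => depth -= 1,
            _ => {}
        }
        if depth < 0 {
            return Err(format!("unexpected ')' in {:?}", line));
        }
    }
    if depth == 0 {
        Ok(line)
    } else {
        Err(format!("unclosed '(' in {:?}", line))
    }
}
```

Since the iterator borrows slices of the original buffer and yields results one at a time, the caller decides whether to collect them or process and drop each one.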

Testing

Run the comprehensive test suite:

# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test categories
cargo test parser
cargo test lexer
cargo test visitor

Test Coverage

120+ unit tests covering all parser functionality
Comprehensive error scenarios for robust error handling
Performance tests for streaming multi-line parsing
Grammar compliance tests for SMILES specification adherence

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Add tests for your changes
Ensure all tests pass (cargo test)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Development Guidelines

Follow Rust naming conventions and idioms
Add comprehensive tests for new functionality
Include documentation for public APIs
Ensure error messages are clear and actionable
Maintain backward compatibility where possible

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

OpenSMILES Specification for the grammar reference
Daylight Chemical Information Systems for SMILES theory
Rust community for excellent parsing libraries and patterns

Roadmap

SMARTS Support: Extend to SMARTS (SMILES Arbitrary Target Specification)
Canonicalization: Generate canonical SMILES representations
Validation: Chemical validity checking (valence, aromaticity)
Performance: SIMD optimizations for large-scale processing
WebAssembly: Browser-compatible parsing
Python Bindings: PyO3-based Python integration
