|
1 | | -# Phase 1: Chemical Perception |
| 1 | +# Phase 1: Chemical Perception Overview |
2 | 2 |
|
3 | | -Chemical perception starts with nothing more than `MolecularGraph` connectivity and ends with an `AnnotatedMolecule` packed with ring membership, Kekulé-expanded bonds, formal charges, conjugation flags, and hybridization. Every later phase (typing + topology building) simply reads the data created here, so correctness and determinism are critical. |
| 3 | +Chemical perception is the first stage in the `dreid-typer` pipeline. It transforms a minimal `MolecularGraph` (atoms + bonds) into an `AnnotatedMolecule`, a chemically aware structure that records every property the typing and builder phases require. The orchestrator is `perception::perceive`, which executes six deterministic passes in a fixed order. |
4 | 4 |
|
5 | | -## The Ordered Pipeline |
6 | | - |
7 | | -`perception::perceive` executes six passes in a fixed order. Each pass depends on the artefacts emitted by the preceding one. |
| 5 | +## Pipeline Overview |
8 | 6 |
|
9 | 7 | ```mermaid |
10 | 8 | graph TD |
11 | | - A["<b>MolecularGraph</b>"] --> B("1. Rings<br /><code>rings::perceive</code>") |
12 | | - B --> C("2. Kekulization<br /><code>kekulize::perceive</code>") |
13 | | - C --> D("3. Electron Heuristics<br /><code>electrons::perceive</code>") |
14 | | - D --> E("4. Aromaticity<br /><code>aromaticity::perceive</code>") |
15 | | - E --> F("5. Resonance Mapping<br /><code>resonance::perceive</code>") |
16 | | - F --> G("6. Hybridization<br /><code>hybridization::perceive</code>") |
17 | | - G --> H["<b>AnnotatedMolecule</b>"] |
| 9 | + subgraph Input |
| 10 | + A["<b>MolecularGraph</b><br><i>Connectivity only</i>"] |
| 11 | + end |
| 12 | +
|
| 13 | + subgraph "Chemical Perception (perception::perceive)" |
| 14 | + RINGS["1. Rings<br><code>rings::perceive</code>"] |
| 15 | + KEK["2. Kekulé Expansion<br><code>kekulize::perceive</code>"] |
| 16 | + ELECTRONS["3. Electron Assignments<br><code>electrons::perceive</code>"] |
| 17 | + AROMA["4. Aromaticity<br><code>aromaticity::perceive</code>"] |
| 18 | + RESON["5. Resonance<br><code>resonance::perceive</code>"] |
| 19 | + HYBRID["6. Hybridization<br><code>hybridization::perceive</code>"] |
| 20 | + end |
| 21 | +
|
| 22 | + subgraph Output |
| 23 | + OUT["<b>AnnotatedMolecule</b><br><i>Ring + electronic context</i>"] |
| 24 | + end |
| 25 | +
|
| 26 | + A --> RINGS --> KEK --> ELECTRONS --> AROMA --> RESON --> HYBRID --> OUT |
18 | 27 | ``` |
19 | 28 |
|
20 | | -- **Rings** discover the Smallest Set of Smallest Rings (SSSR) so that later passes know which atoms are cyclic and what the smallest ring size is. |
21 | | -- **Kekulization** rewrites every aromatic bond into an explicit single/double assignment that satisfies valence limits. |
22 | | -- **Electron heuristics** assign formal charges, lone pairs, and special-case templates for nitro, sulfonyl, halogen oxyanions, etc. |
23 | | -- **Aromaticity** applies Hückel counting to each ring system and records whether atoms are aromatic or anti-aromatic. |
24 | | -- **Resonance** leverages the `pauling` crate plus local heuristics to mark conjugated atoms and damp unrealistic resonance participation (e.g., sulfate oxygens). |
25 | | -- **Hybridization** collapses degrees + lone pairs into a final hybridization label, with special handling for conjugated or aromatic atoms. |
26 | | - |
27 | | -The rest of this document dives into each pass. |
28 | | - |
29 | | -## Step 1 – Ring System Perception (`rings::perceive`) |
30 | | - |
31 | | -The first task is topological: determine which atoms reside in rings. The algorithm computes the cyclomatic number to short-circuit acyclic graphs, then enumerates candidate cycles by temporarily removing each bond and finding a shortest path between its endpoints. From this candidate list the code selects a minimal cycle basis via Gaussian elimination over GF(2), producing an SSSR set. |
32 | | - |
33 | | -For every atom in a detected ring the annotator sets `is_in_ring = true` and stores the smallest ring size observed. These flags are prerequisites for Kekulé expansion (aromatic bonds must live inside rings) and for aromaticity. |
34 | | - |
35 | | -## Step 2 – Kekulé Expansion (`kekulize::perceive`) |
36 | | - |
37 | | -Input molecules may use `BondOrder::Aromatic` to indicate resonance. Downstream components only understand concrete single/double orders, so this pass searches each connected aromatic system for a valid Kekulé assignment: |
38 | | - |
39 | | -1. Group aromatic bonds into systems via BFS. |
40 | | -2. Run a backtracking solver that tries all single/double assignments while enforcing per-element valence limits (e.g., neutral nitrogen allows at most one double bond inside the system). |
41 | | -3. Apply the chosen orders to both the bond list and adjacency lists. |
42 | | - |
43 | | -If any aromatic bond connects atoms that were not marked as ring members in Step 1, the pass emits `TyperError::PerceptionFailed` with details about the failing step. |
44 | | - |
45 | | -## Step 3 – Electron & Charge Heuristics (`electrons::perceive`) |
46 | | - |
47 | | -With concrete bond orders in place the perceiver can reason about electrons. Rather than a single generic formula, this crate implements a template-style pass that covers the tricky functional groups DREIDING cares about. |
48 | | - |
49 | | -- **Nitro / nitrone / carboxylate:** assigns the correct +1/−1 charge split between the heteroatoms and sets heteroatom lone pairs (e.g., nitro oxygen holds three lone pairs when singly bound). |
50 | | -- **Sulfur & halogen oxyanions:** handles `SO2`, sulfones, sulfonates, and perchlorate-style centers by distributing charges onto terminal oxygens and leaving the central atom positively charged when appropriate. |
51 | | -- **Ammonium/iminium/onium/phosphonium:** detects degree-four nitrogen, oxygen, sulfur, and phosphorus centers and enforces +1 charges with the right lone-pair count. |
52 | | -- **Phenolate/enolate anions:** forces single-bound oxygens attached to `sp2` carbons to carry −1 charge and extra lone pairs. |
53 | | - |
54 | | -Atoms touched here are marked as “processed” so they are not recomputed later in the same pass. All other atoms fall back to generic electron counting using valence, bond order, and formal charge. |
| 29 | +Each pass mutates the shared `AnnotatedMolecule`. Later stages can rely on the invariants produced by earlier ones (e.g., hybridization assumes resonance has already run). The following sections summarize the responsibilities of each pass. |
55 | 30 |
|
56 | | -## Step 4 – Aromaticity (`aromaticity::perceive`) |
| 31 | +## 1. Ring Detection — `rings::perceive` |
57 | 32 |
|
58 | | -Armed with ring membership, explicit bond orders, and lone-pair counts, the aromaticity pass evaluates each ring system: |
| 33 | +- **Goal:** Identify the Smallest Set of Smallest Rings (SSSR) so that downstream logic knows which atoms are cyclic and the size of the smallest ring each one belongs to. |
| 34 | +- **How it works:** The pass enumerates candidate cycles by temporarily removing each bond and searching for an alternative path between its endpoints, then selects a minimal cycle basis via bit-vector Gaussian elimination. Each ring is stored as a sorted list of atom IDs. Every atom in a detected ring is flagged with `is_in_ring = true` and records its `smallest_ring_size`. |
| 35 | +- **Why it matters:** Aromaticity, resonance, and hybridization all depend on knowing whether atoms participate in cyclic systems. |
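
The basis-selection step described above can be illustrated with a small sketch. It is not the crate's implementation: the `CycleBits` encoding (one bit per bond index, at most 64 bonds) and the function name `select_cycle_basis` are assumptions made for illustration. Candidates are visited shortest first, and a candidate is kept only if it is linearly independent over GF(2) of the cycles already kept.

```rust
/// Hypothetical encoding: bit i is set when bond i belongs to the cycle.
type CycleBits = u64;

/// Keeps candidates that are linearly independent over GF(2), visiting
/// shorter cycles first so the resulting basis prefers small rings.
/// `rank` is the expected basis size (bonds - atoms + components).
fn select_cycle_basis(mut candidates: Vec<CycleBits>, rank: usize) -> Vec<CycleBits> {
    // Stable sort by number of bonds in the cycle.
    candidates.sort_by_key(|c| c.count_ones());

    // Reduced rows; each has a leading bit that no other row has set.
    let mut pivots: Vec<CycleBits> = Vec::new();
    let mut basis: Vec<CycleBits> = Vec::new();

    for cand in candidates {
        let mut reduced = cand;
        for &p in &pivots {
            // XOR is addition over GF(2): cancel p's leading bit if present.
            let lead = 1u64 << (63 - p.leading_zeros());
            if reduced & lead != 0 {
                reduced ^= p;
            }
        }
        // A non-zero remainder means the candidate adds a new dimension.
        if reduced != 0 {
            pivots.push(reduced);
            basis.push(cand);
            if basis.len() == rank {
                break;
            }
        }
    }
    basis
}
```

For a four-membered ring with one diagonal bond (bonds 0 to 3 around the ring, bond 4 across it), the two triangles are kept and the outer square is rejected because it is the XOR of the two triangles.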
59 | 36 |
|
60 | | -1. Build ring adjacency to cluster fused rings into systems. |
61 | | -2. For each system, ensure every atom is potentially planar (steric number ≤ 3, or special lone-pair cases). |
62 | | -3. Count π-electrons per atom: |
63 | | - - Endocyclic double bond → +1 |
64 | | - - Lone pair donated into the ring (no exocyclic π bond) → +2 |
65 | | - - Formal charge −1 → +2; formal charge +1 → +0 |
66 | | - - Resonant flag from previous steps → +1 (captures delocalized contributions) |
67 | | -4. Classify as aromatic if the total satisfies $4n+2$, anti-aromatic if $4n$, neither otherwise. |
| 37 | +## 2. Kekulé Expansion — `kekulize::perceive` |
68 | 38 |
|
69 | | -If an entire fused system fails the aromatic test, each constituent ring is re-evaluated individually. At the end, every atom records whether it is aromatic, anti-aromatic, or neither. |
| 39 | +- **Goal:** Replace every aromatic bond with an explicit single/double assignment that respects valence and heteroatom allowances. |
| 40 | +- **How it works:** The pass validates that every aromatic bond is fully contained within a ring, partitions the aromatic bonds into connected systems, and runs a backtracking Kekulé solver for each system. Nitrogen and phosphorus receive one "double-bond allowance" to enforce the correct valence counts. Successful assignments update both the bond table and the adjacency lists. |
| 41 | +- **Why it matters:** Electron counting, aromaticity, and resonance all rely on concrete bond multiplicities. Without Kekulé expansion, delocalized input would prevent later passes from recognizing π-bonds. |
70 | 42 |
|
71 | | -## Step 5 – Resonance & Conjugation (`resonance::perceive`) |
| 43 | +## 3. Electron Assignments — `electrons::perceive` |
72 | 44 |
|
73 | | -This phase bridges the gap between localized double bonds and the broader “is this atom part of a conjugated system?” question. |
| 45 | +- **Goal:** Populate `formal_charge` and `lone_pairs` for every atom via a mixture of targeted functional-group heuristics and a general valence fallback. |
| 46 | +- **How it works:** |
| 47 | + - Pattern recognizers detect nitrones, nitro groups, sulfoxides/sulfones, halogen oxyanions, phosphoryl fragments, carboxylates, ammonium/iminium, onium/phosphonium ions, and enolate/phenolate anions. When a pattern matches, the participating atoms are marked as processed and assigned the chemically expected charges/lone pairs. |
| 48 | + - Atoms that remain unprocessed fall back to a valence-based routine that balances valence electrons, bond orders, and existing formal charges. |
| 49 | +- **Why it matters:** Accurate charges and lone-pair counts underpin aromaticity checks, resonance detection, and hybridization inference. |
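
As a rough illustration of the valence fallback, the sketch below derives a lone-pair count from a simple electron balance. The formula and the function name `lone_pairs_from_valence` are assumptions, not taken from the crate: non-bonding electrons are taken as valence electrons minus the sum of bond orders minus the formal charge, and each pair of them is one lone pair.

```rust
/// Assumed electron balance, for illustration only:
/// non-bonding electrons = valence electrons - bond-order sum - formal charge.
/// Returns `None` when the inputs imply a negative electron count.
fn lone_pairs_from_valence(
    valence_electrons: i32,
    bond_order_sum: i32,
    formal_charge: i32,
) -> Option<u8> {
    let nonbonding = valence_electrons - bond_order_sum - formal_charge;
    if nonbonding < 0 {
        return None; // over-bonded for the given charge
    }
    // Integer division drops an unpaired electron, if any.
    Some((nonbonding / 2) as u8)
}
```

Under this assumption a neutral water oxygen (6 valence electrons, bond-order sum 2) gets two lone pairs, an ammonium nitrogen (5, 4, charge +1) gets none, and a singly bound oxygen carrying a charge of -1 gets three.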
74 | 50 |
|
75 | | -1. Call `pauling::find_resonance_systems` to obtain conjugated components and mark `is_in_conjugated_system` for all returned atoms. |
76 | | -2. Augment the raw result with chemistry-specific heuristics: |
77 | | - - Aromatic atoms are always conjugated. |
78 | | - - Amides, thioamides, and sulfonamides propagate conjugation from the carbonyl carbon into neighboring lone-pair donors. |
79 | | - - Halogen oxyanions clamp their terminal oxygens to **not** conjugate (prevents incorrectly labeling perchlorate oxygens as resonant). |
80 | | - - Purely σ-bound sulfurs are demoted out of conjugation even if upstream heuristics temporarily marked them. |
| 51 | +## 4. Aromaticity — `aromaticity::perceive` |
81 | 52 |
|
82 | | -This pass also sets `is_resonant` for atoms that originated from aromatic bonds before Kekulé expansion so that the typing engine can recognize `*_R` atom types without conflating resonance with aromaticity. |
| 53 | +- **Goal:** Classify fused ring systems as aromatic, anti-aromatic, or neither using a Hückel π-electron count with planarity heuristics. |
| 54 | +- **How it works:** Rings are grouped into systems that share atoms. For each system, the model counts π-electrons contributed by in-ring double bonds, lone pairs, or formal charges, while also checking for cross-conjugation and planarity (via steric-number heuristics). If the system is aromatic (4n+2 electrons), every atom in the system receives `is_aromatic = true`. Anti-aromatic systems (4n electrons) instead set `is_anti_aromatic = true`. A fused system that fails the aromatic test as a whole falls back to per-ring evaluation. |
| 55 | +- **Why it matters:** Aromatic flags influence resonance, hybridization, and ultimately the typing rules (e.g., `C_R`, `N_R`). |
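
The final classification step reduces to a check on the π-electron total. The sketch below shows only that check. The `RingClass` enum and `classify_pi_count` are hypothetical names, and treating a count of zero as "neither" is an assumption rather than documented behavior. Planarity and cross-conjugation checks are omitted.

```rust
/// Hypothetical result type for the Hückel check.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum RingClass {
    Aromatic,
    AntiAromatic,
    Neither,
}

/// Classifies a ring system from its total pi-electron count.
/// 4n+2 electrons -> aromatic; 4n electrons (n >= 1) -> anti-aromatic.
fn classify_pi_count(pi_electrons: u32) -> RingClass {
    match pi_electrons % 4 {
        2 => RingClass::Aromatic,
        // Assumption: an empty pi system is not treated as anti-aromatic.
        0 if pi_electrons > 0 => RingClass::AntiAromatic,
        _ => RingClass::Neither,
    }
}
```

Benzene (6 π-electrons) and naphthalene (10) classify as aromatic, cyclobutadiene (4) as anti-aromatic, and an odd count such as 5 as neither.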
83 | 56 |
|
84 | | -## Step 6 – Hybridization (`hybridization::perceive`) |
| 57 | +## 5. Resonance — `resonance::perceive` |
85 | 58 |
|
86 | | -The final pass converts the accumulated data into a definitive `Hybridization` enum per atom. The logic proceeds in priority order: |
| 59 | +- **Goal:** Mark atoms that participate in conjugated systems, even when they are not part of a strictly aromatic ring. |
| 60 | +- **How it works:** The pass delegates to the external `pauling` crate to discover resonance systems, then overlays project-specific heuristics: |
| 61 | + - Aromatic atoms are always marked conjugated. |
| 62 | + - Amide/thioamide and sulfonamide motifs promote their heteroatom donors into conjugation when lone pairs are available. |
| 63 | + - Hypervalent halogen oxyanions have their terminal oxygens demoted to avoid false conjugation. |
| 64 | + - Purely σ-bound sulfurs that slipped through the previous steps are also demoted. |
| 65 | +- **Why it matters:** Conjugation flags feed hybridization inference and help the typing engine distinguish resonant atoms from plain sp² centers. |
87 | 66 |
|
88 | | -1. **Non-hybridized elements:** alkali/alkaline metals, halogens, noble gases, and most transition metals default to `Hybridization::None`. |
89 | | -2. **Conjugated overrides:** atoms flagged as conjugated (and not anti-aromatic) with steric number ≤ 3—or steric 4 with a lone pair—downgrade to `Hybridization::Resonant`, collapsing their steric number to 3 so the builder emits impropers. |
90 | | -3. **Aromatic planarity:** aromatic atoms default to `SP2` even if their steric number would imply 4, enforcing planarity in ring systems. |
91 | | -4. **VSEPR fallback:** remaining atoms map steric numbers to `SP3`/`SP2`/`SP`/`None`. Steric numbers above 4 trigger `PerceptionError::HybridizationInference`, surfacing impossible inputs early. |
| 67 | +## 6. Hybridization — `hybridization::perceive` |
92 | 68 |
|
93 | | -Each atom's final steric number is normalized after hybridization so that later components can rely on `steric_number` without recalculating. |
| 69 | +- **Goal:** Assign the final `Hybridization` enum and normalized `steric_number` for every atom. |
| 70 | +- **How it works:** For each atom: |
| 71 | + - Elements that never hybridize (alkali metals, halogens, most transition metals) are stamped as `Hybridization::None`. |
| 72 | + - Conjugated atoms that are not anti-aromatic collapse to `Hybridization::Resonant`, even when their raw steric number is four (lone-pair donation collapses the geometry to trigonal). |
| 73 | + - Aromatic atoms default to `Hybridization::SP2`. |
| 74 | + - Remaining atoms fall back to VSEPR rules derived from `degree + lone_pairs`. |
| 75 | + - The stored `steric_number` is renormalized so downstream consumers can rely on 2/3/4 despite resonance collapsing a formal 4 to 3. |
| 76 | +- **Why it matters:** The typing rules operate primarily on the `hybridization`, aromatic flags, and neighbor information produced by this pass. The builder also copies the final hybridization into the emitted topology. |
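
The priority order above can be sketched as a single function. This is a simplified model, not the crate's code: `AtomFacts`, `Hyb`, and `infer_hybridization` are hypothetical names, the check for non-hybridizing elements is left out, and the error is a plain `String` rather than the crate's error type.

```rust
/// Hypothetical stand-in for the crate's `Hybridization` enum.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Hyb {
    Sp,
    Sp2,
    Sp3,
    Resonant,
    Unhybridized,
}

/// Hypothetical subset of the per-atom annotations used by this pass.
struct AtomFacts {
    degree: u8,
    lone_pairs: u8,
    conjugated: bool,
    aromatic: bool,
    anti_aromatic: bool,
}

fn infer_hybridization(a: &AtomFacts) -> Result<Hyb, String> {
    let steric = a.degree + a.lone_pairs;

    // 1. Conjugated, non-anti-aromatic atoms collapse to Resonant, including
    //    steric-4 atoms that can donate a lone pair.
    let can_flatten = steric <= 3 || (steric == 4 && a.lone_pairs > 0);
    if a.conjugated && !a.anti_aromatic && can_flatten {
        return Ok(Hyb::Resonant);
    }
    // 2. Aromatic atoms default to sp2.
    if a.aromatic {
        return Ok(Hyb::Sp2);
    }
    // 3. VSEPR fallback on degree + lone pairs.
    match steric {
        4 => Ok(Hyb::Sp3),
        3 => Ok(Hyb::Sp2),
        2 => Ok(Hyb::Sp),
        0 | 1 => Ok(Hyb::Unhybridized),
        n => Err(format!("steric number {n} exceeds 4")),
    }
}
```

With these rules an amide nitrogen (degree 3, one lone pair, conjugated) comes out as `Resonant`, while an unconjugated amine nitrogen with the same degree and lone-pair count falls through to `Sp3`.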
94 | 77 |
|
95 | | -## Output Guarantees |
| 78 | +--- |
96 | 79 |
|
97 | | -After the six passes finish, `AnnotatedMolecule` contains: |
| 80 | +By the end of chemical perception every `AnnotatedAtom` contains: |
98 | 81 |
|
99 | | -- Fully expanded bond orders (no `BondOrder::Aromatic`). |
100 | | -- Ring membership and smallest ring sizes. |
101 | | -- Formal charges, lone-pair counts, and degree for every atom. |
102 | | -- Aromatic / anti-aromatic / conjugated / resonant flags. |
103 | | -- Final steric number and `Hybridization` assignment. |
| 82 | +- identity (`element`, `id`, `degree`) |
| 83 | +- ring context (`is_in_ring`, `smallest_ring_size`) |
| 84 | +- electronic structure (`formal_charge`, `lone_pairs`, `is_resonant`, `is_in_conjugated_system`) |
| 85 | +- aromaticity flags (`is_aromatic`, `is_anti_aromatic`) |
| 86 | +- geometry (`hybridization`, normalized `steric_number`) |
104 | 87 |
|
105 | | -This rich, immutable snapshot is the sole source of chemical truth for the Typing Engine and the Topology Builder. |
| 88 | +This richly annotated molecule is the single source of truth for both the typing engine and the topology builder. |