| 1 | +# DRI-to-Symbol Matching Analysis |
| 2 | + |
| 3 | +## Executive Summary |
| 4 | + |
| 5 | +Matching DRI (Document Reference Index) records to database report symbols is challenging because: |
| 6 | +1. DRI tracks **document production forecasts** while DB contains **published documents** |
| 7 | +2. DRI has no symbol field - only internal SLOT # references |
| 8 | +3. Title formats differ significantly between sources |
| 9 | +4. Improved normalization matches ~58% of DRI titles at a 0.8 threshold and ~68% at 0.7; the current script matches ~25% |
| 10 | + |
| 11 | +## Data Overview |
| 12 | + |
| 13 | +### DRI Dataset |
| 14 | +- **Total records**: 38,435 |
| 15 | +- **SG-related** (contains "secretary" + "report"): 4,261 |
| 16 | +- **With entity**: 4,119 |
| 17 | +- **Unique titles**: ~2,000 |
| 18 | + |
| 19 | +| Column | Description | |
| 20 | +|--------|-------------| |
| 21 | +| SLOT # | Internal DRI reference (e.g., F2510377) | |
| 22 | +| CASE # | EOSG case number (sparse) | |
| 23 | +| DATE | Document date | |
| 24 | +| ENTITY | Authoring department (111 unique) | |
| 25 | +| UNIT | Processing unit (SDU, PU, OCDC, RLU, OSAP, SPMU) | |
| 26 | +| DOCUMENT TITLE | Free-text title | |
| 27 | +| STATUS | Excluded/Not specified/Cleared/For Info/For Clearance | |
| 28 | + |
| 29 | +### Database |
| 30 | +- **Total reports**: 6,122 |
| 31 | +- **With "Secretary" in title**: 280 (4.6%) |
| 32 | +- **With "Report of the" in title**: 1,296 (21.2%) |
| 33 | + |
| 34 | +## DRI Title Pattern Variants |
| 35 | + |
| 36 | +| Pattern | Count | |
| 37 | +|---------|-------| |
| 38 | +| "report of the secretary-general" | 3,571 | |
| 39 | +| "note by the secretary-general" | 891 | |
| 40 | +| "secretary-general's report" | 51 | |
| 41 | +| "letter from secretary-general" | 42 | |
| 42 | +| "report from the secretary-general" | 1 | |
| 43 | + |
| 44 | +Note: Some DRI titles contain document symbols (30 records), e.g., "S/78/22", "A/78/986" |
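
Where a DRI title already carries a symbol, it can be pulled out directly instead of fuzzy-matched. The following is a minimal sketch, not the project's code: `SYMBOL_RE` and `extract_symbols` are hypothetical names, and the pattern is an assumption that only covers symbols starting with `A/`, `S/`, or `E/` in upper case.

```python
import re

# Assumed pattern: a series letter (A, S, E) followed by slash-separated
# segments of capitals, digits, and dots, e.g. "A/78/986" or "S/2025/611".
SYMBOL_RE = re.compile(r'\b[ASE]/[A-Z0-9.]+(?:/[A-Z0-9.]+)*')


def extract_symbols(title):
    """Return document symbols found in a free-text title, in order."""
    # A sentence-ending period can be captured by the pattern, so trim it.
    return [s.rstrip('.') for s in SYMBOL_RE.findall(title)]


print(extract_symbols("Follow-up to A/78/986 (see S/78/22)."))
```

Any symbol recovered this way would still need to be checked against the DB symbol column before being accepted as a match.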
| 45 | + |
| 46 | +## Title Format Differences |
| 47 | + |
| 48 | +### DRI Format Examples |
| 49 | +``` |
| 50 | +2024 report of the Secretary-General on strengthening of the coordination... |
| 51 | +Report of the Secretary-General on 1559 |
| 52 | +Secretary-General's Report on the Impact of Rising Military Expenditure... |
| 53 | +ACABQ Rpt: Progress on the functioning and development of the Umoja system |
| 54 | +``` |
| 55 | + |
| 56 | +### DB Format Examples |
| 57 | +``` |
| 58 | +Strengthening of the coordination of emergency humanitarian assistance... |
| 59 | +Report of the Secretary-General on Somalia |
| 60 | +Progress on the functioning and development of the Umoja system : |
| 61 | +``` |
| 62 | + |
| 63 | +**Key differences**: |
| 64 | +1. DRI often has year prefix ("2024 report...") |
| 65 | +2. DRI uses "Secretary-General's Report on X" while DB uses "X : report of the Secretary-General" |
| 66 | +3. DRI uses abbreviations (ACABQ Rpt:) |
| 67 | +4. Some DRI titles reference resolution numbers only ("Report of the SG on 1559") |
| 68 | + |
| 69 | +## Matching Approaches |
| 70 | + |
| 71 | +### Current Approach (`populate_reporting_entities.py`) |
| 72 | +1. Filter DRI to "report of the secretary-general" (exact phrase) |
| 73 | +2. Normalize: lowercase, remove brackets/quotes |
| 74 | +3. Pre-filter by 2+ shared words (excluding stopwords) |
| 75 | +4. Fuzzy match using `SequenceMatcher` |
| 76 | +5. Accept matches with score >= 0.8 |
| 77 | + |
| 78 | +**Results**: 481 symbols matched (from 1,935 DRI titles) |
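
The five steps above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of `populate_reporting_entities.py`: the function names, the stopword list, and the `db_titles` mapping (symbol to title) are all hypothetical.

```python
from difflib import SequenceMatcher

# Assumed stopword list; the real script's list is not shown in this document.
STOPWORDS = {"the", "of", "on", "and", "in", "to", "a", "for", "by"}
# Characters dropped during normalization (brackets and quotes).
DROP_CHARS = str.maketrans("", "", "[]()\"'“”‘’")


def is_sg_report(title):
    """Step 1: keep only titles containing the exact phrase."""
    return "report of the secretary-general" in title.lower()


def basic_normalize(title):
    """Step 2: lowercase, remove brackets/quotes, collapse whitespace."""
    return " ".join(title.lower().translate(DROP_CHARS).split())


def content_words(normalized):
    """Words of a normalized title that are not stopwords."""
    return {w for w in normalized.split() if w not in STOPWORDS}


def best_match(dri_title, db_titles, threshold=0.8, min_shared=2):
    """Steps 3-5: pre-filter by shared words, score, apply the threshold.

    Returns (symbol, score) for the best candidate, or None.
    """
    query = basic_normalize(dri_title)
    query_words = content_words(query)
    best = None
    for symbol, db_title in db_titles.items():
        cand = basic_normalize(db_title)
        if len(query_words & content_words(cand)) < min_shared:
            continue  # too few words in common to be worth scoring
        score = SequenceMatcher(None, query, cand).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (symbol, score)
    return best
```

Because `SequenceMatcher` scores the whole string, shared boilerplate such as "report of the secretary-general on" inflates the ratio between otherwise unrelated titles, which is one reason to strip it before scoring.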
| 79 | + |
| 80 | +### Improved Approach (Tested) |
| 81 | +1. Broader filter: "secretary" AND "report" |
| 82 | +2. Better normalization: |
| 83 | + - Strip year prefixes |
| 84 | + - Remove "Report of the Secretary-General on" boilerplate |
| 85 | + - Remove "ACABQ Rpt:" prefix |
| 86 | + - Remove duplicate markers |
| 87 | +3. Keyword-based pre-filtering |
| 88 | + |
| 89 | +**Results at different thresholds**: |
| 90 | + |
| 91 | +| Threshold | Matches | % of DRI | Unique Symbols | |
| 92 | +|-----------|---------|----------|----------------| |
| 93 | +| 1.0 (exact) | 340 | 17.7% | 340 | |
| 94 | +| 0.9 | 832 | 43.4% | - | |
| 95 | +| 0.8 | 1,102 | 57.5% | 735 | |
| 96 | +| 0.75 | 1,195 | 62.3% | - | |
| 97 | +| 0.7 | 1,304 | 68.0% | 809 | |
| 98 | + |
| 99 | +## Match Quality Analysis |
| 100 | + |
| 101 | +### Correct Borderline Matches (0.7-0.85) |
| 102 | +``` |
| 103 | +[0.81] Children and armed conflict in the DRC |
| 104 | + DRI: "Report of the Secretary-General on children and armed conflict..." |
| 105 | + DB: "Children and armed conflict in the Democratic Republic of the Congo" |
| 106 | + → CORRECT (same topic, different title format) |
| 107 | +
|
| 108 | +[0.80] World Population and Housing Census |
| 109 | + DRI: "Report of the Secretary-General: The 2020 and 2030 World Population..." |
| 110 | + DB: "2020 and 2030 World Population and Housing Census Programmes" |
| 111 | + → CORRECT (same topic) |
| 112 | +``` |
| 113 | + |
| 114 | +### Incorrect Matches (False Positives) |
| 115 | +``` |
| 116 | +[0.72] International Residual Mechanism |
| 117 | + DRI: "Construction of a new facility for the International Residual Mechanism..." |
| 118 | + DB: "Financing of the International Residual Mechanism..." |
| 119 | + → WRONG (different reports about same entity) |
| 120 | +``` |
| 121 | + |
| 122 | +### Non-Matchable DRI Records |
| 123 | +Some DRI titles cannot be matched because: |
| 124 | + |
| 125 | +1. **Resolution-only references**: |
| 126 | + - "Report of the Secretary-General on 1559" |
| 127 | + - "Report of the Secretary-General on 2139" |
| 128 | + - These need resolution-to-title mapping |
| 129 | + |
| 130 | +2. **Abbreviated titles**: |
| 131 | + - "Report of the Secretary-General on Abyei (UNISFA)" |
| 132 | + - "Report of the Secretary-General on Afghanistan" |
| 133 | + |
| 134 | +3. **Future documents** not yet in DB |
| 135 | + |
| 136 | +## Entity Distribution (DRI) |
| 137 | + |
| 138 | +| Entity | Count | |
| 139 | +|--------|-------| |
| 140 | +| DPPA | 889 | |
| 141 | +| DMSPC | 675 | |
| 142 | +| OHCHR | 604 | |
| 143 | +| DESA | 550 | |
| 144 | +| ODA | 248 | |
| 145 | +| OLA | 181 | |
| 146 | +| DPO | 113 | |
| 147 | +| EOSG | 77 | |
| 148 | +| UNW | 68 | |
| 149 | +| DGC | 59 | |
| 150 | + |
| 151 | +## Recommendations |
| 152 | + |
| 153 | +### Quick Wins |
| 154 | +1. **Lower threshold to 0.75** - increases matches from 58% to 62% with acceptable quality |
| 155 | +2. **Improve normalization** - strip SG boilerplate, year prefixes |
| 156 | +3. **Exact match first** - normalized titles match 340 records perfectly |
| 157 | + |
| 158 | +### Medium Effort |
| 159 | +4. **Resolution mapping** - build lookup for "1559" → "implementation of Security Council resolution 1559" |
| 160 | +5. **Keyword extraction** - match on 3+ significant keywords when fuzzy match fails |
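
A resolution lookup could be as small as a dictionary applied before normalization. The sketch below is illustrative only: `RESOLUTION_TITLES`, `RESOLUTION_ONLY`, and `expand_resolution_title` are hypothetical names, and the single entry is the one example given above; a real table would have to be curated.

```python
import re

# Hypothetical lookup table; only the 1559 entry comes from this document.
RESOLUTION_TITLES = {
    "1559": "implementation of Security Council resolution 1559",
}

# Matches titles that consist only of SG boilerplate plus a resolution number.
RESOLUTION_ONLY = re.compile(
    r'^report of the (?:secretary-general|sg) on (\d{3,4})$', re.IGNORECASE
)


def expand_resolution_title(title, lookup=RESOLUTION_TITLES):
    """Replace a resolution-only title with its mapped topic, if known."""
    m = RESOLUTION_ONLY.match(title.strip())
    if not m:
        return title
    # Unknown resolution numbers are returned unchanged.
    return lookup.get(m.group(1), title)


print(expand_resolution_title("Report of the Secretary-General on 1559"))
```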
| 161 | + |
| 162 | +### Larger Effort |
| 163 | +6. **Symbol extraction** - parse document symbols directly from DRI titles where present (30 records), or from another DRI source if one is available |
| 164 | +7. **Manual curation** - create mapping table for common abbreviated titles |
| 165 | +8. **Date matching** - use DRI DATE to narrow candidates to same publication period |
| 166 | + |
| 167 | +## Keyword Matching for Short Titles |
| 168 | + |
| 169 | +For very short normalized titles (348 records with ≤3 words), keyword-based matching is effective: |
| 170 | + |
| 171 | +| DRI Title | Keywords | Best DB Match | |
| 172 | +|-----------|----------|---------------| |
| 173 | +| "Report of the SG on Libya" | {libya} | S/2025/611: Strategic review... Libya | |
| 174 | +| "Report of the SG on Somalia" | {somalia} | S/2025/613: Report of the SG on Somalia ✓ | |
| 175 | +| "Report of the SG on 1701" | {1701} | S/2025/738: Implementation of SC res 1701 ✓ | |
| 176 | +| "Multilingualism: Report of the SG" | {multilingualism} | A/78/790: Multilingualism ✓ | |
| 177 | +| "The Peacebuilding Fund: Report" | {peacebuilding, fund} | A/79/790: Peacebuilding Fund ✓ | |
| 178 | + |
| 179 | +**Approach**: Build inverted index of keywords → symbols, match by keyword overlap count. |
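
One way to sketch the inverted-index idea is shown below. The names (`keywords`, `build_index`, `rank_by_overlap`), the stopword list, and the `db_titles` mapping are assumptions for illustration, not the tested implementation.

```python
import re
from collections import Counter, defaultdict

# Assumed stopwords, including the SG boilerplate words.
STOPWORDS = {"the", "of", "on", "and", "in", "to", "a", "for", "by",
             "report", "secretary", "general", "sg"}


def keywords(title):
    """Lowercased alphanumeric tokens of a title, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", title.lower())
            if w not in STOPWORDS}


def build_index(db_titles):
    """Map each keyword to the set of symbols whose title contains it."""
    index = defaultdict(set)
    for symbol, title in db_titles.items():
        for kw in keywords(title):
            index[kw].add(symbol)
    return index


def rank_by_overlap(dri_title, index):
    """Return (symbol, shared keyword count) pairs, highest count first."""
    counts = Counter()
    for kw in keywords(dri_title):
        for symbol in index.get(kw, ()):
            counts[symbol] += 1
    return counts.most_common()


db_titles = {"A/78/790": "Multilingualism", "A/79/790": "Peacebuilding Fund"}
index = build_index(db_titles)
print(rank_by_overlap("The Peacebuilding Fund: Report", index))
```

With a single keyword such as `{libya}`, several DB reports can tie on overlap count, so a tie-breaker (for example the document date) would probably be needed in practice.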
| 180 | + |
| 181 | +## Normalization Function (Improved) |
| 182 | + |
```python
import re

import pandas as pd


def normalize_title(t):
    """Reduce a DRI or DB title to its topic words for fuzzy matching."""
    if pd.isna(t):
        return ''
    t = str(t).lower()
    # Strip a leading year together with the SG boilerplate that follows it
    t = re.sub(r'^\d{4}\s+(progress\s+)?report\s+(of|on|from)\s+the\s+secretary[-\s]?general\s*(on\b|:)?\s*', '', t)
    # Strip SG boilerplate anywhere (\b keeps "on" from eating the start of a longer word)
    t = re.sub(r'report\s+(of|on|from)\s+the\s+secretary[-\s]?general\s*(on\b)?', '', t)
    t = re.sub(r'secretary[-\s]?general.?s?\s+report\s*(on\b)?', '', t)
    # Strip ACABQ prefix ("ACABQ Rpt:" or "ACABQ Rept:")
    t = re.sub(r'acabq\s+r(e)?pt:', '', t)
    # Strip duplicate markers such as "[duplicate ...]"
    t = re.sub(r'\[duplicate.*?\]', '', t)
    # Drop remaining punctuation, then collapse whitespace
    t = re.sub(r'[^a-z0-9\s]', '', t)
    return ' '.join(t.split())
```
| 200 | + |
| 201 | +## Coverage Impact |
| 202 | + |
| 203 | +| Approach | DRI Match Rate | Unique Symbols | DB Coverage | |
| 204 | +|----------|----------------|----------------|-------------| |
| 205 | +| Current (0.8, basic norm) | 25% | 481 | 7.9% | |
| 206 | +| Improved norm, 0.8 threshold | 57.5% | 735 | 12.0% | |
| 207 | +| Improved norm, 0.7 threshold | 64.3% | 718 | 11.7% | |
| 208 | +| Combined (exact + fuzzy + keyword) | 53.3% | 581 | 9.5% | |
| 209 | + |
| 210 | +**Best approach**: Improved normalization with a threshold between 0.7 and 0.8 gives the best balance between match rate and false positives. |
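
The DB Coverage column is the number of unique matched symbols divided by the 6,122 reports in the database. A short check of that arithmetic, with `db_coverage` and `relative_gain` as hypothetical helper names:

```python
TOTAL_DB_REPORTS = 6122  # total reports, from the Data Overview section


def db_coverage(unique_symbols, total=TOTAL_DB_REPORTS):
    """Share of DB reports with a matched symbol, in percent (1 decimal)."""
    return round(100 * unique_symbols / total, 1)


def relative_gain(before, after):
    """Percentage increase in matched symbols, rounded to a whole percent."""
    return round(100 * (after / before - 1))


for label, symbols in [("current", 481), ("improved, 0.8", 735),
                       ("improved, 0.7", 718), ("combined", 581)]:
    print(f"{label}: {db_coverage(symbols)}%")

print(f"481 -> 718 symbols: +{relative_gain(481, 718)}%")
```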
| 211 | + |
| 212 | +### Total Entity Coverage (DGACM + DRI) |
| 213 | + |
| 214 | +| Source | Symbols | DB Coverage | |
| 215 | +|--------|---------|-------------| |
| 216 | +| DGACM list only | 296 | 4.8% | |
| 217 | +| DRI improved only | 718 | 11.7% | |
| 218 | +| **Combined (deduplicated)** | ~850 | ~14% | |
| 219 | + |
| 220 | +**Note**: "Note by the Secretary-General" patterns (891 DRI records) are NOT in the current DB - these may be a different document type or need separate ingestion. |
| 221 | + |
| 222 | +## Next Steps |
| 223 | + |
| 224 | +### Immediate (Update Script) |
| 225 | +1. Update `populate_reporting_entities.py` with improved normalization function |
| 226 | +2. Lower threshold from 0.8 to 0.75 |
| 227 | +3. Broaden DRI filter from "report of the secretary-general" to "secretary" AND "report" |
| 228 | + |
| 229 | +### Short-term |
| 230 | +4. Build resolution number → title lookup table for "Report on 1701" style titles |
| 231 | +5. Add keyword fallback matching for short normalized titles |
| 232 | +6. Review and validate a sample of borderline matches (0.7-0.8) |
| 233 | + |
| 234 | +### Investigation |
| 235 | +7. Determine why 46% of DRI titles don't match - are these: |
| 236 | + - Future documents not yet published? |
| 237 | + - Documents in different DB? |
| 238 | + - Title variations needing manual mapping? |
| 239 | + |
| 240 | +## Conclusion |
| 241 | + |
| 242 | +The DRI→Symbol matching problem is fundamentally a **title reconciliation** challenge between two systems with different conventions. |
| 243 | + |
| 244 | +**Current state**: 481 symbols matched (7.9% DB coverage) |
| 245 | + |
| 246 | +**Achievable with improvements**: 718 symbols matched (11.7% DB coverage) - a **49% increase** in matched symbols |
| 247 | + |
| 248 | +Key changes: |
| 249 | +- Better normalization (strip SG boilerplate, year prefixes) |
| 250 | +- Lower threshold (0.75 instead of 0.8) |
| 251 | +- Broader DRI filter |
| 252 | + |
| 253 | +For the remaining 46% unmatched DRI records, manual investigation is needed to determine if they represent future documents, different document types, or correctable title variations. |