Commit 8e9d726

update data source (manual -> dgacm)
1 parent 2a8436a commit 8e9d726

3 files changed, with 260 additions and 5 deletions

.gitignore

Lines changed: 1 addition & 3 deletions
@@ -1,6 +1,4 @@
-data/dri.xlsx
-data/manual_list.xlsx
-data/dgacm_list.xlsx
+data/
 .playwright-mcp/

 # See https://help.github.com/articles/ignoring-files/ for more about ignoring files.

docs/dri_matching_analysis.md

Lines changed: 253 additions & 0 deletions
@@ -0,0 +1,253 @@
# DRI-to-Symbol Matching Analysis

## Executive Summary

Matching DRI (Document Reference Index) records to database report symbols is challenging because:
1. DRI tracks **document production forecasts** while DB contains **published documents**
2. DRI has no symbol field - only internal SLOT # references
3. Title formats differ significantly between sources

With improved normalization, matching reaches ~58% of DRI titles at a 0.8 threshold and ~68% at 0.7; the current script matches about 25%.

## Data Overview

### DRI Dataset
- **Total records**: 38,435
- **SG-related** (contains "secretary" + "report"): 4,261
- **With entity**: 4,119
- **Unique titles**: ~2,000

| Column | Description |
|--------|-------------|
| SLOT # | Internal DRI reference (e.g., F2510377) |
| CASE # | EOSG case number (sparse) |
| DATE | Document date |
| ENTITY | Authoring department (111 unique) |
| UNIT | Processing unit (SDU, PU, OCDC, RLU, OSAP, SPMU) |
| DOCUMENT TITLE | Free-text title |
| STATUS | Excluded/Not specified/Cleared/For Info/For Clearance |

### Database
- **Total reports**: 6,122
- **With "Secretary" in title**: 280 (4.6%)
- **With "Report of the" in title**: 1,296 (21.2%)

## DRI Title Pattern Variants

| Pattern | Count |
|---------|-------|
| "report of the secretary-general" | 3,571 |
| "note by the secretary-general" | 891 |
| "secretary-general's report" | 51 |
| "letter from secretary-general" | 42 |
| "report from the secretary-general" | 1 |

Note: Some DRI titles contain document symbols (30 records), e.g., "S/78/22", "A/78/986"
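
Where a DRI title embeds a symbol, it could be pulled out directly rather than fuzzy-matched. The following is a minimal sketch, assuming symbols look like the two examples in the note (a series letter, a session or year, and a number). The pattern `SYMBOL_RE` and the helper `extract_symbols` are hypothetical and not part of the script.

```python
import re

# Hypothetical pattern: covers only shapes like "S/78/22" and "A/78/986".
# Real symbols have more variants (suffixes, sub-bodies) that this ignores.
SYMBOL_RE = re.compile(r"\b[AS]/\d{2,4}/\d+\b")


def extract_symbols(title: str) -> list[str]:
    """Return every symbol-like token found in a DRI title."""
    return SYMBOL_RE.findall(title)


print(extract_symbols("Report of the Secretary-General (A/78/986)"))
print(extract_symbols("Report of the Secretary-General on Somalia"))
```

Titles that yield a symbol can skip title matching entirely; the rest fall through to the approaches described in this analysis.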

## Title Format Differences

### DRI Format Examples
```
2024 report of the Secretary-General on strengthening of the coordination...
Report of the Secretary-General on 1559
Secretary-General's Report on the Impact of Rising Military Expenditure...
ACABQ Rpt: Progress on the functioning and development of the Umoja system
```

### DB Format Examples
```
Strengthening of the coordination of emergency humanitarian assistance...
Report of the Secretary-General on Somalia
Progress on the functioning and development of the Umoja system :
```

**Key differences**:
1. DRI often has year prefix ("2024 report...")
2. DRI uses "Secretary-General's Report on X" while DB uses "X : report of the Secretary-General"
3. DRI uses abbreviations (ACABQ Rpt:)
4. Some DRI titles reference resolution numbers only ("Report of the SG on 1559")

## Matching Approaches

### Current Approach (populate_reporting_entities.py)
1. Filter DRI to "report of the secretary-general" (exact phrase)
2. Normalize: lowercase, remove brackets/quotes
3. Pre-filter by 2+ shared words (excluding stopwords)
4. Fuzzy match using `SequenceMatcher`
5. Accept matches with score >= 0.8

**Results**: 481 symbols matched (from 1,935 DRI titles)
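
The pre-filter and fuzzy-match steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `populate_reporting_entities.py`: the stopword list, the helper names `significant_words` and `best_match`, and the symbol-to-title dictionary shape are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical stopword list; the script's actual list is not shown here.
STOPWORDS = {"the", "of", "on", "in", "and", "a", "to", "for"}


def significant_words(title: str) -> set[str]:
    """Lowercased words of a title, minus stopwords."""
    return {w for w in title.lower().split() if w not in STOPWORDS}


def best_match(dri_title: str, db_titles: dict[str, str], threshold: float = 0.8):
    """Return (symbol, score) for the best DB title at or above threshold, else None."""
    dri_words = significant_words(dri_title)
    best = None
    for symbol, db_title in db_titles.items():
        # Pre-filter: skip candidates sharing fewer than 2 significant words.
        if len(dri_words & significant_words(db_title)) < 2:
            continue
        score = SequenceMatcher(None, dri_title.lower(), db_title.lower()).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (symbol, score)
    return best


db = {"S/2025/613": "Report of the Secretary-General on Somalia"}
print(best_match("report of the secretary-general on somalia", db))
```

The word-overlap check is cheap, so it keeps the more expensive `SequenceMatcher` comparison off most DRI/DB pairs.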

### Improved Approach (Tested)
1. Broader filter: "secretary" AND "report"
2. Better normalization:
   - Strip year prefixes
   - Remove "Report of the Secretary-General on" boilerplate
   - Remove "ACABQ Rpt:" prefix
   - Remove duplicate markers
3. Keyword-based pre-filtering

**Results at different thresholds**:

| Threshold | Matches | % of DRI | Unique Symbols |
|-----------|---------|----------|----------------|
| 1.0 (exact) | 340 | 17.7% | 340 |
| 0.9 | 832 | 43.4% | - |
| 0.8 | 1,102 | 57.5% | 735 |
| 0.75 | 1,195 | 62.3% | - |
| 0.7 | 1,304 | 68.0% | 809 |

## Match Quality Analysis

### Correct Borderline Matches (0.7-0.85)
```
[0.81] Children and armed conflict in the DRC
       DRI: "Report of the Secretary-General on children and armed conflict..."
       DB: "Children and armed conflict in the Democratic Republic of the Congo"
       → CORRECT (same topic, different title format)

[0.80] World Population and Housing Census
       DRI: "Report of the Secretary-General: The 2020 and 2030 World Population..."
       DB: "2020 and 2030 World Population and Housing Census Programmes"
       → CORRECT (same topic)
```

### Incorrect Matches (False Positives)
```
[0.72] International Residual Mechanism
       DRI: "Construction of a new facility for the International Residual Mechanism..."
       DB: "Financing of the International Residual Mechanism..."
       → WRONG (different reports about same entity)
```

### Non-Matchable DRI Records
Some DRI titles cannot be matched because:

1. **Resolution-only references**:
   - "Report of the Secretary-General on 1559"
   - "Report of the Secretary-General on 2139"
   - These need resolution-to-title mapping

2. **Abbreviated titles**:
   - "Report of the Secretary-General on Abyei (UNISFA)"
   - "Report of the Secretary-General on Afghanistan"

3. **Future documents** not yet in DB

## Entity Distribution (DRI)

| Entity | Count |
|--------|-------|
| DPPA | 889 |
| DMSPC | 675 |
| OHCHR | 604 |
| DESA | 550 |
| ODA | 248 |
| OLA | 181 |
| DPO | 113 |
| EOSG | 77 |
| UNW | 68 |
| DGC | 59 |

## Recommendations

### Quick Wins
1. **Lower threshold to 0.75** - increases matches from 58% to 62% with acceptable quality
2. **Improve normalization** - strip SG boilerplate, year prefixes
3. **Exact match first** - normalized titles match 340 records perfectly
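
One way to combine the exact-match and lower-threshold quick wins is a two-stage lookup. This is a sketch under assumptions: titles are already normalized, `db_by_title` is a hypothetical dictionary from normalized DB title to symbol, and `match_exact_then_fuzzy` is an illustrative name rather than a function in the script.

```python
from difflib import SequenceMatcher


def match_exact_then_fuzzy(dri_titles, db_by_title, threshold=0.75):
    """Map each normalized DRI title to (symbol, score); exact hits get score 1.0."""
    matches = {}
    for title in dri_titles:
        # Stage 1: dictionary lookup on the normalized title.
        if title in db_by_title:
            matches[title] = (db_by_title[title], 1.0)
            continue
        # Stage 2: fuzzy fallback over all DB titles (quadratic overall).
        score, db_title = max(
            ((SequenceMatcher(None, title, t).ratio(), t) for t in db_by_title),
            default=(0.0, None),
        )
        if score >= threshold:
            matches[title] = (db_by_title[db_title], score)
    return matches


db_by_title = {"multilingualism": "A/78/790", "peacebuilding fund": "A/79/790"}
print(match_exact_then_fuzzy(["multilingualism", "the peacebuilding fund"], db_by_title))
```

Running the exact stage first means the records that already match perfectly never depend on the threshold choice.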

### Medium Effort
4. **Resolution mapping** - build lookup for "1559" → "implementation of Security Council resolution 1559"
5. **Keyword extraction** - match on 3+ significant keywords when fuzzy match fails
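
The resolution mapping could be a small lookup applied before normalization. In this sketch, `RESOLUTION_TITLES`, `BARE_NUMBER_RE`, and `expand_resolution_reference` are hypothetical names, and the only entry shown is the one given above; a real table would have to be curated.

```python
import re

# Hypothetical lookup; entries would come from a curated resolution list.
RESOLUTION_TITLES = {
    "1559": "implementation of Security Council resolution 1559",
}

# Matches titles that end in a bare 4-digit number, e.g. "... on 1559".
# A trailing year would also match, so unknown numbers are left untouched.
BARE_NUMBER_RE = re.compile(r"\bon\s+(\d{4})\s*$", re.IGNORECASE)


def expand_resolution_reference(title: str) -> str:
    """Replace a resolution-only title with its mapped topic, if known."""
    m = BARE_NUMBER_RE.search(title)
    if m and m.group(1) in RESOLUTION_TITLES:
        return RESOLUTION_TITLES[m.group(1)]
    return title


print(expand_resolution_reference("Report of the Secretary-General on 1559"))
```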

### Larger Effort
6. **Symbol extraction** - parse DB symbol patterns from DRI if available elsewhere
7. **Manual curation** - create mapping table for common abbreviated titles
8. **Date matching** - use DRI DATE to narrow candidates to same publication period
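
Date matching might look like the filter below. It assumes each DB record carries a publication date, which this analysis does not confirm; the `published` key, the 180-day window, and the placeholder symbols are all hypothetical.

```python
from datetime import date, timedelta


def candidates_in_window(dri_date, db_records, days=180):
    """Keep DB records published within `days` of the DRI date (either side)."""
    window = timedelta(days=days)
    return [r for r in db_records if abs(r["published"] - dri_date) <= window]


# Placeholder records with made-up symbols and dates, for illustration only.
records = [
    {"symbol": "X/1", "published": date(2025, 8, 1)},
    {"symbol": "X/2", "published": date(2024, 3, 1)},
]
print(candidates_in_window(date(2025, 7, 1), records))
```

Because DRI tracks forecasts, a one-sided window (published on or after the DRI date) may be the better fit once real date gaps are measured.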

## Keyword Matching for Short Titles

For very short normalized titles (348 records with ≤3 words), keyword-based matching is effective:

| DRI Title | Keywords | Best DB Match |
|-----------|----------|---------------|
| "Report of the SG on Libya" | {libya} | S/2025/611: Strategic review... Libya |
| "Report of the SG on Somalia" | {somalia} | S/2025/613: Report of the SG on Somalia ✓ |
| "Report of the SG on 1701" | {1701} | S/2025/738: Implementation of SC res 1701 ✓ |
| "Multilingualism: Report of the SG" | {multilingualism} | A/78/790: Multilingualism ✓ |
| "The Peacebuilding Fund: Report" | {peacebuilding, fund} | A/79/790: Peacebuilding Fund ✓ |

**Approach**: Build inverted index of keywords → symbols, match by keyword overlap count.
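
A minimal sketch of that inverted index, assuming a hypothetical stopword list chosen so that the keyword sets come out as in the table. The helper names `keywords`, `build_index`, and `rank_by_overlap` are illustrative.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopwords, chosen to reproduce the keyword sets in the table.
STOPWORDS = {"report", "of", "the", "sg", "on", "secretary", "general", "in", "and"}


def keywords(title: str) -> set[str]:
    """Lowercased alphanumeric tokens of a title, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", title.lower())) - STOPWORDS


def build_index(db_titles: dict[str, str]) -> dict[str, set[str]]:
    """Map each keyword to the set of symbols whose title contains it."""
    index = defaultdict(set)
    for symbol, title in db_titles.items():
        for word in keywords(title):
            index[word].add(symbol)
    return index


def rank_by_overlap(dri_title: str, index) -> list[tuple[str, int]]:
    """Symbols ordered by how many DRI keywords their title shares."""
    counts = Counter()
    for word in keywords(dri_title):
        for symbol in index.get(word, ()):
            counts[symbol] += 1
    return counts.most_common()


index = build_index({
    "S/2025/738": "Implementation of SC res 1701",
    "A/79/790": "Peacebuilding Fund",
})
print(rank_by_overlap("The Peacebuilding Fund: Report", index))
```

Single-keyword titles such as "Libya" will often tie across several symbols, so the ranking is best treated as a candidate list for review rather than a final match.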

## Normalization Function (Improved)

```python
import re

import pandas as pd


def normalize_title(t):
    """Reduce a DRI or DB title to a comparable lowercase core."""
    if pd.isna(t):
        return ''
    t = str(t).lower()
    # Strip a leading year together with the SG boilerplate that follows it
    t = re.sub(r'^\d{4}\s+(progress\s+)?report\s+(of|on|from)\s+the\s+secretary[-\s]?general\s*(on|:)?\s*', '', t)
    # Strip SG boilerplate anywhere
    t = re.sub(r'report\s+(of|on|from)\s+the\s+secretary[-\s]?general\s*(on)?', '', t)
    t = re.sub(r'secretary[-\s]?general.?s?\s+report\s*(on)?', '', t)
    # Strip ACABQ prefix
    t = re.sub(r'acabq\s+r(e)?pt:', '', t)
    # Strip duplicate markers
    t = re.sub(r'\[duplicate.*?\]', '', t)
    # Drop punctuation, then collapse runs of whitespace
    t = re.sub(r'[^a-z0-9\s]', '', t)
    return ' '.join(t.split())
```

## Coverage Impact

| Approach | DRI Match Rate | Unique Symbols | DB Coverage |
|----------|----------------|----------------|-------------|
| Current (0.8, basic norm) | 25% | 481 | 7.9% |
| Improved norm, 0.8 threshold | 57.5% | 735 | 12.0% |
| Improved norm, 0.7 threshold | 64.3% | 718 | 11.7% |
| Combined (exact + fuzzy + keyword) | 53.3% | 581 | 9.5% |

**Best approach**: Improved normalization with 0.7-0.8 threshold gives best balance.
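
The DB Coverage column appears to be unique symbols divided by the 6,122 reports in the database. The short check below reproduces the table values under that assumption; `db_coverage` is a hypothetical helper name.

```python
TOTAL_DB_REPORTS = 6122  # total reports in the database, from Data Overview


def db_coverage(unique_symbols: int) -> float:
    """Share of DB reports covered by matched symbols, as a percentage."""
    return round(100 * unique_symbols / TOTAL_DB_REPORTS, 1)


for label, symbols in [("current", 481), ("improved, 0.8", 735), ("improved, 0.7", 718)]:
    print(f"{label}: {db_coverage(symbols)}%")

# Relative gain in matched symbols from 481 to 718
print(f"relative gain: {round(100 * (718 / 481 - 1))}%")
```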

### Total Entity Coverage (DGACM + DRI)

| Source | Symbols | DB Coverage |
|--------|---------|-------------|
| DGACM list only | 296 | 4.8% |
| DRI improved only | 718 | 11.7% |
| **Combined (deduplicated)** | ~850 | ~14% |

**Note**: "Note by the Secretary-General" patterns (891 DRI records) are NOT in the current DB - these may be a different document type or need separate ingestion.

## Next Steps

### Immediate (Update Script)
1. Update `populate_reporting_entities.py` with improved normalization function
2. Lower threshold from 0.8 to 0.75
3. Broaden DRI filter from "report of the secretary-general" to "secretary" AND "report"

### Short-term
4. Build resolution number → title lookup table for "Report on 1701" style titles
5. Add keyword fallback matching for short normalized titles
6. Review and validate a sample of borderline matches (0.7-0.8)

### Investigation
7. Determine why 46% of DRI titles don't match - are these:
   - Future documents not yet published?
   - Documents in different DB?
   - Title variations needing manual mapping?

## Conclusion

The DRI→Symbol matching problem is fundamentally a **title reconciliation** challenge between two systems with different conventions.

**Current state**: 481 symbols matched (7.9% DB coverage)

**Achievable with improvements**: 718 symbols matched (11.7% DB coverage) - a **49% improvement**

Key changes:
- Better normalization (strip SG boilerplate, year prefixes)
- Lower threshold (0.75 instead of 0.8)
- Broader DRI filter

For the remaining 46% unmatched DRI records, manual investigation is needed to determine if they represent future documents, different document types, or correctable title variations.

python/populate_reporting_entities.py

Lines changed: 6 additions & 2 deletions
@@ -170,17 +170,21 @@ def match_dri_to_db(dri_df: pd.DataFrame, db_df: pd.DataFrame, threshold: float
     return matches


-def populate_table(conn, manual_df: pd.DataFrame, dri_matches: dict):
+def populate_table(conn, dgacm_df: pd.DataFrame, dri_matches: dict):
     """Populate the reporting_entities table."""
     print("Populating reporting_entities table...")

     cur = conn.cursor()

+    # Clear existing data
+    cur.execute(f"TRUNCATE TABLE {DB_SCHEMA}.reporting_entities")
+    print(" Cleared existing data")
+
     # Collect all data
     all_data = {}

     # Add DGACM list data (higher priority)
-    for _, row in manual_df.iterrows():
+    for _, row in dgacm_df.iterrows():
         symbol = row["symbol"]
         entity = row["entity"]
         if symbol and entity:
