Commit 01c6ca1

Add experimental research synthesis skill set
1 parent 607683c commit 01c6ca1

4 files changed

+723
-0
lines changed

4 files changed

+723
-0
lines changed
Lines changed: 115 additions & 0 deletions
id: research.extract-findings
version: 0.1.0
name: Extract Findings
description: >
  Extracts structured analytical findings from a normalized corpus of research
  items. For each item, it identifies claims, evidence signals, risks,
  opportunities, and uncertainty markers. It also extracts keywords and
  classifies each item by dominant theme. The output is a findings_by_item
  structure suitable for downstream synthesis, conflict detection, and
  thematic grouping.

inputs:
  normalized_items:
    type: array
    required: true
    description: >
      Normalized items as produced by research.normalize-corpus. Each item must
      include at minimum: id, content, type.

  topic:
    type: string
    required: false
    description: >
      Optional topic or research question anchoring the extraction. When
      provided, findings are prioritized relative to this topic.

  goal:
    type: string
    required: false
    description: >
      Optional goal describing the intended use of the synthesis (e.g. "evaluate
      expansion into market X", "assess regulatory risk"). Influences claim
      relevance scoring.

  focus:
    type: string
    required: false
    description: >
      Optional extraction focus hint. Examples: "prioritize risks", "prioritize
      contradictions", "emphasize opportunities".

  max_detail:
    type: string
    required: false
    description: "Extraction depth: brief | standard | detailed. Defaults to standard."

outputs:
  findings_by_item:
    type: array
    required: true
    description: >
      Array of per-item findings objects. Each entry includes: item_id, item_type,
      claims (array of extracted claims with evidence_strength weak|moderate|strong),
      risks (array), opportunities (array), uncertainty_signals (array of phrases or
      areas that are unclear or weakly supported), dominant_theme, and keywords.

  extraction_stats:
    type: object
    required: true
    description: >
      Summary of the extraction pass: items_processed, total_claims, items_with_risks,
      items_with_opportunities, items_with_uncertainty, dominant_themes (array).

steps:

  - id: extract_structured_findings
    uses: model.output.generate
    input:
      instruction: >
        For each normalized item, extract structured analytical findings anchored
        to the provided topic and goal. For each item produce: item_id, item_type,
        claims (each with text and evidence_strength: weak|moderate|strong),
        risks (array of strings), opportunities (array of strings),
        uncertainty_signals (areas unclear or weakly supported), dominant_theme,
        and keywords. Apply the focus hint if provided. Respect the max_detail
        level. Do not fabricate claims not supported by the item content.
        Also produce extraction_stats: items_processed, total_claims,
        items_with_risks, items_with_opportunities, items_with_uncertainty,
        dominant_themes.
      context_items: inputs.normalized_items
      output_schema:
        type: object
        properties:
          findings_by_item:
            type: array
          extraction_stats:
            type: object
        required:
          - findings_by_item
          - extraction_stats
      detail_level: inputs.max_detail
      constraints:
        no_fabrication: true
        evidence_required: true
    output:
      output.findings_by_item: outputs.findings_by_item
      output.extraction_stats: outputs.extraction_stats
      warnings: vars.extraction_warnings

metadata:
  status: experimental
  tags:
    - research
    - extraction
    - findings
    - claims
    - analysis
  category: research
  use_cases:
    - Extract structured claims and risks from research documents
    - Prepare per-item findings for downstream synthesis or comparison
    - Identify uncertainty signals in a research corpus before making decisions
  classification:
    role: utility
    invocation: direct
    effect_mode: read_only
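The findings_by_item contract above can be sketched with a small example. The entry below is hypothetical sample data (item ids, claim text, and themes are invented for illustration); the validator only checks the field names and evidence_strength values that the skill description promises, not any real runtime behavior.

```python
# Allowed evidence_strength values, as named in the skill spec.
ALLOWED_STRENGTH = {"weak", "moderate", "strong"}

def validate_finding(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry is well-formed."""
    problems = []
    # Field names taken from the findings_by_item description above.
    for field in ("item_id", "item_type", "claims", "risks",
                  "opportunities", "uncertainty_signals",
                  "dominant_theme", "keywords"):
        if field not in entry:
            problems.append(f"missing field: {field}")
    for claim in entry.get("claims", []):
        if claim.get("evidence_strength") not in ALLOWED_STRENGTH:
            problems.append(f"bad evidence_strength: {claim.get('evidence_strength')!r}")
    return problems

# Hypothetical example of one findings_by_item entry.
example = {
    "item_id": "item-1",
    "item_type": "article",
    "claims": [{"text": "Market X grew 12% year over year",
                "evidence_strength": "moderate"}],
    "risks": ["regulatory uncertainty in market X"],
    "opportunities": ["underserved enterprise segment"],
    "uncertainty_signals": ["growth figure comes from a single source"],
    "dominant_theme": "market-expansion",
    "keywords": ["market X", "growth", "regulation"],
}

print(validate_finding(example))  # prints [] — the entry is well-formed
```

This is only a shape check for downstream consumers; the skill itself is responsible for producing entries in this form.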
Lines changed: 90 additions & 0 deletions
id: research.normalize-corpus
version: 0.1.0
name: Normalize Corpus
description: >
  Normalizes a heterogeneous collection of research items into a clean, uniform
  representation ready for analysis. Handles text extraction from raw content,
  chunking of long items, deduplication of near-identical entries, and language
  detection. Items may arrive as pre-extracted text or as source references
  (url, pdf_path, fs_path, memory_key), which are resolved during normalization.

inputs:
  items:
    type: array
    required: true
    description: >
      List of input items to normalize. Each item must include an `id` and either
      a `content` string (pre-extracted text) or a `source_ref` object with `type`
      and `location` fields for lazy extraction. Optional fields per item: title,
      type (article | report | note | web_page | search_result | agent_output |
      transcript | raw_text), source, relevance_hint, metadata.

  max_chunk_size:
    type: number
    required: false
    description: >
      Maximum character length per chunk when splitting long items. Items shorter
      than this threshold are not split. Defaults to a runtime-defined value.

outputs:
  normalized_items:
    type: array
    required: true
    description: >
      List of normalized items, each containing: id, content (extracted text),
      type, language, chunks (array), source, title, and any retained metadata.

  normalization_stats:
    type: object
    required: true
    description: >
      Summary statistics of the normalization pass, including total_input,
      extracted_count, chunked_count, deduplicated_count, and language_distribution.

steps:

  - id: assemble_normalized
    uses: model.output.generate
    input:
      instruction: >
        Return one object with exactly two top-level fields: normalized_items and
        normalization_stats. normalized_items must be an array where each item
        includes: id, content, type (inferred from input if not provided,
        defaulting to raw_text), language, chunks, source, title, and any
        retained metadata. normalization_stats must include: total_input,
        extracted_count, chunked_count, deduplicated_count, and
        language_distribution. Do not rename these fields, omit them, or wrap
        them under any additional structure.
      context_items: inputs.items
      output_schema:
        type: object
        properties:
          normalized_items:
            type: array
          normalization_stats:
            type: object
        required:
          - normalized_items
          - normalization_stats
      detail_level: standard
      constraints:
        max_chunk_size: inputs.max_chunk_size
    output:
      output.normalized_items: outputs.normalized_items
      output.normalization_stats: outputs.normalization_stats

metadata:
  status: experimental
  tags:
    - research
    - normalization
    - corpus
    - preprocessing
  category: research
  use_cases:
    - Prepare heterogeneous research material for downstream analysis
    - Normalize agent outputs, PDFs, web pages, and notes into a uniform format
    - Deduplicate and chunk long documents before synthesis
  classification:
    role: utility
    invocation: direct
    effect_mode: read_only
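The max_chunk_size behavior described above — items at or under the threshold pass through whole, longer items are split — can be sketched as follows. The real runtime chunker, its default threshold, and the extraction/dedup/language steps are not defined by this spec, so the default of 2000 and the fixed-size splitting strategy below are assumptions for illustration only.

```python
def chunk_content(content: str, max_chunk_size: int) -> list[str]:
    """Split content into fixed-size character chunks (illustrative strategy)."""
    if len(content) <= max_chunk_size:
        return [content]  # items shorter than the threshold are not split
    return [content[i:i + max_chunk_size]
            for i in range(0, len(content), max_chunk_size)]

def normalize_item(item: dict, max_chunk_size: int = 2000) -> dict:
    """Shape one input item the way normalized_items entries are described.

    The 2000-character default is a placeholder; the spec says the real
    default is runtime-defined.
    """
    content = item.get("content", "")
    return {
        "id": item["id"],
        "content": content,
        "type": item.get("type", "raw_text"),  # default type named in the spec
        "chunks": chunk_content(content, max_chunk_size),
        "source": item.get("source"),
        "title": item.get("title"),
    }

short = normalize_item({"id": "a", "content": "brief note"})
long = normalize_item({"id": "b", "content": "x" * 4500})
print(len(short["chunks"]), len(long["chunks"]))  # prints: 1 3
```

In a pipeline, the normalized_items produced here would feed directly into research.extract-findings, whose normalized_items input expects at minimum id, content, and type per item.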
