Idea2Paper Project Summary Document

Note: Scripts are now organized under scripts/tools/ and scripts/demos/. Legacy paths (e.g., scripts/build_entity_v3.py) still work via thin wrappers.

📋 Project Overview

Project Name: Idea2Paper - Automated Academic Paper Generation System Based on Knowledge Graph

Core Goal: Automatically transform a user's research Idea into a submission-ready paper Story (Narrative Skeleton) that meets top-tier conference (ICLR) standards.

Tech Stack:

Knowledge Graph: NetworkX
Vector Retrieval: Embedding (Qwen3-Embedding-4B)
Large Language Models: Qwen3-14B, Qwen2.5-7B-Instruct
Data Source: ICLR 2025 Paper Dataset (8,285 papers)

1. System Architecture

1.1 Overall Flowchart

┌─────────────────────────────────────────────────────────────────────────┐
│                       Idea2Paper Complete Workflow                      │
└─────────────────────────────────────────────────────────────────────────┘

User Input Idea
    │
    ├──────────────────────────────────────────────────────────────────────┐
    │                 Phase 1: Knowledge Graph Construction                │
    │                (One-time build, reusable subsequently)               │
    ├──────────────────────────────────────────────────────────────────────┤
    │                                                                      │
    │  1. Load ICLR Paper Data (8,285 papers)                              │
    │      ↓                                                               │
    │  2. Construct 4 Types of Nodes                                       │
    │      ├─ Idea Nodes (8,284)                                           │
    │      ├─ Pattern Nodes (124, LLM-Enhanced)                            │
    │      ├─ Domain Nodes (98)                                            │
    │      └─ Paper Nodes (8,285)                                          │
    │      ↓                                                               │
    │  3. Construct Edge Relations (444,872 edges)                         │
    │      ├─ Basic Connection Edges (Paper→Idea/Pattern/Domain)           │
    │      └─ Retrieval Auxiliary Edges (Idea→Domain, Pattern→Domain)      │
    │      ↓                                                               │
    │  4. Output Knowledge Graph                                           │
    │                                                                      │
    └──────────────────────────────────────────────────────────────────────┘
    │
    ├──────────────────────────────────────────────────────────────────────┐
    │                      Phase 2: Three-Way Retrieval                    │
    │                         (Per run, approx. 27s)                       │
    ├──────────────────────────────────────────────────────────────────────┤
    │                                                                      │
    │  ┌─────────────┬─────────────┬─────────────┐                         │
    │  │   Path 1    │   Path 2    │   Path 3    │                         │
    │  │ Similar Idea│ Domain Rel. │Similar Paper│                         │
    │  │ (Weight 0.4)│ (Weight 0.2)│ (Weight 0.4)│                         │
    │  └─────────────┴─────────────┴─────────────┘                         │
    │       │              │              │                                │
    │       │              │              │                                │
    │  Coarse: Jaccard Match Domain  Coarse: Jaccard                       │
    │  Top-100         Top-5         Top-100                               │
    │       ↓              ↓              ↓                                │
    │  Fine: Embedding Find Pattern  Fine: Embedding                       │
    │  Top-10          works_well    Top-20                                │
    │       ↓              ↓              ↓                                │
    │  Get Pattern     Get Pattern   Get Pattern                           │
    │  Score           Score         Score                                 │
    │       │              │              │                                │
    │       └──────────────┴──────────────┘                                │
    │                      ↓                                               │
    │             Weighted Fusion & Fine Ranking                           │
    │                      ↓                                               │
    │              Top-10 Patterns                                         │
    │                                                                      │
    └──────────────────────────────────────────────────────────────────────┘
    │
    ├──────────────────────────────────────────────────────────────────────┐
    │                Phase 3: Story Generation & Refinement                │
    │                      (3-10 minutes)                                  │
    ├──────────────────────────────────────────────────────────────────────┤
    │                                                                      │
    │  1. Multi-dimensional Pattern Classification                         │
    │      ├─ Stability                                                    │
    │      ├─ Novelty                                                      │
    │      └─ Cross-Domain                                                 │
    │      ↓                                                               │
    │  2. Select Initial Pattern → Generate Draft Story                    │
    │      ↓                                                               │
    │  3. Multi-Agent Critic Review (Methodology/Novelty/Storyteller)      │
    │      ↓                                                               │
    │  4. Decision: Score >= 7.0?                                          │
    │      ├─[Yes]→ Proceed to Phase 4                                     │
    │      └─[No] → Intelligent Refinement                                 │
    │                 │                                                    │
    │                 ├─ Novelty Stagnated? → [Novelty Mode]               │
    │                 │   ├─ Traverse Novelty Patterns                     │
    │                 │   ├─ Idea Fusion                                   │
    │                 │   ├─ Story Reflection (Quality Assessment)         │
    │                 │   ├─ Regenerate Story                              │
    │                 │   ├─ Critic Review                                 │
    │                 │   ├─ Score Dropped? → Rollback                     │
    │                 │   └─ Fallback: Select Highest Score Version        │
    │                 │                                                    │
    │                 └─ Ordinary Refinement → Inject Complementary Tricks │
    │                     ├─ Lacks Novelty → Tail Injection (Rank 5-10)    │
    │                     ├─ Lacks Stability → Head Injection (Rank 1-3)   │
    │                     └─ Return to Step 2                              │
    │                                                                      │
    └──────────────────────────────────────────────────────────────────────┘
    │
    ├──────────────────────────────────────────────────────────────────────┐
    │                    Phase 4: RAG Novelty Verification                 │
    │                            (Approx. 30s)                             │
    ├──────────────────────────────────────────────────────────────────────┤
    │                                                                      │
    │  1. Extract Key Methods → Retrieve Papers from Top Confs (Last 3 Yrs)│
    │      ↓                                                               │
    │  2. Decision: Similarity > 0.75?                                     │
    │      ├─[No] → Output Final Story                                     │
    │      └─[Yes]→ Collision! Pivot Avoidance                             │
    │                 ├─ Analyze Collision Points                          │
    │                 ├─ Generate Constraints (Disable Tech/Domain Shift)  │
    │                 └─ Return to Phase 3, Step 2                         │
    │                                                                      │
    └──────────────────────────────────────────────────────────────────────┘
    │
    ▼
Output Final Story (JSON format)

Workflow Explanation:

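Phase 1: Offline construction; run once and reuse the graph afterwards.
Phase 2: Real-time retrieval; two-stage ranking brings each run to about 27 seconds (roughly a 13x speedup over full embedding).
Phase 3: Core generation, with an iterative refinement loop driven by Critic scores.
Phase 4: Novelty verification against recent papers to avoid collisions.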

1.2 Core Modules

| Layer | Module | File/Script | Function |
| --- | --- | --- | --- |
| Data Layer | Knowledge Graph Construction | `build_entity_v3.py`, `build_edges.py` | Construct nodes and edges |
| Retrieval Layer | Three-Way Retrieval System | `recall_system.py` | Retrieve relevant Patterns |
| Generation Layer | Pattern Selection | `pattern_selector.py` | Multi-dimensional Pattern classification |
| Generation Layer | Idea Fusion | `planner.py` | Fuse innovative Ideas |
| Generation Layer | Story Generation | `story_generator.py` | Generate Paper Story |
| Generation Layer | Story Reflection | `story_reflector.py` | Assess fusion quality |
| Generation Layer | Critic Review | `critic.py` | Multi-agent review |
| Generation Layer | Intelligent Refinement | `refinement.py` | Iterative optimization |
| Generation Layer | RAG Verification | `verifier.py` | Deduplication and avoidance |
| Orchestration Layer | Pipeline Management | `manager.py`, `idea2story_pipeline.py` | Workflow orchestration |

2. Knowledge Graph Construction

2.1 Data Scale

Knowledge Graph Statistics:
├─ Total Nodes: 16,791
│  ├─ Idea:    8,284 (covers 8,284 of 8,285 papers)
│  ├─ Pattern: 124 (Generated via clustering)
│  ├─ Domain:  98 (Generated via aggregation)
│  └─ Paper:   8,285
└─ Total Edges:   444,872
   ├─ Basic Connection Edges: ~25,000
   └─ Retrieval Auxiliary Edges: ~420,000

2.2 Node Definitions

Idea Node: The core innovation of the paper

{
  "idea_id": "idea_0",
  "description": "Core idea description...",
  "base_problem": "Base problem...",
  "solution_pattern": "Solution pattern...",
  "pattern_ids": ["pattern_9", ...]
}

Pattern Node: A reusable writing trope or method template shared by a cluster of papers

{
  "pattern_id": "pattern_24",
  "name": "Reframing Graph Learning Scalability",
  "size": 331,
  "llm_enhanced_summary": {
    "representative_ideas": "Inductive summary...",
    "common_tricks": ["Trick 1", "Trick 2"]
  }
}

Domain Node: Research domain

{
  "domain_id": "domain_0",
  "name": "Natural Language Processing",
  "paper_count": 1076,
  "sub_domains": ["Text Classification", ...]
}

Paper Node: Concrete paper

{
  "paper_id": "RUzSobdYy0V",
  "title": "Quantifying and Mitigating...",
  "domain": "Fairness & Accountability",
  "idea": "Core idea...",
  "pattern_id": "pattern_9"
}

2.3 Edge Definitions

Basic Connection Edges:

Paper → Idea (implements): The paper implements this Idea.
Paper → Pattern (uses_pattern): The paper uses this Pattern.
Paper → Domain (in_domain): The paper belongs to this Domain.

Retrieval Auxiliary Edges:

Idea → Domain (belongs_to): Domain the Idea belongs to, weight = proportion.
Pattern → Domain (works_well_in): Effectiveness of Pattern in this Domain, weight = effectiveness.
Idea → Paper (similar_to_paper): Similarity weight, computed at query time in Path 3 rather than pre-built into the graph.
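The edge types above can be pictured as typed, weighted triples. The following is a minimal sketch using plain Python tuples rather than the project's actual NetworkX graph; the node IDs, the weights, and the `patterns_for_domain` helper are hypothetical and only illustrate how a `works_well_in` lookup for a Domain might be read off the edge list.

```python
# Each edge: (source, target, relation, weight). All values are illustrative.
EDGES = [
    ("paper_A", "idea_0", "implements", 1.0),
    ("paper_A", "pattern_9", "uses_pattern", 1.0),
    ("paper_A", "domain_0", "in_domain", 1.0),
    ("idea_0", "domain_0", "belongs_to", 1.0),
    ("pattern_9", "domain_0", "works_well_in", 0.8),
    ("pattern_24", "domain_0", "works_well_in", 0.6),
    ("pattern_24", "domain_1", "works_well_in", 0.9),
]


def patterns_for_domain(edges, domain_id):
    """Return (pattern_id, weight) pairs for one Domain, best first."""
    hits = [
        (source, weight)
        for source, target, relation, weight in edges
        if relation == "works_well_in" and target == domain_id
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


ranked = patterns_for_domain(EDGES, "domain_0")
print(ranked)
```

In a NetworkX graph the same information would typically live in edge attributes; the tuple form is used here only to keep the sketch dependency-free.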

2.4 Execution Method

# 1. Build Nodes
python scripts/build_entity_v3.py
# Output: output/nodes_*.json (4 files)

# 2. Build Edges
python scripts/build_edges.py
# Output: output/edges.json, output/knowledge_graph_v2.gpickle

Execution Time: Node construction 15 minutes (including LLM enhancement) + Edge construction 3 minutes.

3. Three-Way Retrieval System

3.1 Retrieval Strategy

| Path | Matching Object | Capture Dimension | Weight | Retrieval Count |
| --- | --- | --- | --- | --- |
| Path 1 | Idea Description | Core idea similarity | 0.4 | Top-10 Pattern |
| Path 2 | Domain & Sub-domains | Domain generalization | 0.2 | Top-5 Pattern |
| Path 3 | Paper Title | Research theme similarity | 0.4 | Top-10 Pattern |
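As a rough illustration of how the three path weights could be combined, the sketch below assumes a plain weighted sum of per-path Pattern scores. The function name `fuse_pattern_scores` and the input scores are hypothetical; the project's real scoring logic is described in `docs/PATTERN_SCORING_EXPLAINED.md` and may differ.

```python
from collections import defaultdict

# Weights taken from the retrieval strategy; everything else is illustrative.
PATH_WEIGHTS = {"path1": 0.4, "path2": 0.2, "path3": 0.4}


def fuse_pattern_scores(path_scores, weights=PATH_WEIGHTS, top_k=10):
    """Sum weight * score per Pattern across paths and keep the top_k."""
    fused = defaultdict(float)
    for path, scores in path_scores.items():
        for pattern_id, score in scores.items():
            fused[pattern_id] += weights[path] * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


example = {
    "path1": {"pattern_9": 0.9, "pattern_24": 0.5},
    "path2": {"pattern_24": 1.0},
    "path3": {"pattern_9": 0.5, "pattern_3": 0.75},
}
ranked = fuse_pattern_scores(example)
for pattern_id, score in ranked:
    print(f"{pattern_id}: {score:.2f}")
```

A Pattern found by several paths accumulates score from each, which is one way the paths can complement one another.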

3.2 Two-Stage Retrieval Optimization

Performance Comparison:

Full Embedding: ~7 minutes (8,284 API calls)
Two-Stage Retrieval: ~27 seconds (100 API calls)
Speedup Ratio: 13x

Process:

Coarse Ranking: Jaccard fast filtering Top-100 (Milliseconds)
    ↓
Fine Ranking: Embedding precise sorting Top-10/20 (~27 seconds)
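The coarse-then-fine process can be sketched as follows. This is a simplified illustration, not the code in `recall_system.py`: `tokenize`, `two_stage_recall`, and the stand-in scorer are hypothetical, and the stand-in replaces the real embedding API call so the example runs offline.

```python
def tokenize(text):
    """Lowercased whitespace tokens; a deliberately simple assumption."""
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def two_stage_recall(query, candidates, fine_score, coarse_k=100, fine_k=10):
    query_tokens = tokenize(query)
    # Stage 1: cheap lexical filter over every candidate, no API calls.
    shortlist = sorted(
        candidates,
        key=lambda c: jaccard(query_tokens, tokenize(c)),
        reverse=True,
    )[:coarse_k]
    # Stage 2: the expensive scorer only sees the shortlist.
    return sorted(shortlist, key=lambda c: fine_score(query, c), reverse=True)[:fine_k]


fine_calls = []


def stand_in_fine_score(query, candidate):
    """Placeholder for an embedding similarity call; records each invocation."""
    fine_calls.append(candidate)
    return jaccard(tokenize(query), tokenize(candidate))


corpus = [f"paper about topic {i}" for i in range(1000)]
corpus.append("graph learning scalability")
top = two_stage_recall("scalability of graph learning", corpus, stand_in_fine_score)
print(top[0], len(fine_calls))
```

In this toy run the fine scorer is invoked 100 times instead of 1,001, which is the source of the savings: the number of expensive calls is bounded by the coarse shortlist size rather than the corpus size.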

3.3 Similarity Calculation

Jaccard Similarity (Coarse Ranking):

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Embedding Similarity (Fine Ranking):

Cosine(A, B) = dot(emb_A, emb_B) / (norm(emb_A) * norm(emb_B))
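Both formulas are short enough to write out directly. The sketch below uses only the standard library; treating empty sets and zero vectors as similarity 0.0 is an assumption made here to avoid division by zero, not something the source specifies.

```python
import math


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, with 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v) -> float:
    """dot(u, v) / (norm(u) * norm(v)), with 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two of four distinct tokens are shared: 2 / 4.
j = jaccard({"graph", "learning", "scalability"}, {"graph", "learning", "efficiency"})
# Vectors 45 degrees apart: 1 / sqrt(2).
c = cosine([1.0, 0.0], [1.0, 1.0])
print(j, round(c, 4))
```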

3.4 Execution Method

# Run independently
python scripts/simple_recall_demo.py "Your Research Idea"

# Use as a class
from recall_system import RecallSystem
system = RecallSystem()
results = system.recall(user_idea, verbose=True)

Output: List of Top-10 Patterns, each containing (pattern_id, pattern_info, score).

4. Idea2Story Pipeline

4.1 Core Mechanisms

(1) Multi-dimensional Pattern Classification

Goal: Ensure Pattern diversity.

Dimensions:

Stability: Rank Top-3 + Cluster Size ≥ 15.
Novelty: Cluster Size < 10.
Cross-Domain: From Path 2/3 + Different Domain.
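The three dimensions can be read as simple predicates over a retrieved Pattern. The sketch below is one possible interpretation: the function name, its arguments, the 0-based rank convention, and the choice to let a Pattern carry several labels are all assumptions, and `pattern_selector.py` may resolve overlaps differently.

```python
def classify_pattern(rank, cluster_size, source_paths, pattern_domain, idea_domain):
    """Return the dimension labels a retrieved Pattern qualifies for.

    rank is 0-based, so rank < 3 corresponds to "Rank Top-3".
    """
    labels = []
    if rank < 3 and cluster_size >= 15:
        labels.append("stability")
    if cluster_size < 10:
        labels.append("novelty")
    if source_paths & {"path2", "path3"} and pattern_domain != idea_domain:
        labels.append("cross_domain")
    return labels


print(classify_pattern(0, 331, {"path1"}, "NLP", "NLP"))
print(classify_pattern(7, 6, {"path3"}, "Vision", "NLP"))
print(classify_pattern(5, 12, {"path1"}, "NLP", "NLP"))
```

Note that a cluster size between 10 and 14 satisfies neither the Stability nor the Novelty rule as stated, so such a Pattern is labeled only if it qualifies as Cross-Domain.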

(2) Idea Fusion

Goal: Organic fusion at the conceptual level, not just technical stacking.

Process:

Original Idea + New Pattern → LLM Generated Fused Idea
    ↓
Fused Idea contains:
  - fused_core_idea: Core idea after fusion
  - conceptual_bridge: Conceptual bridge
  - reframed_problem: Reframed problem
  - innovation_angle: Unique innovation angle

Example:

Original Idea: Use LLM for data augmentation
New Pattern: Curriculum Learning
Fused Idea: A difficulty-adaptive curriculum learning framework built on LLM-generated data
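Since the Fused Idea has four named fields, one way to handle the LLM's reply is to parse it into a small typed record and reject incomplete answers. This is a sketch under the assumption that the LLM returns a flat JSON object with exactly these keys; `FusedIdea`, `parse_fused_idea`, and the sample text are hypothetical and not taken from `planner.py`.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class FusedIdea:
    fused_core_idea: str
    conceptual_bridge: str
    reframed_problem: str
    innovation_angle: str


def parse_fused_idea(raw_json: str) -> FusedIdea:
    """Parse an LLM reply and fail loudly if any field is missing or empty."""
    data = json.loads(raw_json)
    missing = [f.name for f in fields(FusedIdea) if not data.get(f.name)]
    if missing:
        raise ValueError(f"fused idea is missing fields: {missing}")
    return FusedIdea(**{f.name: data[f.name] for f in fields(FusedIdea)})


# Illustrative reply; the wording is made up for this example.
reply = json.dumps({
    "fused_core_idea": "Difficulty-adaptive curriculum built from LLM-generated data",
    "conceptual_bridge": "Generated samples can be ordered by estimated difficulty",
    "reframed_problem": "Augmentation as curriculum design rather than data volume",
    "innovation_angle": "The generator controls difficulty, not only diversity",
})
idea = parse_fused_idea(reply)
print(idea.fused_core_idea)
```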

(3) Story Reflection

Goal: Assess fusion quality and ensure conceptual unity.

Scoring:

fusion_quality = 0.4 × Coherence + 0.4 × Fusion Richness + 0.2 × Fusion Idea Reward

Threshold: fusion_quality >= 0.65 is considered a successful fusion.
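A worked instance of the scoring formula, assuming each component is scored in the range 0 to 1 (the source does not state the range explicitly; the threshold of 0.65 suggests it). The component values are made up for illustration.

```python
FUSION_QUALITY_THRESHOLD = 0.65


def fusion_quality(coherence, fusion_richness, fusion_idea_reward):
    """Weighted sum: 0.4 * coherence + 0.4 * richness + 0.2 * idea reward."""
    return 0.4 * coherence + 0.4 * fusion_richness + 0.2 * fusion_idea_reward


# 0.4 * 0.7 + 0.4 * 0.6 + 0.2 * 0.8 = 0.28 + 0.24 + 0.16 = 0.68
score = fusion_quality(coherence=0.7, fusion_richness=0.6, fusion_idea_reward=0.8)
is_successful = score >= FUSION_QUALITY_THRESHOLD
print(f"{score:.2f}", is_successful)
```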

(4) Multi-Agent Critic Review

Roles:

Reviewer A (Methodology): Technical soundness.
Reviewer B (Novelty): Innovation.
Reviewer C (Storyteller): Narrative completeness.

Pass Standard: Average Score >= 7.0.
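The pass check is a plain average over the three reviewers. In the sketch below the dictionary keys and the function name `critic_verdict` are hypothetical; only the 7.0 threshold and the use of an average come from the text.

```python
from statistics import mean

PASS_SCORE = 7.0


def critic_verdict(scores):
    """Return (average, passed) for a dict of per-reviewer scores."""
    average = mean(scores.values())
    return average, average >= PASS_SCORE


average, passed = critic_verdict(
    {"methodology": 7.5, "novelty": 6.0, "storyteller": 7.5}
)
print(average, passed)
```

Because the criterion is an average, a Story can pass with one dimension below 7.0, as in this example where novelty is 6.0.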

(5) Intelligent Refinement

Novelty Mode:

Trigger: Novelty score stagnation (≤ Previous Round + 0.5).
Process: Traverse all Novelty Patterns, each undergoing Fusion → Reflection → Generation → Critic.
Fallback: Select the version with the highest score.

Score Degradation Rollback:

Trigger: Any dimension's score drops by more than 0.1.
Process: Restore the previous Story, mark the Pattern as failed, remove the injected Tricks, then continue iterating.

Ordinary Refinement:

Tail Injection: When the Story lacks novelty, inject lower-ranked, less common Patterns (Rank 5-10).
Head Injection: When the Story lacks stability, inject mature, top-ranked Patterns (Rank 1-3).
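The triggers above reduce to a few comparisons. The following is a minimal sketch rather than the logic in `refinement.py`: the function names and the `"novelty"` / `"stability"` argument values are hypothetical, while the thresholds and 0-based rank ranges mirror the values listed in the pipeline configuration.

```python
NOVELTY_STAGNATION_DELTA = 0.5
SCORE_DEGRADATION_THRESHOLD = 0.1
TAIL_INJECTION_RANK_RANGE = (4, 9)  # 0-based, i.e. Rank 5-10
HEAD_INJECTION_RANK_RANGE = (0, 2)  # 0-based, i.e. Rank 1-3


def novelty_stagnated(prev_novelty, curr_novelty):
    """Stagnation: the new score is at most the previous score plus 0.5."""
    return curr_novelty <= prev_novelty + NOVELTY_STAGNATION_DELTA


def should_rollback(prev_scores, curr_scores):
    """Rollback if any dimension dropped by more than the threshold."""
    return any(
        prev_scores[dim] - curr_scores[dim] > SCORE_DEGRADATION_THRESHOLD
        for dim in prev_scores
    )


def injection_rank_range(lacking):
    """Pick which slice of the ranked Patterns to inject from."""
    if lacking == "novelty":
        return TAIL_INJECTION_RANK_RANGE
    if lacking == "stability":
        return HEAD_INJECTION_RANK_RANGE
    raise ValueError(f"unknown weakness: {lacking}")


print(novelty_stagnated(6.0, 6.3))
print(should_rollback({"methodology": 7.0, "novelty": 6.0},
                      {"methodology": 6.8, "novelty": 6.5}))
print(injection_rank_range("novelty"))
```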

(6) RAG Novelty Verification & Avoidance

Verification: Retrieve top conference papers from the last 3 years; Similarity > 0.75 is considered a collision.

Avoidance: Pivot strategy to generate constraints (Domain shift, setting limitations, etc.), then regenerate Story.
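The collision test itself is a threshold comparison over retrieved papers. The sketch below assumes the similarity scores have already been computed; `find_collisions` and the paper IDs are hypothetical, and the strict `>` follows the "Similarity > 0.75" wording.

```python
COLLISION_THRESHOLD = 0.75


def find_collisions(similarities, threshold=COLLISION_THRESHOLD):
    """Return (paper_id, similarity) pairs above the threshold, highest first."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [(paper_id, sim) for paper_id, sim in ranked if sim > threshold]


collisions = find_collisions({"paper_a": 0.81, "paper_b": 0.75, "paper_c": 0.40})
needs_pivot = bool(collisions)
print(collisions, needs_pivot)
```

With a strict comparison, a paper at exactly 0.75 is not treated as a collision; whether the project uses `>` or `>=` in code should be checked against `verifier.py`.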

4.2 Execution Method

python scripts/idea2story_pipeline.py "Your Research Idea"

Output:

output/
├── final_story.json          # Final Paper Story
├── pipeline_result.json      # Complete Pipeline Result
└── log.json                  # Detailed Log

Execution Time: 3-10 minutes (depending on iteration count).

5. Configuration Overview

5.1 Knowledge Graph Construction

# scripts/build_entity_v3.py

# Data source paths
DATA_DIR = PROJECT_ROOT / "data" / "ICLR_25"
ASSIGNMENTS_FILE = DATA_DIR / "assignments.jsonl"
CLUSTER_LIBRARY_FILE = DATA_DIR / "cluster_library_sorted.jsonl"
PATTERN_DETAILS_FILE = DATA_DIR / "iclr_patterns_full.jsonl"

# LLM API Config
SILICONFLOW_API_KEY = os.getenv("SILICONFLOW_API_KEY")
LLM_API_URL = "https://api.siliconflow.cn/v1/chat/completions"
LLM_MODEL = "Qwen/Qwen2.5-7B-Instruct"

5.2 Retrieval System

# scripts/recall_system.py

class RecallConfig:
    # Path Weights
    PATH1_WEIGHT = 0.4  # Similar Idea
    PATH2_WEIGHT = 0.2  # Domain Relevance
    PATH3_WEIGHT = 0.4  # Similar Paper

    # Retrieval Counts
    PATH1_TOP_K_IDEAS = 10
    PATH1_FINAL_TOP_K = 10
    PATH2_TOP_K_DOMAINS = 5
    PATH2_FINAL_TOP_K = 5
    PATH3_TOP_K_PAPERS = 20
    PATH3_FINAL_TOP_K = 10
    FINAL_TOP_K = 10

    # Two-Stage Retrieval
    USE_EMBEDDING = True
    TWO_STAGE_RECALL = True
    COARSE_RECALL_SIZE = 100
    FINE_RECALL_SIZE = 20

5.3 Pipeline

# scripts/pipeline/config.py

class PipelineConfig:
    # Pattern Selection
    SELECT_PATTERN_COUNT = 3
    CONSERVATIVE_RANK_RANGE = (0, 2)
    INNOVATIVE_CLUSTER_SIZE_THRESHOLD = 10

    # Critic Threshold
    PASS_SCORE = 7.0
    MAX_REFINE_ITERATIONS = 3

    # Novelty Mode
    NOVELTY_MODE_MAX_PATTERNS = 10
    NOVELTY_SCORE_THRESHOLD = 6.0
    NOVELTY_STAGNATION_DELTA = 0.5

    # Reflection
    FUSION_QUALITY_THRESHOLD = 0.65

    # Rollback
    SCORE_DEGRADATION_THRESHOLD = 0.1

    # RAG Verification
    COLLISION_THRESHOLD = 0.75

    # Refinement Strategy
    TAIL_INJECTION_RANK_RANGE = (4, 9)
    HEAD_INJECTION_RANK_RANGE = (0, 2)
    HEAD_INJECTION_CLUSTER_THRESHOLD = 15

# LLM Config
LLM_API_KEY = os.getenv("SILICONFLOW_API_KEY")
LLM_API_URL = "https://api.siliconflow.cn/v1/chat/completions"
LLM_MODEL = "Qwen/Qwen3-14B"

6. Complete Workflow

6.1 Environment Setup

# 1. Enter the project directory (after cloning the repository)
cd Paper-KG-Pipeline

# 2. Install Dependencies
pip install -r requirements.txt

# 3. Set Environment Variable
export SILICONFLOW_API_KEY="your_api_key_here"

6.2 One-Time Build

# Build Knowledge Graph (Run only once)
python scripts/build_entity_v3.py   # 15 minutes
python scripts/build_edges.py       # 3 minutes

6.3 Use Pipeline

# Generate Paper Story
python scripts/idea2story_pipeline.py "Your Research Idea Description"

# Example
python scripts/idea2story_pipeline.py "Optimizing Large Model Inference Efficiency with Reinforcement Learning"

6.4 View Results

# View Final Story
cat output/final_story.json

# View Complete Pipeline
cat output/pipeline_result.json

# View Detailed Log
cat output/log.json | jq '.'

7. Core Innovations

7.1 Knowledge Graph Level

✅ LLM-Enhanced Pattern: Generate inductive summaries for each Pattern cluster.
✅ Dual-Layer Description: Concrete examples + Global summary, enabling both learning and understanding.
✅ Quality-Oriented Edge Weights: Calculate edge weights based on paper quality and Pattern effectiveness.

7.2 Retrieval Level

✅ Three-Way Complementary Retrieval: Capture relevance from Idea, Domain, and Paper dimensions.
✅ Two-Stage Optimization: Jaccard coarse ranking + Embedding fine ranking, 13x speedup.
✅ Real-Time Path 3 Calculation: Avoid pre-building redundant edges, ensuring complementarity.

7.3 Generation Level

✅ Idea Fusion: Organic fusion at the conceptual level rather than technical stacking.
✅ Story Reflection: Reflect on fusion quality to assess conceptual unity.
✅ Novelty-First Mode: When the novelty score stagnates, automatically switch to Novelty Mode and systematically try novelty-oriented Patterns.
✅ Intelligent Rollback: Avoid ineffective refinement to improve iteration efficiency.
✅ Fallback Strategy: Guarantee output quality by selecting the highest-scoring version.

8. System Advantages

8.1 High Degree of Automation

✅ Fully automated process, no manual intervention required.
✅ Intelligent decision mechanisms (Novelty Mode, Rollback, Fallback).
✅ Adaptive parameter adjustment.

8.2 Multi-Layer Quality Assurance

Pattern Layer: LLM-enhanced high-quality Pattern library.
Retrieval Layer: Three-way complementary retrieval, comprehensive coverage.
Fusion Layer: Idea Fusion ensures conceptual unity.
Reflection Layer: Story Reflection assesses fusion quality.
Review Layer: Three-role Critic for comprehensive evaluation.
Verification Layer: RAG avoids collision.

8.3 Extensive Efficiency Optimization

✅ Two-stage retrieval speeds up by 13x (7 mins → 27 secs).
✅ Intelligent rollback avoids ineffective iterations.
✅ Pattern failure marking avoids repeated attempts.
✅ LLM response caching reduces API calls.

8.4 Strong Scalability

✅ Modular design, easy to add new features.
✅ Supports incremental updates to the knowledge graph.
✅ Adaptable to other conference data sources.
✅ Can add new retrieval paths.

9. Current Limitations & Future Directions

9.1 Data Level

Current Limitation:

⚠️ Domain granularity is too coarse; 98 Domains cover 8,285 papers.

Future Direction:

📌 Introduce Domain hierarchy (Main Domain → Sub-domain).
📌 Use sub_domains for fine-grained matching.
📌 Extend to Review data from more conferences.

9.2 Retrieval Level

Current Limitation:

⚠️ Path 2 Domain matching is based on keywords, which may not be precise.
⚠️ Retrieval speed still has room for optimization (27 seconds).

Future Direction:

📌 Use Embedding to calculate semantic similarity between Idea and Domain.
📌 Introduce a vector index or database (e.g., Faiss or Milvus), targeting 1-3 seconds per query.
📌 Pre-compute and cache all Embeddings.

9.3 Generation Level

Current Limitation:

⚠️ Fusion quality scoring relies on LLM, which may be unstable.
⚠️ Novelty Mode traversing 10 Patterns may be time-consuming.

Future Direction:

📌 Introduce a learnable fusion quality scoring model.
📌 Optimize Pattern selection order based on historical data.
📌 Generate multiple Story candidates in parallel.

9.4 Review Level

Current Limitation:

⚠️ Critic scoring relies on LLM and may fluctuate.
⚠️ No user feedback mechanism.

Future Direction:

📌 Collect real review data to train dedicated Critic models.
📌 Introduce user feedback for online learning and weight adjustment.
📌 A/B test effects of different strategies.

10. Documentation Index

10.1 Core Documentation

| Document | Path | Content |
| --- | --- | --- |
| Project Summary | `docs/00_PROJECT_OVERVIEW.md` | This document, overall overview |
| KG Construction | `docs/01_KG_CONSTRUCTION.md` | Data source, nodes, edges, execution method |
| Retrieval System | `docs/02_RECALL_SYSTEM.md` | Three-way retrieval, similarity calculation, config |
| Idea2Story Pipeline | `docs/03_IDEA2STORY_PIPELINE.md` | Pattern selection, Fusion, Reflection, Critic |

10.2 Auxiliary Documentation

| Document | Path | Content |
| --- | --- | --- |
| Edge Types | `docs/EDGE_TYPES.md` | Detailed edge definitions and weight calculations |
| Pattern Scoring | `docs/PATTERN_SCORING_EXPLAINED.md` | Pattern score calculation logic |
| Two-Stage Retrieval | `docs/TWO_STAGE_RECALL_OPTIMIZATION.md` | Retrieval performance optimization details |
| Data Format | `docs/Data_Format_Comparison.md` | V2 vs V3 data format changes |

10.3 Historical Documentation (Archived)

The following documents record system evolution history, but core content has been integrated into the 4 main documents above:

NOVELTY_MODE_FIX.md
REFLECTION_REGENERATION_FIX.md
WORKFLOW_CORRECTION_2025-01-25.md
REFINE_SYSTEM_UPGRADE.md
RECALL_USAGE_V3.md
etc.

11. Code Structure

Paper-KG-Pipeline/
├── data/                           # Data Sources
│   └── ICLR_25/
│       ├── assignments.jsonl
│       ├── cluster_library_sorted.jsonl
│       └── iclr_patterns_full.jsonl
│
├── output/                         # Output Files
│   ├── nodes_*.json               # 4 types of nodes
│   ├── edges.json                 # Edge data
│   ├── knowledge_graph_v2.gpickle # NetworkX graph
│   ├── final_story.json           # Final Story
│   └── pipeline_result.json       # Pipeline results
│
├── scripts/                        # Core Scripts
│   ├── build_entity_v3.py         # Build nodes
│   ├── build_edges.py             # Build edges
│   ├── recall_system.py           # Retrieval system (Class encapsulation)
│   ├── simple_recall_demo.py      # Retrieval Demo
│   ├── idea2story_pipeline.py     # Pipeline Main Entry
│   │
│   └── pipeline/                   # Pipeline Modules
│       ├── config.py              # Configuration parameters
│       ├── manager.py             # Workflow orchestration
│       ├── pattern_selector.py    # Pattern classification
│       ├── planner.py             # Idea Fusion
│       ├── story_generator.py     # Story generation
│       ├── story_reflector.py     # Story reflection
│       ├── critic.py              # Critic review
│       ├── refinement.py          # Intelligent refinement
│       ├── verifier.py            # RAG verification
│       └── utils.py               # Utility functions
│
├── docs/                           # Documentation
│   ├── 00_PROJECT_OVERVIEW.md     # Project Summary (This file)
│   ├── 01_KG_CONSTRUCTION.md      # KG Construction
│   ├── 02_RECALL_SYSTEM.md        # Retrieval System
│   └── 03_IDEA2STORY_PIPELINE.md  # Idea2Story Pipeline
│
└── requirements.txt                # Dependencies

12. Key Metrics

12.1 Data Scale

Knowledge Graph:
  - Nodes: 16,791
  - Edges: 444,872
  - Pattern: 124 (all LLM-enhanced)
  - Idea Coverage: ~100% (8,284 of 8,285 papers)

12.2 Performance Metrics

Retrieval Speed:
  - Full Embedding: ~7 minutes
  - Two-Stage Retrieval: ~27 seconds
  - Speedup Ratio: 13x

Pipeline Execution Time:
  - Fastest: 3 minutes (First pass)
  - Typical: 5-7 minutes (2-3 refinement rounds)
  - Slowest: 10 minutes (Novelty Mode)

12.3 Quality Metrics

Critic Review:
  - Pass Standard: Average Score >= 7.0
  - Dimensions: Methodology, Novelty, Storyteller
  - Novelty Mode Boost: 0.5-1.5 points

Fusion Quality:
  - Threshold: >= 0.65
  - Typical Value: 0.68-0.75
  - Scoring Dimensions: Coherence (40%) + Fusion Richness (40%) + Fusion Idea Reward (20%)

13. Usage Recommendations

13.1 Quick Start

# 1. First Run (Build Knowledge Graph)
python scripts/build_entity_v3.py
python scripts/build_edges.py

# 2. Generate Paper Story
python scripts/idea2story_pipeline.py "Your Research Idea"

# 3. View Results
cat output/final_story.json

13.2 Parameter Tuning

Improve Novelty:

# Increase Novelty Mode attempts
PipelineConfig.NOVELTY_MODE_MAX_PATTERNS = 15  # Default 10

# Increase Novelty weight
RecallConfig.PATH1_WEIGHT = 0.5  # Default 0.4, increase Similar Idea weight

Improve Stability:

# Lower Fusion Quality Threshold
PipelineConfig.FUSION_QUALITY_THRESHOLD = 0.60  # Default 0.65

# Increase Head Pattern weight
RecallConfig.PATH3_WEIGHT = 0.5  # Default 0.4, increase High-Quality Paper weight

Accelerate Retrieval:

# Reduce Retrieval Count
RecallConfig.PATH1_TOP_K_IDEAS = 5   # Default 10
RecallConfig.PATH3_TOP_K_PAPERS = 10 # Default 20

13.3 Monitoring Key Events

# ✅ Novelty mode activated
grep "激活【新颖性模式】" output/log.json

# 📊 Fusion quality evaluation
grep "融合质量评分" output/log.json

# 🔁 Rollback triggered 
grep "【ROLLBACK TRIGGERED】" output/log.json

# 🎉 Final Pass
grep "🎉 Critic 评审通过" output/log.json

14. Troubleshooting

14.1 Environment Issues

Q: API key invalid

# Check Environment Variable
echo $SILICONFLOW_API_KEY

# Set Environment Variable
export SILICONFLOW_API_KEY="your_key_here"

Q: Missing dependencies

# Reinstall dependencies
pip install -r requirements.txt --upgrade

14.2 Data Issues

Q: Node files do not exist

# Rebuild Knowledge Graph
python scripts/build_entity_v3.py
python scripts/build_edges.py

Q: Retrieval result is empty

# Check if Knowledge Graph is built successfully
ls -lh output/nodes_*.json
ls -lh output/knowledge_graph_v2.gpickle

14.3 Pipeline Issues

Q: Fusion quality always below threshold

# Lower threshold or improve Fusion Prompt
PipelineConfig.FUSION_QUALITY_THRESHOLD = 0.60

Q: Novelty Mode traversed all but still did not pass

# Check fallback strategy in log
grep "兜底策略" output/log.json
# System automatically selects the highest scoring version to output

15. Summary

15.1 Core Achievements

✅ Complete Knowledge Graph System: 16,791 nodes, 444,872 edges.
✅ Efficient Retrieval System: 13x speedup, about 27 seconds per query.
✅ Intelligent Generation Pipeline: Fusion + Reflection + Critic + Intelligent Refinement.
✅ Quality Assurance Mechanism: Multi-layer checks, automatic rollback, fallback strategy.
✅ Complete Documentation System: 4 core documents covering construction, retrieval, generation.

15.2 Technical Highlights

✅ Conceptual Level Fusion: Idea Fusion achieves organic unity rather than technical stacking.
✅ Fusion Quality Reflection: Story Reflector assesses fusion effectiveness.
✅ Novelty First: Automatically upgrades to Novelty Mode when stagnated.
✅ Intelligent Rollback: Avoids ineffective refinement, improving efficiency.
✅ LLM-Enhanced Pattern: Dual-layer description improves usability.

15.3 Application Value

✅ Research Assistance: Helps researchers quickly generate paper frameworks.
✅ Innovation Exploration: Discovers new research directions through Pattern fusion.
✅ Writing Guidance: Provides structured paper organization suggestions.
✅ Literature Survey: Quickly locates relevant work based on Knowledge Graph.

15.4 Future Outlook

📌 Data Expansion: Integrate data from more conferences (CVPR, NeurIPS, ACL, etc.).
📌 Model Optimization: Train dedicated Fusion and Critic models.
📌 User Interaction: Introduce user feedback for online learning and optimization.
📌 Multi-modal Support: Integrate charts, formulas, code, and other multi-modal information.

16. Acknowledgements

Thanks to the ICLR 2025 paper dataset for support, and SiliconFlow for providing LLM API services.

Generated Date: 2026-01-25
Version: V1.0
Author: Idea2Paper Team

Contact: Refer to core documents for detailed technical support.
