Research Directions

dippatel1994 edited this page Feb 13, 2026 · 1 revision

This page outlines potential improvements to PaperBanana's architecture and areas where research contributions would be valuable. These range from incremental enhancements to fundamental architectural changes.

Retrieval Improvements

Current Limitation

The Retriever uses semantic similarity between methodology texts to select reference examples. This works reasonably well but can miss relevant examples when the surface-level text differs despite similar underlying diagram structures.

Research Directions

Multi-modal retrieval: Instead of text-only similarity, compute similarity using both the methodology text AND the diagram image. This would help retrieve structurally similar diagrams even when the text descriptions differ.
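A minimal sketch of the blending step, assuming hypothetical text and image embedding vectors are already computed by some encoder (the names and the `alpha` weight are illustrative, not part of the current Retriever):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def multimodal_score(query_text_vec, cand_text_vec, cand_image_vec,
                     query_image_vec=None, alpha=0.6):
    """Blend text and image similarity; fall back to text-only when the
    query has no image (the common case when generating from scratch)."""
    text_sim = cosine(query_text_vec, cand_text_vec)
    if query_image_vec is None:
        return text_sim
    image_sim = cosine(query_image_vec, cand_image_vec)
    return alpha * text_sim + (1 - alpha) * image_sim
```

The fallback matters because a fresh generation request has no query image yet; the image channel only kicks in for refinement or style-matching queries.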

Learned retrieval: Fine-tune an embedding model specifically for methodology-to-diagram matching. The current approach uses general-purpose embeddings that aren't optimized for this task.

Hierarchical retrieval: First retrieve by category (agent, vision, generative, science), then by structural similarity within category. This two-stage approach could improve precision.

Dynamic example count: Instead of fixed top-k retrieval, dynamically determine how many examples to retrieve based on query complexity and available context window.
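One way the dynamic count could work is a simple token-budget calculation; the constants below (reserved tokens, bounds) are placeholder assumptions, not tuned values from the project:

```python
def dynamic_k(query_tokens, avg_example_tokens, context_window,
              reserved_tokens=2048, k_min=1, k_max=8):
    """Pick how many reference examples fit in the remaining context.

    reserved_tokens leaves room for the system prompt and the model's
    response; k is clamped so retrieval never returns zero examples
    and never floods the prompt."""
    budget = context_window - query_tokens - reserved_tokens
    if budget <= 0:
        return k_min
    k = budget // avg_example_tokens
    return max(k_min, min(k_max, k))
```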

Planning Improvements

Current Limitation

The Planner generates diagram descriptions via in-context learning from retrieved examples. Quality depends heavily on example quality and relevance.

Research Directions

Structured intermediate representation: Instead of free-form text descriptions, have the Planner output a structured format (graph representation, component list, connection matrix). This would make the description more precise and easier to verify.
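A sketch of what such a structured plan might look like, with a trivial consistency check (the field names here are illustrative, not an existing PaperBanana schema):

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    id: str
    label: str
    kind: str = "box"  # e.g. "box", "arrow-group", "image-panel"

@dataclass
class DiagramPlan:
    components: list = field(default_factory=list)
    connections: list = field(default_factory=list)  # (src_id, dst_id, label)

    def validate(self):
        """Return a list of problems; an empty list means the plan is
        internally consistent (every connection endpoint exists)."""
        ids = {c.id for c in self.components}
        problems = []
        for src, dst, _ in self.connections:
            if src not in ids:
                problems.append(f"unknown source component: {src}")
            if dst not in ids:
                problems.append(f"unknown target component: {dst}")
        return problems
```

Checks like `validate` are exactly what free-form text descriptions make impossible: a typo'd component name in prose silently degrades the final image, while here it fails loudly before rendering.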

Compositional planning: Break complex diagrams into sub-components, plan each separately, then compose. This mirrors how humans often design complex figures.

Plan verification: Add a verification step between Planner and Stylist that checks the plan for completeness and consistency against the source text before proceeding.

Multi-plan generation: Generate multiple candidate plans and select the best one, either via self-consistency or a dedicated ranker.
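The selection step itself is simple regardless of how the scores are produced; a sketch, where `score_fn` stands in for either a dedicated ranker model or a self-consistency vote (both hypothetical here):

```python
def select_best_plan(candidate_plans, score_fn):
    """Score each candidate plan and return the highest-scoring one.

    score_fn is any callable mapping a plan to a float; plugging in a
    ranker model or a consistency-vote counter does not change this
    selection logic."""
    scored = [(score_fn(p), p) for p in candidate_plans]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return scored[0][1]
```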

Styling Improvements

Current Limitation

The Stylist applies static NeurIPS-style guidelines. There's no adaptation to different venues, diagram types, or user preferences.

Research Directions

Style transfer from examples: Given a reference diagram, extract its visual style (colors, fonts, layout conventions) and apply it to new diagrams.

Venue-specific styling: Load different guidelines for different venues (ICLR, CVPR, Nature, etc.) with their specific formatting conventions.

User preference learning: Learn individual user preferences over time and apply them automatically.

Adaptive color schemes: Generate color palettes that work well together and respect accessibility guidelines (colorblind-safe, print-friendly).
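As a very crude sketch of the accessibility gate, one could reject palettes whose closest color pair is hard to distinguish. Euclidean RGB distance is a stand-in assumption here; a real check should simulate color-vision deficiencies in a perceptual color space:

```python
def hex_to_rgb(h):
    """Parse '#rrggbb' into an (r, g, b) tuple of ints."""
    h = h.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def min_pairwise_distance(palette):
    """Smallest Euclidean RGB distance between any two palette colors."""
    rgbs = [hex_to_rgb(c) for c in palette]
    best = float("inf")
    for i in range(len(rgbs)):
        for j in range(i + 1, len(rgbs)):
            d = sum((a - b) ** 2 for a, b in zip(rgbs[i], rgbs[j])) ** 0.5
            best = min(best, d)
    return best

def palette_ok(palette, threshold=80.0):
    """Reject palettes whose closest pair is too similar; the threshold
    is an illustrative guess, not a calibrated value."""
    return min_pairwise_distance(palette) >= threshold
```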

Visualization Improvements

Current Limitation

The Visualizer relies on Gemini's image generation, which has limited control over fine details like exact positioning, font rendering, and alignment.

Research Directions

Hybrid rendering: Generate a rough layout with image generation, then refine with programmatic rendering (SVG, TikZ, or Matplotlib for precise elements).

Code generation path: For simpler diagrams, generate TikZ or Graphviz code instead of using image generation. This gives pixel-perfect control and editable outputs.
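For the Graphviz route, the generation target could be as small as a DOT string built from a component/connection list. A minimal sketch (the tuple shapes are assumptions for illustration); the output renders with `dot -Tsvg` into an editable vector figure:

```python
def plan_to_dot(components, connections, rankdir="LR"):
    """Emit Graphviz DOT for a simple left-to-right pipeline diagram.

    components: list of (id, label); connections: list of (src, dst, label).
    """
    lines = [
        "digraph methodology {",
        f"  rankdir={rankdir};",
        '  node [shape=box, fontname="Helvetica"];',
    ]
    for cid, label in components:
        lines.append(f'  {cid} [label="{label}"];')
    for src, dst, label in connections:
        lines.append(f'  {src} -> {dst} [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)
```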

Layout optimization: Add a post-processing step that detects and corrects layout issues (overlapping labels, misaligned boxes) using computer vision.
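The core collision test is standard axis-aligned box intersection. A sketch, assuming label bounding boxes have already been extracted by some text detector run on the rendered image:

```python
def boxes_overlap(a, b):
    """Axis-aligned overlap test; boxes are (x0, y0, x1, y1) with
    x0 < x1 and y0 < y1. Touching edges do not count as overlap."""
    return not (a[2] <= b[0] or b[2] <= a[0] or
                a[3] <= b[1] or b[3] <= a[1])

def find_overlapping_labels(boxes):
    """Return index pairs of detected label boxes that collide."""
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if boxes_overlap(boxes[i], boxes[j])]
```

Each reported pair could then be fed back into the refinement loop as a concrete, spatially grounded correction request.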

Vector output: Generate SVG output instead of raster images for publication-quality figures that scale perfectly.

Critique Improvements

Current Limitation

The Critic provides textual feedback but may miss subtle issues or provide vague corrections that don't translate well to the next iteration.

Research Directions

Spatial grounding: Have the Critic output bounding boxes or coordinates pointing to specific issues, not just text descriptions.

Structured critique: Output a checklist of pass/fail criteria rather than free-form feedback. This makes it easier to verify improvements.
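A sketch of what a structured critique might aggregate to (the criterion names and shape are illustrative, not the Critic's current output format):

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    passed: bool
    note: str = ""

def critique_summary(criteria):
    """Aggregate a pass/fail checklist into (all_passed, failure notes).

    The binary verdict per criterion is what makes it easy to check,
    iteration over iteration, whether a specific issue was fixed."""
    failures = [f"{c.name}: {c.note}" for c in criteria if not c.passed]
    return (len(failures) == 0, failures)
```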

Multi-perspective critique: Run multiple critique passes focusing on different aspects (layout, content accuracy, aesthetics) and aggregate.

Human-in-the-loop: Allow users to provide critique that feeds into the refinement loop, combining automated and human feedback.

Reference Dataset Improvements

Current Limitation

The reference set contains only 13 curated examples, versus the 292 used in the paper. This limited diversity affects output quality across different diagram types.

Research Directions

Automated curation pipeline: Build robust tooling to automatically extract methodology sections and diagrams from papers at scale.

Quality scoring: Develop metrics to automatically assess reference example quality, enabling larger-scale curation with quality control.

Synthetic augmentation: Generate synthetic reference examples to fill gaps in underrepresented categories.

Community curation: Build tooling for distributed community curation with quality review workflows.

Evaluation Improvements

Current Limitation

VLM-as-Judge evaluation correlates with human preferences but isn't a perfect proxy.

Research Directions

Human correlation studies: Conduct formal studies measuring correlation between VLM-Judge scores and human rankings.

Additional metrics: Add automated metrics for specific failure modes (text legibility, component completeness, connection accuracy).

Benchmark dataset: Create a public benchmark with human-annotated ground truth for standardized evaluation.

Architectural Changes

Parallel Agent Execution

Currently, agents execute sequentially. For some tasks (like generating multiple plan candidates), parallel execution could improve both speed and quality.

Adaptive Iteration Count

Instead of fixed N iterations, dynamically decide when to stop based on critique scores and improvement rate.
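A sketch of such a stopping rule, combining a target score with an improvement-plateau check; all thresholds here are illustrative assumptions:

```python
def should_stop(scores, target=9.0, min_delta=0.2, patience=2):
    """Decide whether to end the refinement loop.

    scores: critique scores so far, oldest first. Stop when the latest
    score reaches the target, or when the last `patience` iterations
    each improved by less than `min_delta` (diminishing returns)."""
    if not scores:
        return False
    if scores[-1] >= target:
        return True
    if len(scores) <= patience:
        return False
    recent = scores[-(patience + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    return all(d < min_delta for d in deltas)
```

This trades a fixed iteration budget for one that spends compute only while the Critic still sees meaningful improvement.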

Multi-modal Context

Allow users to provide reference images (not just text) that influence the output style and structure.

Edit-based Refinement

Instead of regenerating the entire image each iteration, make targeted edits to specific regions identified by the Critic.

Contributing

If you're interested in working on any of these directions:

  1. Open a Discussion to coordinate with others
  2. For significant architectural changes, propose the design before implementing
  3. Include evaluation results showing improvement over the baseline

We're especially interested in contributions that include:

  • Quantitative comparisons on a held-out test set
  • Ablation studies showing which components matter
  • User studies for subjective quality improvements
