Add GTN training agent to ChatGXY #22097

Draft
dannon wants to merge 22 commits into galaxyproject:dev from dannon:agent-based-ai-gtn

dannon (Member) commented Mar 13, 2026

Summary

Adds a GTN training agent so ChatGXY can answer training questions using real content from the Galaxy Training Network rather than generic LLM responses.

The agent is backed by a SQLite FTS5 database built from the GTN repository (~400 tutorials and FAQs). The database is hosted on depot.galaxyproject.org and downloaded automatically on first use — nothing is bundled in the repo. A build_database.py script is included to rebuild it from a fresh GTN clone when content changes.

When a user asks something like "How do I do RNA-seq analysis?", the router recognizes it as a training question and hands off to the GTN agent. The agent searches the database, reads the 1-2 most relevant tutorials, and synthesizes a step-by-step answer with links back to the full tutorials on the GTN site. If the database can't be fetched or is corrupt, the agent disables itself gracefully rather than crashing.
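
The graceful-disable behavior described above could be sketched roughly as follows. This is not the PR's code: `open_or_disable` is a hypothetical helper, and the real agent may validate more than a readable schema. The idea is only that a missing or corrupt file yields `None` instead of an exception, so the caller can switch the agent off.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def open_or_disable(db_path: Path) -> Optional[sqlite3.Connection]:
    """Return a connection if db_path is a readable SQLite file, else None.

    Hypothetical helper: the caller treats None as "agent disabled".
    """
    if not db_path.is_file():
        # Nothing was downloaded, so there is nothing to search.
        return None
    conn = sqlite3.connect(str(db_path))
    try:
        # Reading the schema forces SQLite to parse the file header,
        # which fails on a truncated or corrupt download.
        conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
    except sqlite3.DatabaseError:
        conn.close()
        return None
    return conn
```

A caller would check the return value once at startup and skip registering the agent when it is `None`.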

Draft status

Still working on two things:

  • Token usage — certain queries cause the agent to fetch too much tutorial content, inflating context. I've tightened defaults and prompt guidance but want to benchmark more before finalizing.
  • Database delivery — the download-on-first-use mechanism works but I want to flesh out the versioning and update story (dated files on depot with symlinks, automated rebuild pipeline, etc.).

SQLite FTS5 database for Galaxy Training Network tutorials and FAQs.
Finds relevant Galaxy Training Network tutorials using SQLite FTS5 search.
- Add import re at module level
- Remove unused SearchResult import
- Use GTNSearchDB | None type annotation
- Replace bare except Exception with specific FileNotFoundError/OSError
- Fix unnecessary f-string prefix

The router now routes analysis workflow and tutorial questions to the
GTN training agent instead of answering them directly. This lets the
GTN agent use its tutorial database to find relevant training materials.

Updated the ChatGXY component to show response metadata on the right
side of the footer: which agent handled the query, the model used, and
token count. Also fixed the router prompt to use natural language
instead of explicit function names, which were causing the model to
output the function name as text instead of calling it.

Added helper methods to BaseGalaxyAgent (_build_metadata and _build_response)
that ensure consistent metadata structure across all agent responses. Every
response now includes model name, method, and token usage when available.

Also formalized the handoff pattern in the router with _serialize_handoff(),
and added TokenUsage and HandoffInfo schema models. Agent-specific data is
now available both at the top level (backwards compat) and namespaced under
agent_data for structured access.

Two fixes for the GTN training agent:

1. Suggestions now link to specific tutorials instead of the generic GTN
   homepage. When parsing simple text responses, we look up mentioned
   tutorial names in the GTN database to get their actual URLs.

2. Added normalize_llm_text() to handle literal \n strings in LLM output,
   which was causing wonky formatting in the UI.
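
For illustration, the literal `\n` handling in fix 2 might look something like this. The commit does not show the body of normalize_llm_text(), so `unescape_newlines` below is a hypothetical stand-in that covers only the newline case mentioned in the message.

```python
def unescape_newlines(text: str) -> str:
    """Turn literal backslash-n sequences into real newlines.

    Hypothetical stand-in for the normalization described in the commit.
    """
    # Handle the two-token "\r\n" form first so no stray literal "\r"
    # is left behind once "\n" is replaced.
    text = text.replace("\\r\\n", "\n")
    return text.replace("\\n", "\n")
```

Text that already contains real newlines passes through unchanged, since only the two-character backslash sequences are rewritten.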
…trip verbosity

Context managers for all SQLite connections in build_database.py to prevent
leaks on exceptions. Narrowed bare except Exception to specific types.
Switched FTS5 tables to content=/content_rowid= so rowid alignment is
guaranteed by SQLite rather than assumed. Added re.escape for regex safety
in extract_section, deduplicated version into DB_VERSION constant.
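
As a rough illustration of the content=/content_rowid= change, the sketch below builds an external-content FTS5 index over a regular table. The table and column names (`tutorials`, `tutorials_fts`, `title`, `body`) are assumptions made for the example, not the schema in build_database.py, and it needs an SQLite build with FTS5 enabled.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tutorials (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body TEXT NOT NULL
);
-- External-content index: FTS5 stores only the index and reads column
-- values from tutorials, matching rows by tutorials.id.
CREATE VIRTUAL TABLE tutorials_fts USING fts5(
    title, body, content='tutorials', content_rowid='id'
);
"""


def build_index(rows):
    """Create an in-memory database and index the given (title, body) rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO tutorials (title, body) VALUES (?, ?)", rows)
        # 'rebuild' repopulates the index from the content table, so the
        # FTS rowids follow tutorials.id by construction.
        conn.execute("INSERT INTO tutorials_fts(tutorials_fts) VALUES ('rebuild')")
    return conn


def search(conn, query, limit=5):
    """Return (rowid, title) pairs for a full-text query, best match first."""
    cur = conn.execute(
        "SELECT rowid, title FROM tutorials_fts "
        "WHERE tutorials_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()
```

Because the index is rebuilt from the content table in one statement, there is no separate insert order to keep in sync.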

GTNSearchDB now downloads the database from a configurable URL when the
local file is missing, so the 25MB .db no longer needs to live in git.
Removed it from tracking and added .gitignore entry.
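
A download-when-missing step along these lines could be sketched as below. `ensure_database` and its parameters are hypothetical names, not GTNSearchDB's API. The sketch writes to a temporary file in the same directory and renames it into place so an interrupted download never leaves a partial .db at the final path.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def ensure_database(db_path: Path, download_url: str, timeout: float = 60.0) -> Path:
    """Download the database to db_path unless a file is already there."""
    if db_path.is_file():
        return db_path
    db_path.parent.mkdir(parents=True, exist_ok=True)
    # Temp file in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(db_path.parent), suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(
            download_url, timeout=timeout
        ) as resp:
            shutil.copyfileobj(resp, tmp)
        os.replace(tmp_name, db_path)  # atomic rename into place
    except BaseException:
        # Clean up the partial file, then let the caller see the error.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return db_path
```

The caller decides what a failure means; in the agent's case that would presumably be disabling itself rather than propagating the error.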

Consolidated dead FileNotFoundError/OSError branches in gtn_training.py,
removed redundant safety checks, replaced IMPORTANT directive docstrings
with concise descriptions throughout.

The YAML parser was treating quoted empty strings (e.g. zenodo_link: "")
as list headers because quote stripping happened before the empty-value
check. Now tracks whether quotes were present so `key: ""` produces an
empty string while `key:` followed by `- items` still creates a list.

Also coerces hands_on to bool (some tutorials use "external") and widens
per-row exception handling to catch ValueError/TypeError so a single bad
tutorial doesn't kill the whole build.
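
The quoted-empty-string fix can be illustrated with a small front-matter parser. This is a simplified sketch written for the example, not the parser in build_database.py: it handles only flat `key: value` lines and `- item` lists, and `parse_simple_yaml` is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def parse_simple_yaml(lines: List[str]) -> Dict[str, Any]:
    """Parse flat 'key: value' lines and '- item' lists into a dict."""
    result: Dict[str, Any] = {}
    list_key: Optional[str] = None
    for raw in lines:
        stripped = raw.strip()
        if not stripped:
            continue
        if stripped.startswith("- ") and list_key is not None:
            result[list_key].append(stripped[2:].strip().strip("\"'"))
            continue
        if ":" not in stripped:
            continue
        key, _, value = stripped.partition(":")
        key, value = key.strip(), value.strip()
        # Record whether quotes were present BEFORE stripping them, so
        # key: "" is distinguishable from a bare key: line.
        was_quoted = len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'"
        if was_quoted:
            value = value[1:-1]
        if value == "" and not was_quoted:
            result[key] = []  # bare key: starts a list
            list_key = key
        else:
            result[key] = value
            list_key = None
    return result
```

With this ordering, `zenodo_link: ""` yields an empty string, while `tags:` followed by `- items` still yields a list.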
…content fetches

SearchResult.to_dict() now returns only the 7 fields the LLM needs (title, topic, tutorial, url, difficulty, time_estimation, snippet) instead of all 12. Default search limit drops from 10 to 5, tutorial content cap from 2000 to 1500 chars. The system prompt now explicitly tells the agent to fetch content for only 1-2 tutorials instead of all search results.

The FTS5 snippet() function wraps matches in <mark> tags which would show
as literal text in the chat UI. Strip them in to_dict() at the serialization
boundary. Also remove GTNSearchRequest which was defined but never used.
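
The trimmed to_dict() with tag stripping might look roughly like this. The seven field names come from the commit message above; the dataclass layout and the extra `rank` field are assumptions made for the sketch, since the PR's full 12-field model is not shown here.

```python
import re
from dataclasses import dataclass

_MARK_TAG = re.compile(r"</?mark>")


@dataclass
class SearchResult:
    title: str
    topic: str
    tutorial: str
    url: str
    difficulty: str
    time_estimation: str
    snippet: str
    rank: float = 0.0  # hypothetical internal field, not sent to the LLM

    def to_dict(self) -> dict:
        """Serialize only the fields the LLM needs, with highlight tags removed."""
        return {
            "title": self.title,
            "topic": self.topic,
            "tutorial": self.tutorial,
            "url": self.url,
            "difficulty": self.difficulty,
            "time_estimation": self.time_estimation,
            # Strip <mark>...</mark> here, at the serialization boundary,
            # so the stored snippet keeps its highlights for other uses.
            "snippet": _MARK_TAG.sub("", self.snippet),
        }
```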
def __init__(self, db_path: Optional[str] = None, download_url: Optional[str] = None):
    if db_path is None:
        current_dir = Path(__file__).parent
        self.db_path = current_dir / "data" / "gtn_search.db"
Member

This should be configurable, so that admins can put it in a mutable-data directory.
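
One way the suggestion could be addressed is a small resolution step with an explicit precedence order. `resolve_db_path` and the `data_dir` parameter are hypothetical: the comment does not say which Galaxy config option would supply the mutable-data directory, so that wiring is left out.

```python
from pathlib import Path
from typing import Optional

DEFAULT_DB_NAME = "gtn_search.db"


def resolve_db_path(db_path: Optional[str] = None, data_dir: Optional[str] = None) -> Path:
    """Pick the database location: explicit path, then data dir, then package dir."""
    if db_path is not None:
        return Path(db_path)
    if data_dir is not None:
        # data_dir stands in for an admin-configured mutable-data directory.
        return Path(data_dir) / DEFAULT_DB_NAME
    # Fallback mirrors the snippet under review: next to the module itself.
    return Path(__file__).parent / "data" / DEFAULT_DB_NAME
```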


def _get_connection(self) -> sqlite3.Connection:
    """Get a database connection."""
    conn = sqlite3.connect(str(self.db_path))
Member

Should we set an isolation_level here?
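
For context on the question: Python's sqlite3.connect() accepts an isolation_level argument, and passing None puts the connection in autocommit mode so the module never opens a transaction implicitly. Whether that is the right setting for GTNSearchDB is the author's call. The sketch below only shows what the option looks like for a search database that is never written at query time; `open_readonly` is a hypothetical helper, and the read-only URI mode is an extra assumption.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file read-only, in autocommit mode."""
    # mode=ro makes SQLite itself reject writes to the file.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    # isolation_level=None: the sqlite3 module issues no implicit BEGIN.
    conn = sqlite3.connect(uri, uri=True, isolation_level=None)
    conn.row_factory = sqlite3.Row
    return conn
```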

2 participants