Add GTN training agent to ChatGXY #22097
Draft
dannon wants to merge 22 commits into galaxyproject:dev from
SQLite FTS5 database for Galaxy Training Network tutorials and FAQs.
Finds relevant Galaxy Training Network tutorials using SQLite FTS5 search.
- Add import re at module level
- Remove unused SearchResult import
- Use GTNSearchDB | None type annotation
- Replace bare except Exception with specific FileNotFoundError/OSError
- Fix unnecessary f-string prefix
The router now routes analysis workflow and tutorial questions to the GTN training agent instead of answering them directly. This lets the GTN agent use its tutorial database to find relevant training materials.
Updated the ChatGXY component to show response metadata on the right side of the footer: which agent handled the query, the model used, and token count. Also fixed the router prompt to use natural language instead of explicit function names, which were causing the model to output the function name as text instead of calling it.
Added helper methods to BaseGalaxyAgent (_build_metadata and _build_response) that ensure consistent metadata structure across all agent responses. Every response now includes model name, method, and token usage when available. Also formalized the handoff pattern in the router with _serialize_handoff(), and added TokenUsage and HandoffInfo schema models. Agent-specific data is now available both at the top level (backwards compat) and namespaced under agent_data for structured access.
Two fixes for the GTN training agent:
1. Suggestions now link to specific tutorials instead of the generic GTN homepage. When parsing simple text responses, we look up mentioned tutorial names in the GTN database to get their actual URLs.
2. Added normalize_llm_text() to handle literal \n strings in LLM output, which was causing wonky formatting in the UI.
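The second fix can be illustrated with a minimal sketch. The PR names `normalize_llm_text()` but does not show its body here, so the implementation below is an assumption: it only converts the two-character sequences `\r\n`, `\n`, and `\t` into real whitespace.

```python
def normalize_llm_text(text: str) -> str:
    """Convert literal backslash escapes in LLM output into real whitespace.

    Assumed behaviour, not the PR's actual code: only the escaped forms of
    CRLF, newline, and tab are handled.
    """
    return (
        text.replace("\\r\\n", "\n")  # escaped CRLF first, so it yields one newline
        .replace("\\n", "\n")         # escaped newline
        .replace("\\t", "\t")         # escaped tab
    )
```

A replacement this blunt would also rewrite a legitimate `\n` inside a quoted code sample, so a real implementation may need to skip code spans.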
…trip verbosity

Context managers for all SQLite connections in build_database.py to prevent leaks on exceptions. Narrowed bare except Exception to specific types. Switched FTS5 tables to content=/content_rowid= so rowid alignment is guaranteed by SQLite rather than assumed. Added re.escape for regex safety in extract_section, deduplicated version into DB_VERSION constant.

GTNSearchDB now downloads the database from a configurable URL when the local file is missing, so the 25MB .db no longer needs to live in git. Removed it from tracking and added .gitignore entry.

Consolidated dead FileNotFoundError/OSError branches in gtn_training.py, removed redundant safety checks, replaced IMPORTANT directive docstrings with concise descriptions throughout.
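The context-manager and external-content changes described above can be sketched as follows. The table and column names (`tutorials`, `tutorials_fts`, `title`, `body`) are hypothetical and are not taken from the PR's schema; the sketch assumes the Python build ships SQLite with FTS5 enabled.

```python
import sqlite3
from contextlib import closing


def build_index(db_path, rows):
    """Create a content table plus an external-content FTS5 index over it."""
    # closing() closes the connection even if a statement raises;
    # "with conn" commits on success and rolls back on error.
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:
            conn.execute(
                "CREATE TABLE tutorials (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
            )
            # content= / content_rowid= tie each FTS rowid to tutorials.id.
            conn.execute(
                "CREATE VIRTUAL TABLE tutorials_fts USING fts5("
                "title, body, content='tutorials', content_rowid='id')"
            )
            conn.executemany(
                "INSERT INTO tutorials (id, title, body) VALUES (?, ?, ?)", rows
            )
            # External-content indexes are not filled automatically;
            # 'rebuild' re-reads every row from the content table.
            conn.execute("INSERT INTO tutorials_fts(tutorials_fts) VALUES ('rebuild')")


def search(db_path, query, limit=5):
    """Return (rowid, title) pairs for the best matches."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(
            "SELECT rowid, title FROM tutorials_fts "
            "WHERE tutorials_fts MATCH ? ORDER BY rank LIMIT ?",
            (query, limit),
        ).fetchall()
```

Because the index stores no copy of the text, the `title` returned by `search()` is read from `tutorials` through the shared rowid, which is why the alignment no longer has to be assumed.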
The YAML parser was treating quoted empty strings (e.g. zenodo_link: "") as list headers because quote stripping happened before the empty-value check. Now tracks whether quotes were present so `key: ""` produces an empty string while `key:` followed by `- items` still creates a list. Also coerces hands_on to bool (some tutorials use "external") and widens per-row exception handling to catch ValueError/TypeError so a single bad tutorial doesn't kill the whole build.
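The quoting fix can be sketched roughly as below. `parse_value` is a hypothetical helper, not a function from `build_database.py`; it only shows the ordering that matters, which is to record whether quotes were present before testing for an empty value.

```python
def parse_value(raw: str):
    """Classify the text that follows 'key:' on a YAML front-matter line.

    Returns (value, is_list_header). Hypothetical helper for illustration.
    """
    stripped = raw.strip()
    # Remember that the value was quoted *before* looking at its content.
    was_quoted = (
        len(stripped) >= 2
        and stripped[0] == stripped[-1]
        and stripped[0] in "\"'"
    )
    if was_quoted:
        # key: ""  -> an explicit (possibly empty) string, never a list header
        return stripped[1:-1], False
    if stripped == "":
        # key:     -> nothing on the line, so "- item" lines may follow
        return None, True
    return stripped, False
```

With this ordering, `zenodo_link: ""` yields an empty string, while a bare `key:` is still treated as the start of a list.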
…content fetches

SearchResult.to_dict() now returns only the 7 fields the LLM needs (title, topic, tutorial, url, difficulty, time_estimation, snippet) instead of all 12. Default search limit drops from 10 to 5, tutorial content cap from 2000 to 1500 chars. The system prompt now explicitly tells the agent to fetch content for only 1-2 tutorials instead of all search results.
The FTS5 snippet() function wraps matches in <mark> tags which would show as literal text in the chat UI. Strip them in to_dict() at the serialization boundary. Also remove GTNSearchRequest which was defined but never used.
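A minimal sketch of stripping the highlight tags at the serialization boundary. The helper name `strip_mark_tags` is an assumption, and the PR may implement this differently inside `to_dict()`.

```python
import re

# Matches only the opening and closing <mark> tags, not the text between them.
_MARK_TAG = re.compile(r"</?mark>")


def strip_mark_tags(snippet: str) -> str:
    """Remove <mark>/</mark> wrappers while keeping the highlighted text."""
    return _MARK_TAG.sub("", snippet)
```

Doing this in `to_dict()` rather than in the query keeps the raw snippet available to any caller that does want the highlighting.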
bgruening
reviewed
Mar 15, 2026
    def __init__(self, db_path: Optional[str] = None, download_url: Optional[str] = None):
        if db_path is None:
            current_dir = Path(__file__).parent
            self.db_path = current_dir / "data" / "gtn_search.db"
Member
This should be configurable, so that admins can put it in a mutable-data directory.
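One way the suggestion could look, as a sketch only. The function `resolve_db_path` and the `data_dir` and `package_dir` parameters are hypothetical; how Galaxy would actually expose such a setting to admins is not shown in this PR excerpt.

```python
from pathlib import Path
from typing import Optional

DEFAULT_DB_NAME = "gtn_search.db"


def resolve_db_path(
    db_path: Optional[str],
    data_dir: Optional[str],
    package_dir: Path,
) -> Path:
    """Pick the database location in order of precedence.

    1. an explicit db_path,
    2. an admin-configured mutable data directory (hypothetical setting),
    3. the package-relative default used in the current diff.
    """
    if db_path:
        return Path(db_path)
    if data_dir:
        return Path(data_dir) / DEFAULT_DB_NAME
    return package_dir / "data" / DEFAULT_DB_NAME
```

In `__init__` the last argument would be `Path(__file__).parent`, so the existing behaviour is preserved when nothing is configured.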
bgruening
reviewed
Mar 15, 2026
    def _get_connection(self) -> sqlite3.Connection:
        """Get a database connection."""
        conn = sqlite3.connect(str(self.db_path))
Member
Should we set an isolation_level here?
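For context, a sketch of one possible answer, assuming the search database is only ever read at runtime. `open_readonly` is a hypothetical helper, not code from the PR. In Python's `sqlite3`, `isolation_level=None` puts the connection in autocommit mode, and a `file:` URI with `mode=ro` opens the file read-only.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path) -> sqlite3.Connection:
    """Open the database read-only, with no implicit transactions."""
    # as_uri() yields an absolute file:// URI with special characters escaped.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    # isolation_level=None: sqlite3 never issues an implicit BEGIN.
    return sqlite3.connect(uri, uri=True, isolation_level=None)
```

With a read-only handle, an accidental write raises `sqlite3.OperationalError` instead of silently modifying the downloaded file.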
Summary
Adds a GTN training agent so ChatGXY can answer training questions using real content from the Galaxy Training Network rather than generic LLM responses.
The agent is backed by a SQLite FTS5 database built from the GTN repository (~400 tutorials and FAQs). The database is hosted on depot.galaxyproject.org and downloaded automatically on first use; nothing is bundled in the repo. A `build_database.py` script is included to rebuild it from a fresh GTN clone when content changes.

When a user asks something like "How do I do RNA-seq analysis?", the router recognizes it as a training question and hands off to the GTN agent. The agent searches the database, reads the 1-2 most relevant tutorials, and synthesizes a step-by-step answer with links back to the full tutorials on the GTN site. If the database can't be fetched or is corrupt, the agent disables itself gracefully rather than crashing.
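The download-on-first-use and graceful-disable behaviour can be sketched as below. `ensure_database` is a hypothetical function, not the PR's `GTNSearchDB` code, and using `PRAGMA integrity_check` as the corruption test is an assumption.

```python
import sqlite3
import urllib.request
from contextlib import closing
from pathlib import Path


def ensure_database(db_path: Path, download_url: str) -> bool:
    """Return True if a usable database exists at db_path, else False.

    A False result lets the caller disable the agent instead of raising.
    """
    try:
        if not db_path.exists():
            db_path.parent.mkdir(parents=True, exist_ok=True)
            # Download to a temporary name so a partial file never looks valid.
            tmp = db_path.with_suffix(".part")
            urllib.request.urlretrieve(download_url, str(tmp))
            tmp.replace(db_path)
        with closing(sqlite3.connect(str(db_path))) as conn:
            return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
    except (OSError, ValueError, sqlite3.DatabaseError):
        # OSError covers network and filesystem failures (URLError subclasses it),
        # ValueError covers malformed URLs, DatabaseError covers corrupt files.
        return False
```

A caller might store the result in a flag such as `self.available` and have the router skip the GTN agent when it is False.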
Draft status
Still working on two things: