Commit 901bb31

majdyz and claude authored
feat(backend): parameterize activity status generation with customizable prompts (#11407)
## Summary

Implement comprehensive parameterization of the activity status generation system to enable custom prompts for the admin analytics dashboard.

## Changes Made

### Core Function Enhancement (`activity_status_generator.py`)

- **Extract hardcoded prompts to constants**: `DEFAULT_SYSTEM_PROMPT` and `DEFAULT_USER_PROMPT`
- **Add prompt parameters**: `system_prompt` and `user_prompt`, with defaults to maintain backward compatibility
- **Template substitution system**: the user prompt supports `{{GRAPH_NAME}}` and `{{EXECUTION_DATA}}` placeholders
- **Skip-existing flag**: the `skip_existing` parameter allows admins to force regeneration of existing data
- **Manager compatibility**: all existing calls continue to work with default parameters

### Admin API Enhancement (`execution_analytics_routes.py`)

- **Custom prompt fields**: optional `system_prompt` and `user_prompt` fields in `ExecutionAnalyticsRequest`
- **Skip-existing control**: `skip_existing` boolean flag for the admin regeneration option
- **Template documentation**: the placeholder system is documented in the field descriptions
- **Backward compatibility**: all existing API calls work unchanged

### Template System Design

- **Simple placeholder replacement**: `{{GRAPH_NAME}}` → actual graph name, `{{EXECUTION_DATA}}` → JSON execution data
- **No dependencies**: uses plain `str.replace()` for maximum compatibility
- **JSON safety**: execution data is serialized as indented JSON
- **Validation tested**: template substitution verified to work correctly

## Key Features

### For Regular Users (Manager Integration)

- **No changes required**: existing `manager.py` calls work unchanged
- **Default behavior preserved**: same prompts and logic as before
- **Feature flag compatibility**: LaunchDarkly integration unchanged

### For Admin Analytics Dashboard

- **Custom system prompts**: admins can override the AI evaluation criteria
- **Custom user prompts**: admins can modify the analysis instructions with execution data templates
- **Force regeneration**: `skip_existing=False` allows reprocessing existing executions with new prompts
- **Complete model list**: access to all LLM models from `llm.py` (70+ models including GPT, Claude, Gemini, etc.)

## Technical Validation

- ✅ Template substitution tested and working
- ✅ Default behavior preserved for existing code
- ✅ Admin API parameter validation working
- ✅ All imports and function signatures correct
- ✅ Backward compatibility maintained

## Use Cases Enabled

- **A/B testing**: compare different prompt strategies on the same execution data
- **Custom evaluation**: tailor success criteria for specific graph types
- **Prompt optimization**: iterate on prompt design based on admin feedback
- **Bulk reprocessing**: regenerate activity status with improved prompts

## Testing

- Template substitution functionality verified
- Function signatures and imports validated
- Code formatting and linting passed
- Backward compatibility confirmed

## Breaking Changes

None. All existing functionality is preserved with default parameters.

## Related Issues

Resolves the requirement to expose prompt customization on the frontend execution analytics dashboard.

---------

Co-authored-by: Claude <[email protected]>
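The template substitution described above is plain string replacement, not a template engine. A minimal standalone sketch (the helper name `render_user_prompt` is hypothetical; the real code inlines the same two `str.replace` calls):

```python
import json


def render_user_prompt(user_prompt: str, graph_name: str, execution_data: dict) -> str:
    """Substitute {{GRAPH_NAME}} and {{EXECUTION_DATA}} placeholders.

    Execution data is serialized as indented JSON, mirroring the
    json.dumps(..., indent=2) call in the diff below.
    """
    execution_data_json = json.dumps(execution_data, indent=2)
    return user_prompt.replace("{{GRAPH_NAME}}", graph_name).replace(
        "{{EXECUTION_DATA}}", execution_data_json
    )


template = "A user ran '{{GRAPH_NAME}}'. Execution data:\n\n{{EXECUTION_DATA}}"
rendered = render_user_prompt(template, "Blog Writer", {"status": "COMPLETED"})
```

Because the substitution is literal, a prompt without placeholders passes through unchanged, which is what keeps the default behavior backward compatible.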
1 parent 9438817 · commit 901bb31

5 files changed (+535, −123 lines)

autogpt_platform/backend/backend/data/execution.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -460,6 +460,7 @@ def to_node_execution_entry(
 async def get_graph_executions(
     graph_exec_id: Optional[str] = None,
     graph_id: Optional[str] = None,
+    graph_version: Optional[int] = None,
     user_id: Optional[str] = None,
     statuses: Optional[list[ExecutionStatus]] = None,
     created_time_gte: Optional[datetime] = None,
@@ -476,6 +477,8 @@ async def get_graph_executions(
         where_filter["userId"] = user_id
     if graph_id:
         where_filter["agentGraphId"] = graph_id
+    if graph_version is not None:
+        where_filter["agentGraphVersion"] = graph_version
     if created_time_gte or created_time_lte:
         where_filter["createdAt"] = {
             "gte": created_time_gte or datetime.min.replace(tzinfo=timezone.utc),
```
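Note the `is not None` check in the new filter: it ensures that `graph_version=0` is still applied, unlike the truthiness check used for `graph_id`. A simplified synchronous sketch of the filter construction (standalone function for illustration; the real `get_graph_executions` is async and supports more filters):

```python
from datetime import datetime, timezone
from typing import Optional


def build_where_filter(
    graph_id: Optional[str] = None,
    graph_version: Optional[int] = None,
    created_time_gte: Optional[datetime] = None,
) -> dict:
    """Sketch of the Prisma-style where-filter construction from the diff."""
    where_filter: dict = {}
    if graph_id:
        where_filter["agentGraphId"] = graph_id
    # `is not None`, not truthiness: version 0 must still filter
    if graph_version is not None:
        where_filter["agentGraphVersion"] = graph_version
    if created_time_gte:
        # the real function also fills in default bounds for the range
        where_filter["createdAt"] = {
            "gte": created_time_gte or datetime.min.replace(tzinfo=timezone.utc)
        }
    return where_filter
```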

autogpt_platform/backend/backend/executor/activity_status_generator.py

Lines changed: 128 additions & 82 deletions
```diff
@@ -27,6 +27,101 @@
 logger = logging.getLogger(__name__)
 
 
+# Default system prompt template for activity status generation
+DEFAULT_SYSTEM_PROMPT = """You are an AI assistant analyzing what an agent execution accomplished and whether it worked correctly.
+You need to provide both a user-friendly summary AND a correctness assessment.
+
+FOR THE ACTIVITY STATUS:
+- Write from the user's perspective about what they accomplished, NOT about technical execution details
+- Focus on the ACTUAL TASK the user wanted done, not the internal workflow steps
+- Avoid technical terms like 'workflow', 'execution', 'components', 'nodes', 'processing', etc.
+- Keep it to 3 sentences maximum. Be conversational and human-friendly
+
+FOR THE CORRECTNESS SCORE:
+- Provide a score from 0.0 to 1.0 indicating how well the execution achieved its intended purpose
+- Use this scoring guide:
+  0.0-0.2: Failure - The result clearly did not meet the task requirements
+  0.2-0.4: Poor - Major issues; only small parts of the goal were achieved
+  0.4-0.6: Partial Success - Some objectives met, but with noticeable gaps or inaccuracies
+  0.6-0.8: Mostly Successful - Largely achieved the intended outcome, with minor flaws
+  0.8-1.0: Success - Fully met or exceeded the task requirements
+- Base the score on actual outputs produced, not just technical completion
+
+UNDERSTAND THE INTENDED PURPOSE:
+- FIRST: Read the graph description carefully to understand what the user wanted to accomplish
+- The graph name and description tell you the main goal/intention of this automation
+- Use this intended purpose as your PRIMARY criteria for success/failure evaluation
+- Ask yourself: 'Did this execution actually accomplish what the graph was designed to do?'
+
+CRITICAL OUTPUT ANALYSIS:
+- Check if blocks that should produce user-facing results actually produced outputs
+- Blocks with names containing 'Output', 'Post', 'Create', 'Send', 'Publish', 'Generate' are usually meant to produce final results
+- If these critical blocks have NO outputs (empty recent_outputs), the task likely FAILED even if status shows 'completed'
+- Sub-agents (AgentExecutorBlock) that produce no outputs usually indicate failed sub-tasks
+- Most importantly: Does the execution result match what the graph description promised to deliver?
+
+SUCCESS EVALUATION BASED ON INTENTION:
+- If the graph is meant to 'create blog posts' → check if blog content was actually created
+- If the graph is meant to 'send emails' → check if emails were actually sent
+- If the graph is meant to 'analyze data' → check if analysis results were produced
+- If the graph is meant to 'generate reports' → check if reports were generated
+- Technical completion ≠ goal achievement. Focus on whether the USER'S INTENDED OUTCOME was delivered
+
+IMPORTANT: Be HONEST about what actually happened:
+- If the input was invalid/nonsensical, say so directly
+- If the task failed, explain what went wrong in simple terms
+- If errors occurred, focus on what the user needs to know
+- Only claim success if the INTENDED PURPOSE was genuinely accomplished AND produced expected outputs
+- Don't sugar-coat failures or present them as helpful feedback
+- ESPECIALLY: If the graph's main purpose wasn't achieved, this is a failure regardless of 'completed' status
+
+Understanding Errors:
+- Node errors: Individual steps may fail but the overall task might still complete (e.g., one data source fails but others work)
+- Graph error (in overall_status.graph_error): This means the entire execution failed and nothing was accomplished
+- Missing outputs from critical blocks: Even if no errors, this means the task failed to produce expected results
+- Focus on whether the graph's intended purpose was fulfilled, not whether technical steps completed"""
+
+# Default user prompt template for activity status generation
+DEFAULT_USER_PROMPT = """A user ran '{{GRAPH_NAME}}' to accomplish something. Based on this execution data,
+provide both an activity summary and correctness assessment:
+
+{{EXECUTION_DATA}}
+
+ANALYSIS CHECKLIST:
+1. READ graph_info.description FIRST - this tells you what the user intended to accomplish
+2. Check overall_status.graph_error - if present, the entire execution failed
+3. Look for nodes with 'Output', 'Post', 'Create', 'Send', 'Publish', 'Generate' in their block_name
+4. Check if these critical blocks have empty recent_outputs arrays - this indicates failure
+5. Look for AgentExecutorBlock (sub-agents) with no outputs - this suggests sub-task failures
+6. Count how many nodes produced outputs vs total nodes - low ratio suggests problems
+7. MOST IMPORTANT: Does the execution outcome match what graph_info.description promised?
+
+INTENTION-BASED EVALUATION:
+- If description mentions 'blog writing' → did it create blog content?
+- If description mentions 'email automation' → were emails actually sent?
+- If description mentions 'data analysis' → were analysis results produced?
+- If description mentions 'content generation' → was content actually generated?
+- If description mentions 'social media posting' → were posts actually made?
+- Match the outputs to the stated intention, not just technical completion
+
+PROVIDE:
+activity_status: 1-3 sentences about what the user accomplished, such as:
+- 'I analyzed your resume and provided detailed feedback for the IT industry.'
+- 'I couldn't complete the task because critical steps failed to produce any results.'
+- 'I failed to generate the content you requested due to missing API access.'
+- 'I extracted key information from your documents and organized it into a summary.'
+- 'The task failed because the blog post creation step didn't produce any output.'
+
+correctness_score: A float score from 0.0 to 1.0 based on how well the intended purpose was achieved:
+- 0.0-0.2: Failure (didn't meet requirements)
+- 0.2-0.4: Poor (major issues, minimal achievement)
+- 0.4-0.6: Partial Success (some objectives met with gaps)
+- 0.6-0.8: Mostly Successful (largely achieved with minor flaws)
+- 0.8-1.0: Success (fully met or exceeded requirements)
+
+BE CRITICAL: If the graph's intended purpose (from description) wasn't achieved, use a low score (0.0-0.4) even if status is 'completed'."""
+
+
 class ErrorInfo(TypedDict):
     """Type definition for error information."""
 
@@ -93,6 +188,9 @@ async def generate_activity_status_for_execution(
     execution_status: ExecutionStatus | None = None,
     model_name: str = "gpt-4o-mini",
     skip_feature_flag: bool = False,
+    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
+    user_prompt: str = DEFAULT_USER_PROMPT,
+    skip_existing: bool = True,
 ) -> ActivityStatusResponse | None:
     """
     Generate an AI-based activity status summary and correctness assessment for a graph execution.
@@ -108,10 +206,15 @@ async def generate_activity_status_for_execution(
         db_client: Database client for fetching data
         user_id: User ID for LaunchDarkly feature flag evaluation
         execution_status: The overall execution status (COMPLETED, FAILED, TERMINATED)
+        model_name: AI model to use for generation (default: gpt-4o-mini)
+        skip_feature_flag: Whether to skip LaunchDarkly feature flag check
+        system_prompt: Custom system prompt template (default: DEFAULT_SYSTEM_PROMPT)
+        user_prompt: Custom user prompt template with placeholders (default: DEFAULT_USER_PROMPT)
+        skip_existing: Whether to skip if activity_status and correctness_score already exist
 
     Returns:
         AI-generated activity status response with activity_status and correctness_status,
-        or None if feature is disabled
+        or None if feature is disabled or skipped
     """
     # Check LaunchDarkly feature flag for AI activity status generation with full context support
     if not skip_feature_flag and not await is_feature_enabled(
@@ -120,6 +223,20 @@ async def generate_activity_status_for_execution(
         logger.debug("AI activity status generation is disabled via LaunchDarkly")
         return None
 
+    # Check if we should skip existing data (for admin regeneration option)
+    if (
+        skip_existing
+        and execution_stats.activity_status
+        and execution_stats.correctness_score is not None
+    ):
+        logger.debug(
+            f"Skipping activity status generation for {graph_exec_id}: already exists"
+        )
+        return {
+            "activity_status": execution_stats.activity_status,
+            "correctness_score": execution_stats.correctness_score,
+        }
+
     # Check if we have OpenAI API key
     try:
         settings = Settings()
@@ -157,94 +274,23 @@ async def generate_activity_status_for_execution(
         execution_status,
     )
 
+    # Prepare execution data as JSON for template substitution
+    execution_data_json = json.dumps(execution_data, indent=2)
+
+    # Perform template substitution for user prompt
+    user_prompt_content = user_prompt.replace("{{GRAPH_NAME}}", graph_name).replace(
+        "{{EXECUTION_DATA}}", execution_data_json
+    )
+
     # Prepare prompt for AI with structured output requirements
     prompt = [
         {
             "role": "system",
-            "content": (
-                "You are an AI assistant analyzing what an agent execution accomplished and whether it worked correctly. "
-                "You need to provide both a user-friendly summary AND a correctness assessment.\n\n"
-                "FOR THE ACTIVITY STATUS:\n"
-                "- Write from the user's perspective about what they accomplished, NOT about technical execution details\n"
-                "- Focus on the ACTUAL TASK the user wanted done, not the internal workflow steps\n"
-                "- Avoid technical terms like 'workflow', 'execution', 'components', 'nodes', 'processing', etc.\n"
-                "- Keep it to 3 sentences maximum. Be conversational and human-friendly\n\n"
-                "FOR THE CORRECTNESS SCORE:\n"
-                "- Provide a score from 0.0 to 1.0 indicating how well the execution achieved its intended purpose\n"
-                "- Use this scoring guide:\n"
-                "  0.0-0.2: Failure - The result clearly did not meet the task requirements\n"
-                "  0.2-0.4: Poor - Major issues; only small parts of the goal were achieved\n"
-                "  0.4-0.6: Partial Success - Some objectives met, but with noticeable gaps or inaccuracies\n"
-                "  0.6-0.8: Mostly Successful - Largely achieved the intended outcome, with minor flaws\n"
-                "  0.8-1.0: Success - Fully met or exceeded the task requirements\n"
-                "- Base the score on actual outputs produced, not just technical completion\n\n"
-                "UNDERSTAND THE INTENDED PURPOSE:\n"
-                "- FIRST: Read the graph description carefully to understand what the user wanted to accomplish\n"
-                "- The graph name and description tell you the main goal/intention of this automation\n"
-                "- Use this intended purpose as your PRIMARY criteria for success/failure evaluation\n"
-                "- Ask yourself: 'Did this execution actually accomplish what the graph was designed to do?'\n\n"
-                "CRITICAL OUTPUT ANALYSIS:\n"
-                "- Check if blocks that should produce user-facing results actually produced outputs\n"
-                "- Blocks with names containing 'Output', 'Post', 'Create', 'Send', 'Publish', 'Generate' are usually meant to produce final results\n"
-                "- If these critical blocks have NO outputs (empty recent_outputs), the task likely FAILED even if status shows 'completed'\n"
-                "- Sub-agents (AgentExecutorBlock) that produce no outputs usually indicate failed sub-tasks\n"
-                "- Most importantly: Does the execution result match what the graph description promised to deliver?\n\n"
-                "SUCCESS EVALUATION BASED ON INTENTION:\n"
-                "- If the graph is meant to 'create blog posts' → check if blog content was actually created\n"
-                "- If the graph is meant to 'send emails' → check if emails were actually sent\n"
-                "- If the graph is meant to 'analyze data' → check if analysis results were produced\n"
-                "- If the graph is meant to 'generate reports' → check if reports were generated\n"
-                "- Technical completion ≠ goal achievement. Focus on whether the USER'S INTENDED OUTCOME was delivered\n\n"
-                "IMPORTANT: Be HONEST about what actually happened:\n"
-                "- If the input was invalid/nonsensical, say so directly\n"
-                "- If the task failed, explain what went wrong in simple terms\n"
-                "- If errors occurred, focus on what the user needs to know\n"
-                "- Only claim success if the INTENDED PURPOSE was genuinely accomplished AND produced expected outputs\n"
-                "- Don't sugar-coat failures or present them as helpful feedback\n"
-                "- ESPECIALLY: If the graph's main purpose wasn't achieved, this is a failure regardless of 'completed' status\n\n"
-                "Understanding Errors:\n"
-                "- Node errors: Individual steps may fail but the overall task might still complete (e.g., one data source fails but others work)\n"
-                "- Graph error (in overall_status.graph_error): This means the entire execution failed and nothing was accomplished\n"
-                "- Missing outputs from critical blocks: Even if no errors, this means the task failed to produce expected results\n"
-                "- Focus on whether the graph's intended purpose was fulfilled, not whether technical steps completed"
-            ),
+            "content": system_prompt,
         },
         {
            "role": "user",
-            "content": (
-                f"A user ran '{graph_name}' to accomplish something. Based on this execution data, "
-                f"provide both an activity summary and correctness assessment:\n\n"
-                f"{json.dumps(execution_data, indent=2)}\n\n"
-                "ANALYSIS CHECKLIST:\n"
-                "1. READ graph_info.description FIRST - this tells you what the user intended to accomplish\n"
-                "2. Check overall_status.graph_error - if present, the entire execution failed\n"
-                "3. Look for nodes with 'Output', 'Post', 'Create', 'Send', 'Publish', 'Generate' in their block_name\n"
-                "4. Check if these critical blocks have empty recent_outputs arrays - this indicates failure\n"
-                "5. Look for AgentExecutorBlock (sub-agents) with no outputs - this suggests sub-task failures\n"
-                "6. Count how many nodes produced outputs vs total nodes - low ratio suggests problems\n"
-                "7. MOST IMPORTANT: Does the execution outcome match what graph_info.description promised?\n\n"
-                "INTENTION-BASED EVALUATION:\n"
-                "- If description mentions 'blog writing' → did it create blog content?\n"
-                "- If description mentions 'email automation' → were emails actually sent?\n"
-                "- If description mentions 'data analysis' → were analysis results produced?\n"
-                "- If description mentions 'content generation' → was content actually generated?\n"
-                "- If description mentions 'social media posting' → were posts actually made?\n"
-                "- Match the outputs to the stated intention, not just technical completion\n\n"
-                "PROVIDE:\n"
-                "activity_status: 1-3 sentences about what the user accomplished, such as:\n"
-                "- 'I analyzed your resume and provided detailed feedback for the IT industry.'\n"
-                "- 'I couldn't complete the task because critical steps failed to produce any results.'\n"
-                "- 'I failed to generate the content you requested due to missing API access.'\n"
-                "- 'I extracted key information from your documents and organized it into a summary.'\n"
-                "- 'The task failed because the blog post creation step didn't produce any output.'\n\n"
-                "correctness_score: A float score from 0.0 to 1.0 based on how well the intended purpose was achieved:\n"
-                "- 0.0-0.2: Failure (didn't meet requirements)\n"
-                "- 0.2-0.4: Poor (major issues, minimal achievement)\n"
-                "- 0.4-0.6: Partial Success (some objectives met with gaps)\n"
-                "- 0.6-0.8: Mostly Successful (largely achieved with minor flaws)\n"
-                "- 0.8-1.0: Success (fully met or exceeded requirements)\n\n"
-                "BE CRITICAL: If the graph's intended purpose (from description) wasn't achieved, use a low score (0.0-0.4) even if status is 'completed'."
-            ),
+            "content": user_prompt_content,
         },
     ]
```
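The `skip_existing` guard added above returns cached results instead of regenerating. A minimal standalone sketch of that guard (the `ExecutionStats` dataclass and `maybe_reuse_existing` name are hypothetical stand-ins for the real stats object and inline code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionStats:  # hypothetical stand-in for the real execution stats object
    activity_status: Optional[str] = None
    correctness_score: Optional[float] = None


def maybe_reuse_existing(stats: ExecutionStats, skip_existing: bool = True):
    """Return cached results when skip_existing is True and both fields exist.

    Mirrors the guard in the diff: `is not None` keeps a 0.0 score valid,
    and skip_existing=False (the admin regeneration path) always falls through.
    """
    if (
        skip_existing
        and stats.activity_status
        and stats.correctness_score is not None
    ):
        return {
            "activity_status": stats.activity_status,
            "correctness_score": stats.correctness_score,
        }
    return None  # caller proceeds with fresh generation
```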
