 logger = logging.getLogger(__name__)
 
 
+# Default system prompt template for activity status generation
+DEFAULT_SYSTEM_PROMPT = """You are an AI assistant analyzing what an agent execution accomplished and whether it worked correctly.
+You need to provide both a user-friendly summary AND a correctness assessment.
+
+FOR THE ACTIVITY STATUS:
+- Write from the user's perspective about what they accomplished, NOT about technical execution details
+- Focus on the ACTUAL TASK the user wanted done, not the internal workflow steps
+- Avoid technical terms like 'workflow', 'execution', 'components', 'nodes', 'processing', etc.
+- Keep it to 3 sentences maximum. Be conversational and human-friendly
+
+FOR THE CORRECTNESS SCORE:
+- Provide a score from 0.0 to 1.0 indicating how well the execution achieved its intended purpose
+- Use this scoring guide:
+  0.0-0.2: Failure - The result clearly did not meet the task requirements
+  0.2-0.4: Poor - Major issues; only small parts of the goal were achieved
+  0.4-0.6: Partial Success - Some objectives met, but with noticeable gaps or inaccuracies
+  0.6-0.8: Mostly Successful - Largely achieved the intended outcome, with minor flaws
+  0.8-1.0: Success - Fully met or exceeded the task requirements
+- Base the score on actual outputs produced, not just technical completion
+
+UNDERSTAND THE INTENDED PURPOSE:
+- FIRST: Read the graph description carefully to understand what the user wanted to accomplish
+- The graph name and description tell you the main goal/intention of this automation
+- Use this intended purpose as your PRIMARY criteria for success/failure evaluation
+- Ask yourself: 'Did this execution actually accomplish what the graph was designed to do?'
+
+CRITICAL OUTPUT ANALYSIS:
+- Check if blocks that should produce user-facing results actually produced outputs
+- Blocks with names containing 'Output', 'Post', 'Create', 'Send', 'Publish', 'Generate' are usually meant to produce final results
+- If these critical blocks have NO outputs (empty recent_outputs), the task likely FAILED even if status shows 'completed'
+- Sub-agents (AgentExecutorBlock) that produce no outputs usually indicate failed sub-tasks
+- Most importantly: Does the execution result match what the graph description promised to deliver?
+
+SUCCESS EVALUATION BASED ON INTENTION:
+- If the graph is meant to 'create blog posts' → check if blog content was actually created
+- If the graph is meant to 'send emails' → check if emails were actually sent
+- If the graph is meant to 'analyze data' → check if analysis results were produced
+- If the graph is meant to 'generate reports' → check if reports were generated
+- Technical completion ≠ goal achievement. Focus on whether the USER'S INTENDED OUTCOME was delivered
+
+IMPORTANT: Be HONEST about what actually happened:
+- If the input was invalid/nonsensical, say so directly
+- If the task failed, explain what went wrong in simple terms
+- If errors occurred, focus on what the user needs to know
+- Only claim success if the INTENDED PURPOSE was genuinely accomplished AND produced expected outputs
+- Don't sugar-coat failures or present them as helpful feedback
+- ESPECIALLY: If the graph's main purpose wasn't achieved, this is a failure regardless of 'completed' status
+
+Understanding Errors:
+- Node errors: Individual steps may fail but the overall task might still complete (e.g., one data source fails but others work)
+- Graph error (in overall_status.graph_error): This means the entire execution failed and nothing was accomplished
+- Missing outputs from critical blocks: Even if no errors, this means the task failed to produce expected results
+- Focus on whether the graph's intended purpose was fulfilled, not whether technical steps completed"""
+
+# Default user prompt template for activity status generation
+DEFAULT_USER_PROMPT = """A user ran '{{GRAPH_NAME}}' to accomplish something. Based on this execution data,
+provide both an activity summary and correctness assessment:
+
+{{EXECUTION_DATA}}
+
+ANALYSIS CHECKLIST:
+1. READ graph_info.description FIRST - this tells you what the user intended to accomplish
+2. Check overall_status.graph_error - if present, the entire execution failed
+3. Look for nodes with 'Output', 'Post', 'Create', 'Send', 'Publish', 'Generate' in their block_name
+4. Check if these critical blocks have empty recent_outputs arrays - this indicates failure
+5. Look for AgentExecutorBlock (sub-agents) with no outputs - this suggests sub-task failures
+6. Count how many nodes produced outputs vs total nodes - low ratio suggests problems
+7. MOST IMPORTANT: Does the execution outcome match what graph_info.description promised?
+
+INTENTION-BASED EVALUATION:
+- If description mentions 'blog writing' → did it create blog content?
+- If description mentions 'email automation' → were emails actually sent?
+- If description mentions 'data analysis' → were analysis results produced?
+- If description mentions 'content generation' → was content actually generated?
+- If description mentions 'social media posting' → were posts actually made?
+- Match the outputs to the stated intention, not just technical completion
+
+PROVIDE:
+activity_status: 1-3 sentences about what the user accomplished, such as:
+- 'I analyzed your resume and provided detailed feedback for the IT industry.'
+- 'I couldn't complete the task because critical steps failed to produce any results.'
+- 'I failed to generate the content you requested due to missing API access.'
+- 'I extracted key information from your documents and organized it into a summary.'
+- 'The task failed because the blog post creation step didn't produce any output.'
+
+correctness_score: A float score from 0.0 to 1.0 based on how well the intended purpose was achieved:
+- 0.0-0.2: Failure (didn't meet requirements)
+- 0.2-0.4: Poor (major issues, minimal achievement)
+- 0.4-0.6: Partial Success (some objectives met with gaps)
+- 0.6-0.8: Mostly Successful (largely achieved with minor flaws)
+- 0.8-1.0: Success (fully met or exceeded requirements)
+
+BE CRITICAL: If the graph's intended purpose (from description) wasn't achieved, use a low score (0.0-0.4) even if status is 'completed'."""
+
+
 class ErrorInfo(TypedDict):
     """Type definition for error information."""
 
@@ -93,6 +188,9 @@ async def generate_activity_status_for_execution(
     execution_status: ExecutionStatus | None = None,
     model_name: str = "gpt-4o-mini",
     skip_feature_flag: bool = False,
+    system_prompt: str = DEFAULT_SYSTEM_PROMPT,
+    user_prompt: str = DEFAULT_USER_PROMPT,
+    skip_existing: bool = True,
 ) -> ActivityStatusResponse | None:
     """
     Generate an AI-based activity status summary and correctness assessment for a graph execution.
@@ -108,10 +206,15 @@ async def generate_activity_status_for_execution(
         db_client: Database client for fetching data
         user_id: User ID for LaunchDarkly feature flag evaluation
         execution_status: The overall execution status (COMPLETED, FAILED, TERMINATED)
+        model_name: AI model to use for generation (default: gpt-4o-mini)
+        skip_feature_flag: Whether to skip LaunchDarkly feature flag check
+        system_prompt: Custom system prompt template (default: DEFAULT_SYSTEM_PROMPT)
+        user_prompt: Custom user prompt template with placeholders (default: DEFAULT_USER_PROMPT)
+        skip_existing: Whether to skip if activity_status and correctness_score already exist
 
     Returns:
-        AI-generated activity status response with activity_status and correctness_status,
-        or None if feature is disabled
+        AI-generated activity status response with activity_status and correctness_score,
+        or None if feature is disabled or skipped
     """
     # Check LaunchDarkly feature flag for AI activity status generation with full context support
     if not skip_feature_flag and not await is_feature_enabled(
@@ -120,6 +223,20 @@ async def generate_activity_status_for_execution(
         logger.debug("AI activity status generation is disabled via LaunchDarkly")
         return None
 
+    # Check if we should skip existing data (for admin regeneration option)
+    if (
+        skip_existing
+        and execution_stats.activity_status
+        and execution_stats.correctness_score is not None
+    ):
+        logger.debug(
+            f"Skipping activity status generation for {graph_exec_id}: already exists"
+        )
+        return {
+            "activity_status": execution_stats.activity_status,
+            "correctness_score": execution_stats.correctness_score,
+        }
+
     # Check if we have OpenAI API key
     try:
         settings = Settings()
@@ -157,94 +274,23 @@ async def generate_activity_status_for_execution(
         execution_status,
     )
 
+    # Prepare execution data as JSON for template substitution
+    execution_data_json = json.dumps(execution_data, indent=2)
+
+    # Perform template substitution for user prompt
+    user_prompt_content = user_prompt.replace("{{GRAPH_NAME}}", graph_name).replace(
+        "{{EXECUTION_DATA}}", execution_data_json
+    )
+
     # Prepare prompt for AI with structured output requirements
     prompt = [
         {
             "role": "system",
-            "content": (
-                "You are an AI assistant analyzing what an agent execution accomplished and whether it worked correctly. "
-                "You need to provide both a user-friendly summary AND a correctness assessment.\n\n"
-                "FOR THE ACTIVITY STATUS:\n"
-                "- Write from the user's perspective about what they accomplished, NOT about technical execution details\n"
-                "- Focus on the ACTUAL TASK the user wanted done, not the internal workflow steps\n"
-                "- Avoid technical terms like 'workflow', 'execution', 'components', 'nodes', 'processing', etc.\n"
-                "- Keep it to 3 sentences maximum. Be conversational and human-friendly\n\n"
-                "FOR THE CORRECTNESS SCORE:\n"
-                "- Provide a score from 0.0 to 1.0 indicating how well the execution achieved its intended purpose\n"
-                "- Use this scoring guide:\n"
-                "  0.0-0.2: Failure - The result clearly did not meet the task requirements\n"
-                "  0.2-0.4: Poor - Major issues; only small parts of the goal were achieved\n"
-                "  0.4-0.6: Partial Success - Some objectives met, but with noticeable gaps or inaccuracies\n"
-                "  0.6-0.8: Mostly Successful - Largely achieved the intended outcome, with minor flaws\n"
-                "  0.8-1.0: Success - Fully met or exceeded the task requirements\n"
-                "- Base the score on actual outputs produced, not just technical completion\n\n"
-                "UNDERSTAND THE INTENDED PURPOSE:\n"
-                "- FIRST: Read the graph description carefully to understand what the user wanted to accomplish\n"
-                "- The graph name and description tell you the main goal/intention of this automation\n"
-                "- Use this intended purpose as your PRIMARY criteria for success/failure evaluation\n"
-                "- Ask yourself: 'Did this execution actually accomplish what the graph was designed to do?'\n\n"
-                "CRITICAL OUTPUT ANALYSIS:\n"
-                "- Check if blocks that should produce user-facing results actually produced outputs\n"
-                "- Blocks with names containing 'Output', 'Post', 'Create', 'Send', 'Publish', 'Generate' are usually meant to produce final results\n"
-                "- If these critical blocks have NO outputs (empty recent_outputs), the task likely FAILED even if status shows 'completed'\n"
-                "- Sub-agents (AgentExecutorBlock) that produce no outputs usually indicate failed sub-tasks\n"
-                "- Most importantly: Does the execution result match what the graph description promised to deliver?\n\n"
-                "SUCCESS EVALUATION BASED ON INTENTION:\n"
-                "- If the graph is meant to 'create blog posts' → check if blog content was actually created\n"
-                "- If the graph is meant to 'send emails' → check if emails were actually sent\n"
-                "- If the graph is meant to 'analyze data' → check if analysis results were produced\n"
-                "- If the graph is meant to 'generate reports' → check if reports were generated\n"
-                "- Technical completion ≠ goal achievement. Focus on whether the USER'S INTENDED OUTCOME was delivered\n\n"
-                "IMPORTANT: Be HONEST about what actually happened:\n"
-                "- If the input was invalid/nonsensical, say so directly\n"
-                "- If the task failed, explain what went wrong in simple terms\n"
-                "- If errors occurred, focus on what the user needs to know\n"
-                "- Only claim success if the INTENDED PURPOSE was genuinely accomplished AND produced expected outputs\n"
-                "- Don't sugar-coat failures or present them as helpful feedback\n"
-                "- ESPECIALLY: If the graph's main purpose wasn't achieved, this is a failure regardless of 'completed' status\n\n"
-                "Understanding Errors:\n"
-                "- Node errors: Individual steps may fail but the overall task might still complete (e.g., one data source fails but others work)\n"
-                "- Graph error (in overall_status.graph_error): This means the entire execution failed and nothing was accomplished\n"
-                "- Missing outputs from critical blocks: Even if no errors, this means the task failed to produce expected results\n"
-                "- Focus on whether the graph's intended purpose was fulfilled, not whether technical steps completed"
-            ),
+            "content": system_prompt,
         },
         {
             "role": "user",
-            "content": (
-                f"A user ran '{graph_name}' to accomplish something. Based on this execution data, "
-                f"provide both an activity summary and correctness assessment:\n\n"
-                f"{json.dumps(execution_data, indent=2)}\n\n"
-                "ANALYSIS CHECKLIST:\n"
-                "1. READ graph_info.description FIRST - this tells you what the user intended to accomplish\n"
-                "2. Check overall_status.graph_error - if present, the entire execution failed\n"
-                "3. Look for nodes with 'Output', 'Post', 'Create', 'Send', 'Publish', 'Generate' in their block_name\n"
-                "4. Check if these critical blocks have empty recent_outputs arrays - this indicates failure\n"
-                "5. Look for AgentExecutorBlock (sub-agents) with no outputs - this suggests sub-task failures\n"
-                "6. Count how many nodes produced outputs vs total nodes - low ratio suggests problems\n"
-                "7. MOST IMPORTANT: Does the execution outcome match what graph_info.description promised?\n\n"
-                "INTENTION-BASED EVALUATION:\n"
-                "- If description mentions 'blog writing' → did it create blog content?\n"
-                "- If description mentions 'email automation' → were emails actually sent?\n"
-                "- If description mentions 'data analysis' → were analysis results produced?\n"
-                "- If description mentions 'content generation' → was content actually generated?\n"
-                "- If description mentions 'social media posting' → were posts actually made?\n"
-                "- Match the outputs to the stated intention, not just technical completion\n\n"
-                "PROVIDE:\n"
-                "activity_status: 1-3 sentences about what the user accomplished, such as:\n"
-                "- 'I analyzed your resume and provided detailed feedback for the IT industry.'\n"
-                "- 'I couldn't complete the task because critical steps failed to produce any results.'\n"
-                "- 'I failed to generate the content you requested due to missing API access.'\n"
-                "- 'I extracted key information from your documents and organized it into a summary.'\n"
-                "- 'The task failed because the blog post creation step didn't produce any output.'\n\n"
-                "correctness_score: A float score from 0.0 to 1.0 based on how well the intended purpose was achieved:\n"
-                "- 0.0-0.2: Failure (didn't meet requirements)\n"
-                "- 0.2-0.4: Poor (major issues, minimal achievement)\n"
-                "- 0.4-0.6: Partial Success (some objectives met with gaps)\n"
-                "- 0.6-0.8: Mostly Successful (largely achieved with minor flaws)\n"
-                "- 0.8-1.0: Success (fully met or exceeded requirements)\n\n"
-                "BE CRITICAL: If the graph's intended purpose (from description) wasn't achieved, use a low score (0.0-0.4) even if status is 'completed'."
-            ),
+            "content": user_prompt_content,
         },
     ]
 
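
Note on the template substitution in the last hunk: the user prompt is rendered with plain `str.replace` on the `{{GRAPH_NAME}}` and `{{EXECUTION_DATA}}` placeholders. The sketch below reproduces that step in isolation so it can be tried outside the function. `render_user_prompt` and the shortened `USER_PROMPT_TEMPLATE` are hypothetical names introduced only for illustration; they are not part of this diff.

```python
import json

# Hypothetical, shortened stand-in for DEFAULT_USER_PROMPT (illustration only).
USER_PROMPT_TEMPLATE = (
    "A user ran '{{GRAPH_NAME}}' to accomplish something.\n\n"
    "{{EXECUTION_DATA}}"
)


def render_user_prompt(template: str, graph_name: str, execution_data: dict) -> str:
    """Fill both placeholders with literal string replacement, as the diff does."""
    # Serialize the execution data first so it can be dropped in as text.
    execution_data_json = json.dumps(execution_data, indent=2)
    return template.replace("{{GRAPH_NAME}}", graph_name).replace(
        "{{EXECUTION_DATA}}", execution_data_json
    )


# Example input is made up for the sketch.
rendered = render_user_prompt(
    USER_PROMPT_TEMPLATE,
    "Blog Writer",
    {"graph_info": {"description": "Writes blog posts"}},
)
print(rendered)
```

Literal replacement seems a reasonable fit for these templates: `str.format` would turn `{{GRAPH_NAME}}` into the literal text `{GRAPH_NAME}` rather than substituting it. One caveat of chaining the two calls is that a graph name which itself contains `{{EXECUTION_DATA}}` would be expanded by the second `replace`.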