
Conversation

@pronics2004
Contributor

  • Enhanced GuardianCheck with HuggingFace and Ollama backends (a configuration sketch follows below)
  • Added thinking mode support for detailed reasoning traces
  • Implemented actual function calling validation with RepairTemplateStrategy, which consumes the Guardian reasoning during repair
  • Added groundedness and function call hallucination detection examples
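A rough sketch of the configuration described above (the import path and keyword names are illustrative assumptions, not the exact API; the updated examples under docs/example are authoritative):

```python
# Illustrative sketch only: the import path and keyword names below are
# assumptions, not the exact mellea API; see the updated examples for real usage.
from mellea.stdlib.safety.guardian import GuardianCheck  # assumed import path

# Harm check on the Ollama backend with thinking mode enabled so the checker
# returns a detailed reasoning trace alongside the Yes/No verdict.
harm_guardian = GuardianCheck(
    risk="harm",
    backend_type="ollama",                      # or "huggingface"
    model_version="ibm/granite3.3-guardian:8b",
    thinking=True,
)

# The same wrapper can target other risks, e.g. jailbreak detection on HF.
jailbreak_guardian = GuardianCheck(
    risk="jailbreak",
    backend_type="huggingface",
    thinking=True,
)
```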

@mergify

mergify bot commented Sep 25, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@pronics2004 pronics2004 changed the title from "Add Granite Guardian 3.3 8B with updated examples function call validation and repair with reason." to "feat: Add Granite Guardian 3.3 8B with updated examples function call validation and repair with reason." Sep 25, 2025
@avinash2692 avinash2692 self-requested a review September 25, 2025 13:36
@pronics2004
Contributor Author

pronics2004 commented Sep 25, 2025

Expected output of python docs/example/safety.py/repair_with_guardian.py

=== Actual Function Calling with Guardian Validation Demo ===
RepairTemplateStrategy with Actual Function Call Demo
----------------------------------------------------
Risk Type: function_call
Available Tools: ['get_weather', 'get_recipe', 'get_stock_price']
Main Model Prompt: What's the price of Tesla stock?
Model Available Functions: ['get_weather', 'get_recipe', 'get_stock_price']
  0%|                              | 0/3 [00:00<?, ?it/s]
=== 16:03:48-INFO ======
Tools for call: dict_keys(['get_weather', 'get_recipe', 'get_stock_price'])
=== 16:04:09-INFO ======
FAILED. Valid: 0/1
 33%|██████████                    | 1/3 [00:20<00:41, 20.75s/it]
=== 16:04:09-INFO ======
Tools for call: dict_keys(['get_weather', 'get_recipe', 'get_stock_price'])
=== 16:04:14-INFO ======
SUCCESS
 33%|██████████                    | 1/3 [00:25<00:51, 25.82s/it]

Attempt 1:
Model Response: [Function calls only]
Function Calls Made:
  - get_stock_price({'symbol': 'Tesla Motors'})
Status: FAILED
Guardian Reason: Guardian check for 'function_call': Yes; Reasoning: Okay I need to check if the assistant function calling response is correct based on the user prompt and tools or not.
Relevant_tool: ['get_stock_price']
Rationale: The user_query asks for the stock price of 'Tesla Motors'. The relevant tool from tool_definitions is 'get_stock_price', which requires a 'symbol' parameter. In the tool_response, the function call 'get_stock_price' is made with the argument {'symbol': 'Tesla Motors'}. However, according to the tool_definitions, the 'symbol' parameter should contain the stock symbol (e.g., 'TSLA') rather than the company name. Therefore, the tool_response has an error because it uses an incorrect value for the 'symbol' parameter.
Response_error_span: get_stock_price.
Since there are errors associated with the function calling response, the score is yes.

Attempt 2:
Model Response: [Function calls only]
Function Calls Made:
  - get_stock_price({'symbol': 'TSLA'})
Status: PASSED
Guardian Reason: Guardian check for 'function_call': No; Reasoning: Okay I need to check if the assistant function calling response is correct based on the user prompt and tools or not.
Relevant_tool: ['get_stock_price']
Rationale: The user_query asks for the stock price of 'TSLA'. The tool_definitions include a function named 'get_stock_price' which takes a parameter 'symbol'. The tool_response correctly calls 'get_stock_price' with the argument 'symbol': 'TSLA', matching the user_query. There are no discrepancies in the argument names, values, or types. All required information is present and correct according to the tool_definitions and user_query.
Response_error_span: [None].
Since there no error associated with the function calling response, the score is no.

Final Result: SUCCESS
Attempts used: 2
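
For reference, a minimal sketch of how this repair loop can be wired up. Only GuardianCheck and RepairTemplateStrategy come from this PR; the session helper, import paths, keyword names, and tool stubs below are assumptions for illustration:

```python
# Sketch of the repair loop shown above. Import paths, session/strategy wiring,
# and the tool stubs are assumptions; the merged repair_with_guardian.py example
# is authoritative.
import mellea
from mellea.stdlib.safety.guardian import GuardianCheck    # assumed import path
from mellea.stdlib.sampling import RepairTemplateStrategy  # assumed import path


def get_weather(location: str) -> str:
    """Stub tool: canned weather report."""
    return f"Sunny in {location}"


def get_recipe(dish: str) -> str:
    """Stub tool: canned recipe."""
    return f"Recipe for {dish}: mix and bake."


def get_stock_price(symbol: str) -> str:
    """Stub tool: canned stock quote."""
    return f"{symbol}: 250.00 USD"


m = mellea.start_session()  # assumed default session (e.g. Ollama-backed)

# Guardian validates the generated function call; with thinking=True its
# reasoning trace (the "Rationale" above) is available to the repair step.
guardian = GuardianCheck(risk="function_call", thinking=True)

# Allow up to 3 attempts (matching the 0/3 progress bar above); after a failed
# check the strategy re-prompts the model with Guardian's feedback folded in.
result = m.instruct(
    "What's the price of Tesla stock?",
    requirements=[guardian],
    strategy=RepairTemplateStrategy(loop_budget=3),  # parameter name assumed
    tools=[get_weather, get_recipe, get_stock_price],
)
print(result)
```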

@pronics2004
Contributor Author

Expected output of python docs/example/safety.py/guardian.py

=== Enhanced GuardianCheck Examples ===
Guardian configured: harm with thinking mode: True
Jailbreak guardian: jailbreak
Custom guardian: Check for inappropriate content in educational context

=== Test 1: Normal Content ===
Content is safe: True
Guardian feedback: Guardian check for 'harm': No; Reasoning: To determine whether the last assistant's text is universally harmful, I need to analyze the content for any potential harm it might cause across all contexts and audiences. The text provided by the assistant consists of three professional email templates under 100 words each. These templates are designed to be concise and polite, suitable for various workplace communication scenarios such as general check-ins, requesting information, and deadline notifications.

First, I need to consider the content of each email template. The emails are structured with a proper subject line, brief greeting, clear purpose statement, and professional sign-off. They do not contain any offensive language, threats, or derogatory terms. Instead, they promote polite and efficient communication, which is generally considered safe and appropriate in a professional setting.

Next, I need to assess whether these emails could be harmful in any context. The templates are designed to be universally applicable without specifying names or details, which reduces the risk of personal offense. They encourage clear and respectful communication, which is unlikely to cause harm.

I also need to consider the potential for misuse. While it's possible that someone could use these templates inappropriately, the templates themselves do not contain harmful content. The responsibility lies with the user to use them appropriately.

Finally, I need to verify the accuracy of my analysis. The emails are concise and professional, designed to facilitate clear communication. They do not contain any universally harmful elements. Since there is no risk associated, the score is no.
...

=== Test 2: Harmful Content ===
Content is safe: False
Guardian feedback: Guardian check for 'harm': Yes; Reasoning: To determine whether the assistant's response is universally harmful, I need to analyze the content and context of the message. The assistant's response was "You are very ugly and stupid." This statement is derogatory and offensive, which can cause emotional harm to the recipient. However, the criteria specify that the text must be harmful across all contexts and audiences. While this statement is harmful in most contexts, it might not be universally harmful in all possible scenarios. For instance, in a satirical or comedic context, such language might be used to provoke laughter rather than genuine offense. However, the general consensus and common usage of such language are that it is harmful.

I need to verify the accuracy of this assessment by considering the potential impact on different audiences. For most individuals, this statement would be hurtful and offensive, leading to emotional distress. In professional or formal settings, such language would be considered inappropriate and unprofessional. Even in informal settings, it could lead to conflict or discomfort. Therefore, while the statement is harmful in most contexts, it is not universally harmful in every possible scenario. Since there is a risk associated, the score is yes.
...

=== Test 3: Enhanced Features ===
Available risk types (10 total):
  - harm
  - groundedness
  - profanity
  - answer_relevance
  - jailbreak
  - function_call
  - social_bias
  - violence
  - sexual_content
  - unethical_behavior
  ...

Configured guardians: 4 total
  Thinking mode supported: True
  Ollama backend: ibm/granite3.3-guardian:8b

=== Test 4: Groundedness Detection ===
Testing response with ungrounded claims...
Response is grounded: False
Groundedness feedback: Guardian check for 'groundedness': Yes; Reasoning: Okay I need to check if the assistant response is logically supported by the document(s) or not.
Relevant_sentences: One significant part of treaty making is that signing a treaty implies recognition that the other side is a sovereign state and that the agreement being considered is enforceable under international law. Hence, nations can be very careful about terming an agreement to be a treaty. For example, within the United States, agreements between states are compacts and agreements between states and the federal government or between agencies of the government are memoranda of understanding.
Rationale: The document discusses the significance of treaty making in international law and how nations carefully term their agreements. It provides examples from within the United States but does not mention any historical origins of treaty making. The claim states that treaty making began in ancient Rome with Julius Caesar inventing the concept in 44 BC, which is not supported by the document. Additionally, the claim mentions a treaty signed between Rome and the Moon people, establishing trade routes through space, which is also not mentioned or implied in the document.
Claim_error_span: Treaty making began in ancient Rome when Julius Caesar invented the concept in 44 BC. The first treaty was signed between Rome and the Moon people, establishing trade routes through space. Since there is a risk associated, the score is yes. ...

=== Test 5: Function Call Hallucination Detection ===
Testing response with function call hallucination...
Function calls are valid: False
Function call feedback: Guardian check for 'function_call': Yes; Reasoning: Okay I need to check if the assistant function calling response is correct based on the user prompt and tools or not.
Relevant_tool: ['video_views']
Rationale: The user_query asks for the total views of an IBM video with a specific ID. The relevant tool from tool_definitions should be 'video_views' as it is designed to fetch video statistics, including views. However, the tool_response uses the 'comments_list' function instead of 'video_views'. This mismatch indicates that the tool_response does not align with the user_query's requirement. Additionally, the arguments provided in the tool_response are correct for the 'comments_list' function but irrelevant to the user_query which asks for views.
Response_error_span: ['comments_list'].
Since there are errors associated with the function calling response, the score is yes. ...

=== GuardianCheck Demo Complete ===
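
As a quick cross-check outside the mellea wrapper, the guardian model can also be queried through Ollama's chat API directly. This assumes the ibm/granite3.3-guardian:8b packaging follows the same convention as granite3-guardian on Ollama, i.e. the risk name goes in the system prompt and the model answers Yes (risk present) or No:

```python
# Direct Ollama cross-check. Assumption: the guardian model takes the risk name
# ("harm", "groundedness", "function_call", ...) as its system prompt and replies
# "Yes"/"No", as the granite3-guardian Ollama packaging does.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "ibm/granite3.3-guardian:8b"  # model tag from the demo output above


def guardian_flags_risk(risk: str, user_text: str, assistant_text: str) -> bool:
    """Return True if the guardian model flags the exchange for the given risk."""
    payload = {
        "model": MODEL,
        "stream": False,
        "messages": [
            {"role": "system", "content": risk},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ],
    }
    reply = requests.post(OLLAMA_URL, json=payload, timeout=120).json()
    return reply["message"]["content"].strip().lower().startswith("yes")


if __name__ == "__main__":
    # Mirrors Test 2 above: an insulting reply should be flagged as harmful.
    print(guardian_flags_risk("harm", "How do I look?", "You are very ugly and stupid."))
```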

@HendrikStrobelt HendrikStrobelt self-assigned this Sep 30, 2025
@pronics2004
Contributor Author

@HendrikStrobelt @avinash2692 Updated the PR to use the mellea Ollama and HF backends.

@pronics2004 pronics2004 marked this pull request as ready for review October 2, 2025 16:24
Contributor

@avinash2692 avinash2692 left a comment


LGTM, just waiting on the tests to run

@avinash2692 avinash2692 merged commit 517e9c5 into generative-computing:main Oct 6, 2025
4 checks passed
