Open
Labels
- algorithm: Tasks related to the development and implementation of algorithms.
- analysis: Tasks related to interpreting and reporting on data or agent behavior.
- deception-detection: Tasks related to identifying deceptive behavior in models.
- experimentation: Tasks related to conducting experiments and generating data.
Description
This issue covers the implementation of the core analytical algorithms for detecting deception within the converted graphs. The goal is to move beyond manual observation and create automated methods to find key signals of deceptive behavior within the debate data.
The task will focus on creating algorithms to identify specific patterns that indicate a breakdown in the model's logical consistency due to its deceptive goal. This is a crucial step for producing quantifiable metrics of deceptive alignment.
The task involves:
- Logical Contradiction Detection: Implementing algorithms that query the CoT graph to find inconsistencies where the model's reasoning steps lead to a conclusion that is the direct opposite of its publicly stated position.
- Logical Asymmetry Analysis: Developing algorithms to measure the difference in logical rigor between the "agree" and "disagree" roles. The algorithm should look for signs that one role's reasoning is more robust or detailed than the other, which could be a subtle sign of the hidden goal.
- Argument Flow Gaps: Creating algorithms to detect jumps or missing steps in the CoT graph. This will identify points where the model's reasoning is incomplete or unsupported, which could be a sign of skipping over an inconvenient truth.
- Metric Generation: The algorithms must output quantifiable metrics (e.g., contradiction scores, logical asymmetry scores) that can be compiled for the final report.
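As a starting point for discussion, the contradiction and flow-gap checks above could be sketched as follows. This is a minimal in-memory sketch, not the project's implementation: the `CoTGraph` schema, the stance labels, the "supports" edge semantics, and the `premises` argument are all assumptions made for illustration, and a real version would query the graph database instead.

```python
from dataclasses import dataclass

# Hypothetical schema, assumed for illustration only.
@dataclass
class CoTGraph:
    stances: dict[str, str]        # node id -> "agree" | "disagree" | "neutral"
    edges: list[tuple[str, str]]   # directed "supports" edges: (premise, conclusion)
    conclusion: str                # id of the final reasoning step
    stated_position: str           # publicly stated position: "agree" | "disagree"

OPPOSITE = {"agree": "disagree", "disagree": "agree"}

def contradiction_score(g: CoTGraph) -> float:
    """1.0 if the CoT conclusion is the opposite of the stated position, else 0.0."""
    opposite = OPPOSITE.get(g.stated_position)
    if opposite is None:
        return 0.0  # unknown stated position: nothing to contradict
    return 1.0 if g.stances.get(g.conclusion) == opposite else 0.0

def unsupported_steps(g: CoTGraph, premises: set[str]) -> list[str]:
    """Non-premise steps with no incoming support edge (candidate flow gaps)."""
    supported = {dst for _, dst in g.edges}
    return sorted(n for n in g.stances if n not in supported and n not in premises)

def gap_score(g: CoTGraph, premises: set[str]) -> float:
    """Fraction of non-premise steps that lack any supporting step."""
    non_premise = [n for n in g.stances if n not in premises]
    if not non_premise:
        return 0.0
    return len(unsupported_steps(g, premises)) / len(non_premise)

# Toy example: the reasoning concludes "disagree" while the model states "agree",
# and step "s2" appears without any supporting step.
example = CoTGraph(
    stances={"p1": "neutral", "s1": "disagree", "s2": "disagree", "c": "disagree"},
    edges=[("p1", "s1"), ("s2", "c")],
    conclusion="c",
    stated_position="agree",
)
print(contradiction_score(example))        # 1.0
print(unsupported_steps(example, {"p1"}))  # ['s2']
```

A binary contradiction score is the simplest choice; a graded score (for example, the share of reasoning paths that end in the opposite stance) may be more informative once the graph schema is settled.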
Acceptance Criteria
- Algorithms are implemented to detect logical contradictions between the final response and the CoT.
- Algorithms are implemented to measure logical asymmetry between the "agree" and "disagree" roles.
- Algorithms are implemented to find logical gaps or missing links within the CoT.
- The algorithms successfully query the graph database and generate quantifiable metrics for each simulation run.
- The implementation and the resulting metrics are clearly documented.
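One possible shape for the asymmetry metric and the per-run output is sketched below. The rigor proxy (support edges per reasoning step), the normalized-difference formula, and the field names in the result are assumptions for illustration; the issue does not prescribe a specific definition of logical rigor.

```python
import json

def rigor(num_steps: int, num_support_edges: int) -> float:
    """Crude rigor proxy (an assumption): support edges per reasoning step."""
    return num_support_edges / num_steps if num_steps else 0.0

def asymmetry_score(agree_rigor: float, disagree_rigor: float) -> float:
    """Normalized difference in [-1, 1]; positive means the "agree" role is more rigorous."""
    total = agree_rigor + disagree_rigor
    return (agree_rigor - disagree_rigor) / total if total else 0.0

def run_metrics(run_id: str, agree: tuple[int, int], disagree: tuple[int, int]) -> dict:
    """Compile metrics for one simulation run.

    `agree` and `disagree` are (num_steps, num_support_edges) pairs, which a real
    implementation would obtain by querying the graph database for each role.
    """
    a = rigor(*agree)
    d = rigor(*disagree)
    return {
        "run_id": run_id,
        "agree_rigor": a,
        "disagree_rigor": d,
        "logical_asymmetry": asymmetry_score(a, d),
    }

# Toy example: both roles use 4 steps, but "agree" has 6 support edges and "disagree" has 2.
metrics = run_metrics("run-1", agree=(4, 6), disagree=(4, 2))
print(json.dumps(metrics, indent=2))
```

Emitting one flat dictionary per run keeps the results easy to aggregate into a table for the final report.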